Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis.

Xu, Pusheng; Chen, Xiaolan; Zhao, Ziwei; Shi, Danli

Affiliation

Xu P; School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China.
Chen X; School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China.
Zhao Z; School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China.
Shi D; School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; danli.shi@polyu.edu.hk.

Br J Ophthalmol ; 108(10): 1384-1389, 2024 Sep 20.

Article in English | MEDLINE | ID: mdl-38789133

ABSTRACT

PURPOSE:

To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images.

METHODS:

We developed a digital ophthalmologist app using GPT-4V and evaluated its performance on a dataset of 60 images covering 60 ophthalmic conditions and 6 modalities: slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound. The chatbot was tested with 10 open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. Responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Automated evaluation used sentence similarity and GPT-4-based scoring.
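As an illustration of how agreement between automated scores and manual ratings might be quantified, the following is a minimal sketch of a Spearman rank correlation in plain Python. It assumes the correlation is computed between one automated score and one manual rating per response, which the abstract does not spell out; the function names and the sample scores are hypothetical and are not data from the study.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical scores for five responses (illustrative only).
human_scores = [1, 2, 2, 3, 1]           # e.g. manual accuracy grades
auto_scores = [0.2, 0.5, 0.7, 0.9, 0.4]  # e.g. automated similarity scores
rho = spearman(human_scores, auto_scores)
print(f"Spearman rho = {rho:.3f}")
```

Ranking before correlating makes the measure depend only on the ordering of the scores, so it suits ordinal grades such as accuracy or usability ratings.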

RESULTS:

Of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were judged to cause no harm. GPT-4V performed best on slit-lamp images, with 42.0%, 38.5% and 68.5% of responses rated accurate, highly usable and no harm, respectively. Its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities but was far less accurate in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V's diagnoses was 63.3% (38/60). The overall sentence similarity between GPT-4V responses and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability.
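The response total and the repeatability figure follow directly from the counts given in the abstract. A short check, using only those quoted numbers (the variable names are illustrative), might look like this:

```python
# Counts quoted in the abstract.
n_images = 60
questions_per_image = 10
repeatable_diagnoses = 38  # numerator of the reported 38/60

# Total responses: every image is queried with every question.
total_responses = n_images * questions_per_image

# Diagnosis repeatability as a percentage of images, to one decimal place.
repeatability_pct = round(100 * repeatable_diagnoses / n_images, 1)

print(total_responses, repeatability_pct)
```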

CONCLUSION:

GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.

Subjects

Benchmarking; Multimodal Imaging; Tomography, Optical Coherence; Humans; Tomography, Optical Coherence/methods; Reproducibility of Results; Fluorescein Angiography/methods; Diagnostic Techniques, Ophthalmological; Eye Diseases/diagnostic imaging; Eye Diseases/diagnosis; Ophthalmoscopy/methods; Mobile Applications; Ophthalmologists

Keywords

Imaging; Retina

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Benchmarking / Tomography, Optical Coherence / Multimodal Imaging Limits: Humans Language: En Journal: Br J Ophthalmol Publication year: 2024 Document type: Article Affiliation country: China Publication country: United Kingdom
