Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases.

Sonoda, Yuki; Kurokawa, Ryo; Nakamura, Yuta; Kanzawa, Jun; Kurokawa, Mariko; Ohizumi, Yuji; Gonoi, Wataru; Abe, Osamu

Sonoda, Yuki; Kurokawa, Ryo; Nakamura, Yuta; Kanzawa, Jun; Kurokawa, Mariko; Ohizumi, Yuji; Gonoi, Wataru; Abe, Osamu.

Afiliación

Sonoda Y; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Kurokawa R; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Nakamura Y; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Kanzawa J; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Kurokawa M; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Ohizumi Y; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
Gonoi W; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan. gonoiw@gmail.com.
Abe O; Department of Radiology, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.

Jpn J Radiol ; 2024 Jul 01.

Article en En | MEDLINE | ID: mdl-38954192

ABSTRACT

ABSTRACT

PURPOSE:

Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly diagnostic quiz series for radiology experts. MATERIALS AND

METHODS:

Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, using their respective application programming interfaces. A comparative analysis of diagnostic performance among these three LLMs was conducted using Cochrane's Q and post hoc McNemar's tests.

RESULTS:

The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for primary diagnosis were 41.0%, 54.0%, and 33.9%, which further improved to 49.4%, 62.0%, and 41.0%, when considering the accuracy of any of the top three differential diagnoses. Significant differences in the diagnostic performance were observed among all pairs of models.

CONCLUSION:

Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and worded descriptions of imaging findings.

Palabras clave

Artificial intelligence; ChatGPT; Claude 3 opus; GPT-4o; Gemini 1.5 pro; Large language model

Texto completo

Añadir a Mi BVS

Imprimir

XML

PubMed Links

Buscar en Google

Texto completo: 1 Colección: 01-internacional Base de datos: MEDLINE Idioma: En Revista: Jpn J Radiol Asunto de la revista: DIAGNOSTICO POR IMAGEM / RADIOLOGIA / RADIOTERAPIA Año: 2024 Tipo del documento: Article País de afiliación: Japón

Texto completo

Añadir a Mi BVS

Imprimir

XML

PubMed Links

Buscar en Google