Results 1 - 4 of 4
1.
Cureus ; 16(8): e67306, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39301343

ABSTRACT

INTRODUCTION: This study evaluates the diagnostic performance of the latest large language models (LLMs), GPT-4o (OpenAI, San Francisco, CA, USA) and Claude 3 Opus (Anthropic, San Francisco, CA, USA), in determining causes of death from medical histories and postmortem CT findings. METHODS: We included 100 adult cases in which the cause of death could be diagnosed from postmortem CT, using autopsy results as the gold standard. Their medical histories and postmortem CT findings were compiled, and the clinical and imaging diagnoses of both the underlying and immediate causes of death, as well as personal information, were carefully removed from the material shown to the LLMs. Both GPT-4o and Claude 3 Opus generated the top three differential diagnoses for the underlying and immediate causes of death under three prompt conditions: 1) medical history only; 2) postmortem CT findings only; and 3) both medical history and postmortem CT findings. The diagnostic performance of the LLMs was compared using McNemar's test. RESULTS: For the underlying cause of death, GPT-4o achieved primary diagnostic accuracy rates of 78%, 72%, and 78%, while Claude 3 Opus achieved 72%, 56%, and 75% for prompts 1, 2, and 3, respectively. Counting any of the top three differential diagnoses, GPT-4o's accuracy rates were 92%, 90%, and 92%, while Claude 3 Opus's rates were 93%, 69%, and 93% for prompts 1, 2, and 3, respectively. For the immediate cause of death, GPT-4o's primary diagnostic accuracy rates were 55%, 58%, and 62%, while Claude 3 Opus's rates were 60%, 62%, and 63% for prompts 1, 2, and 3, respectively. For any of the top three differential diagnoses, GPT-4o's accuracy rates were 88% for prompt 1 and 91% for prompts 2 and 3, whereas Claude 3 Opus's rates were 92% for all three prompts. Significant differences between the models were observed for prompt 2 in diagnosing the underlying cause of death (p = 0.03 and p < 0.01 for the primary and top three differential diagnoses, respectively). CONCLUSION: Both GPT-4o and Claude 3 Opus demonstrated relatively high performance in diagnosing both the underlying and immediate causes of death using medical histories and postmortem CT findings.
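Methodologically, the comparison rests on paired binary outcomes: each case yields a correct/incorrect verdict per model, and McNemar's test examines the discordant pairs. Below is a minimal Python sketch of such a comparison; the simulated hit rates loosely echo the prompt-2 figures above, but the data are illustrative, not the study's.

```python
# Minimal sketch of a paired McNemar comparison; outcomes are simulated,
# not the study's actual per-case data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
gpt4o_correct = rng.random(100) < 0.72   # hypothetical per-case hits for GPT-4o
claude_correct = rng.random(100) < 0.56  # hypothetical per-case hits for Claude 3 Opus

# 2x2 table of paired outcomes: rows = GPT-4o correct?, cols = Claude correct?
table = np.array([
    [np.sum(gpt4o_correct & claude_correct),  np.sum(gpt4o_correct & ~claude_correct)],
    [np.sum(~gpt4o_correct & claude_correct), np.sum(~gpt4o_correct & ~claude_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")
```

The exact binomial form is a sensible default here, since the number of discordant pairs in a 100-case sample can be small.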

2.
Jpn J Radiol ; 2024 Aug 03.
Article in English | MEDLINE | ID: mdl-39096483

ABSTRACT

PURPOSE: The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performance on Radiology's "Diagnosis Please" quiz questions. MATERIALS AND METHODS: The AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology's "Diagnosis Please" cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone; (2) Condition 2: submitter-provided clinical history and imaging findings (text); (3) Condition 3: clinical history (text) and key images (PNG files). We applied McNemar's test to evaluate differences in the correct response rates under Conditions 1, 2, and 3 for each model and between the models. RESULTS: The correct primary diagnosis rates for Claude 3 Opus and Claude 3.5 Sonnet were, respectively, 58/322 (18.0%) and 69/322 (21.4%) for Condition 1, 201/322 (62.4%) and 209/322 (64.9%) for Condition 2, and 80/322 (24.8%) and 97/322 (30.1%) for Condition 3. The models provided the correct answer as a differential diagnosis in up to 26/322 cases (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences in correct response rates were observed among all pairwise combinations of Conditions 1, 2, and 3 for each model (p < 0.01). Claude 3.5 Sonnet outperformed Claude 3 Opus under all conditions, but the difference was statistically significant only for Condition 3 (30.1% vs. 24.8%, p = 0.028). CONCLUSION: Both AI models demonstrated significantly improved diagnostic performance when key images were provided alongside the clinical history. The models' ability to identify important differential diagnoses under these conditions was also confirmed.
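For readers curious about what Condition 3 looks like in practice, here is a hedged sketch of submitting a key image plus clinical history through the Anthropic Messages API. The model identifier, prompt wording, and file path are illustrative assumptions; the paper does not publish its exact prompts.

```python
# Sketch of a Condition-3-style query (clinical history text + key image)
# via the Anthropic Messages API. Filename and prompt are hypothetical.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("case_001_key_image.png", "rb") as f:  # hypothetical filename
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # Claude 3.5 Sonnet snapshot from the study window
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Clinical history: ...\n"
                     "List the primary diagnosis and two differential diagnoses."},
        ],
    }],
)
print(message.content[0].text)
```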

3.
Article in English | MEDLINE | ID: mdl-39112556

ABSTRACT

OBJECTIVES: Head and neck squamous cell carcinoma (HNSCC) is a complex malignancy that requires a multidisciplinary tumor board (MDT) approach for individual treatment planning. In recent years, artificial intelligence tools have emerged to assist healthcare professionals in making informed treatment decisions. This study investigates the application of the newly published LLM Claude 3 Opus, compared to the currently most advanced LLM, ChatGPT 4.0, for the diagnosis and therapy planning of primary HNSCC. The results were compared to those of a conventional multidisciplinary tumor board; (2) Materials and Methods: We conducted a study in March 2024 on 50 consecutive primary head and neck cancer cases. The diagnostics and MDT recommendations were compared to the Claude 3 Opus and ChatGPT 4.0 recommendations for each patient and rated by two independent reviewers on the following parameters: clinical recommendation, explanation, and summarization, in addition to the Artificial Intelligence Performance Instrument (AIPI); (3) Results: Claude 3 achieved better scores for the diagnostic workup of patients than ChatGPT 4.0 and provided treatment recommendations involving surgery, chemotherapy, and radiation therapy. In terms of clinical recommendations, explanation, and summarization, Claude 3 scored similarly to ChatGPT 4.0, listing treatment recommendations that were congruent with the MDT, but failed to cite the source of the information; (4) Conclusion: This study is the first analysis of Claude 3 for primary head and neck cancer cases and demonstrates superior performance in the diagnosis of HNSCC compared to ChatGPT 4.0, with similar results for therapy recommendations. This marks the advent of a newly launched advanced AI model that may be superior to ChatGPT 4.0 for the assessment of primary head and neck cancer cases and may assist in the clinical diagnostic and MDT setting.
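As a purely hypothetical illustration of this rating workflow, the sketch below tabulates two reviewers' scores per parameter and averages them per model. The scale and the values are invented; the study reports its own AIPI-based scoring, which is not reproduced here.

```python
# Hypothetical tabulation of two reviewers' ratings per parameter;
# the 1-5 scores below are invented for illustration only.
import pandas as pd

ratings = pd.DataFrame({
    "model":    ["Claude 3", "Claude 3", "ChatGPT 4.0", "ChatGPT 4.0"],
    "reviewer": [1, 2, 1, 2],
    "clinical_recommendation": [4, 5, 4, 4],
    "explanation":             [5, 4, 5, 4],
    "summarization":           [4, 4, 4, 5],
})

# Average the two reviewers' scores per model and parameter.
summary = ratings.groupby("model").mean(numeric_only=True).drop(columns="reviewer")
print(summary)
```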

4.
Jpn J Radiol ; 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38954192

ABSTRACT

PURPOSE: Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, comprehensive comparisons between LLMs from different manufacturers are lacking. In this study, we tested the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology's "Diagnosis Please" cases, a monthly diagnostic quiz series for radiology experts. MATERIALS AND METHODS: Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology's "Diagnosis Please" cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro through their respective application programming interfaces. Diagnostic performance among the three LLMs was compared using Cochran's Q test and post hoc McNemar's tests. RESULTS: The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, improving to 49.4%, 62.0%, and 41.0% when any of the top three differential diagnoses was counted as correct. Significant differences in diagnostic performance were observed among all pairs of models. CONCLUSION: Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and well-worded descriptions of imaging findings.
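A minimal sketch of this three-way analysis: Cochran's Q tests whether per-case correctness differs across the three models, with pairwise McNemar's tests as the post hoc step. The outcomes below are simulated to roughly match the reported primary-diagnosis accuracies; they are not the paper's per-case data.

```python
# Cochran's Q across three models with post hoc pairwise McNemar tests;
# per-case correctness is simulated, not taken from the study.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(1)
n = 324
# Hypothetical per-case binary correctness, one column per model.
outcomes = np.column_stack([
    rng.random(n) < 0.410,  # GPT-4o
    rng.random(n) < 0.540,  # Claude 3 Opus
    rng.random(n) < 0.339,  # Gemini 1.5 Pro
]).astype(int)

q = cochrans_q(outcomes)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# Post hoc pairwise McNemar tests (a Bonferroni correction would divide alpha by 3).
labels = ["GPT-4o", "Claude 3 Opus", "Gemini 1.5 Pro"]
for i in range(3):
    for j in range(i + 1, 3):
        a, b = outcomes[:, i], outcomes[:, j]
        table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                 [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
        p = mcnemar(table, exact=True).pvalue
        print(f"{labels[i]} vs {labels[j]}: p = {p:.4f}")
```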
