The Comparative Diagnostic Capability of Large Language Models in Otolaryngology.

Warrier, Akshay; Singh, Rohan; Haleem, Afash; Zaki, Haider; Eloy, Jean Anderson

Affiliation

Warrier A; Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
Singh R; Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
Haleem A; Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
Zaki H; Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
Eloy JA; Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.

Laryngoscope; 134(9): 3997-4002, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38563415

ABSTRACT

OBJECTIVES:

Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology.

METHODS:

We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023.

RESULTS:

ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional testing results (e.g., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002).
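
As a rough check on the first comparison above, the following sketch recomputes a chi-squared test from the reported success rates. It assumes an uncorrected Pearson chi-squared test on a 3 x 2 table (model by correct/incorrect) with counts taken directly from the percentages of 100 vignettes; the abstract does not state which variant the authors used, and the function name `chi_squared_3x2` is illustrative only.

```python
import math

# Correct-diagnosis counts out of 100 vignettes, read off the reported success rates.
# Assumption: each percentage maps directly to a count out of 100.
correct = {"ChatGPT-3.5": 89, "Google Bard": 82, "Bing-GPT4": 74}
n_vignettes = 100


def chi_squared_3x2(correct_counts, n):
    """Pearson chi-squared statistic for a (models x correct/incorrect) table.

    Every model is assumed to have answered the same n vignettes.
    """
    observed = [(c, n - c) for c in correct_counts]
    grand_total = n * len(observed)
    col_totals = [sum(row[j] for row in observed) for j in range(2)]
    stat = 0.0
    for row in observed:
        for j in range(2):
            # Expected count under independence: row total * column total / grand total.
            expected = n * col_totals[j] / grand_total
            stat += (row[j] - expected) ** 2 / expected
    return stat


stat = chi_squared_3x2(list(correct.values()), n_vignettes)

# Degrees of freedom = (3 - 1) * (2 - 1) = 2; for 2 degrees of freedom the
# chi-squared survival function is exactly exp(-x / 2), so no stats library is needed.
p_value = math.exp(-stat / 2)

print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

Under these assumptions the statistic is about 7.53 and the p-value rounds to 0.023, in line with the reported value. The second comparison (seven vignettes omitted) is not reproduced here because the abstract does not give the per-model counts for that subset.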

CONCLUSIONS:

ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE: 3. Laryngoscope, 134:3997-4002, 2024.

Subjects

Otolaryngology; Humans; Otolaryngology/methods; Language

Keywords

Bing AI; ChatGPT; Google Bard; artificial intelligence; diagnostic accuracy; large language models; otolaryngology

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Otolaryngology Limit: Humans Language: En Journal: Laryngoscope Journal subject: OTOLARYNGOLOGY Year of publication: 2024 Document type: Article Country of affiliation: United States
