AI in the ED: Assessing the efficacy of GPT models vs. physicians in medical score calculation.
Haim, Gal Ben; Braun, Adi; Eden, Haggai; Burshtein, Livnat; Barash, Yiftach; Irony, Avinoah; Klang, Eyal.
Affiliation
  • Haim GB; Department of Emergency Medicine, Sheba Medical Center, Ramat-Gan, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel. Electronic address: galushbh@gmail.com.
  • Braun A; Department of Emergency Medicine, Sheba Medical Center, Ramat-Gan, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel.
  • Eden H; Department of Emergency Medicine, Sheba Medical Center, Ramat-Gan, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel.
  • Burshtein L; Department of Emergency Medicine, Sheba Medical Center, Ramat-Gan, Israel.
  • Barash Y; Division of Diagnostic Imaging, Sheba Medical Center, Tel Hashomer, Israel; DeepVision Lab, Sheba Medical Center, Tel Hashomer, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel.
  • Irony A; Department of Emergency Medicine, Sheba Medical Center, Ramat-Gan, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel.
  • Klang E; Division of Diagnostic Imaging, Sheba Medical Center, Tel Hashomer, Israel; DeepVision Lab, Sheba Medical Center, Tel Hashomer, Israel; Tel Aviv University, Sackler Faculty of Medicine, Tel Aviv, Israel.
Am J Emerg Med; 79: 161-166, 2024 May.
Article in English | MEDLINE | ID: mdl-38447503
ABSTRACT
BACKGROUND AND AIMS:

Artificial intelligence (AI) models such as GPT-3.5 and GPT-4 have shown promise across various domains but remain underexplored in healthcare. Emergency departments (EDs) rely on established scoring systems, such as the NIH Stroke Scale (NIHSS) and the HEART score, to guide clinical decision-making. This study aims to evaluate the proficiency of GPT-3.5 and GPT-4 against experienced ED physicians in calculating five commonly used medical scores.
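For orientation, the sketch below shows how one of the five scores, the HEART score, is computed as a simple sum of five components, each scored 0-2. The component names and cut-offs follow the published HEART score; the Python helper itself is a hypothetical illustration, not code from the study.

    # Hypothetical HEART score calculator; not from the study.
    # Each of the five components is scored 0, 1, or 2 per the published rule.
    def heart_score(history, ecg, age_points, risk_factors, troponin):
        """Sum the five HEART components.

        history      -- 0 slightly, 1 moderately, 2 highly suspicious
        ecg          -- 0 normal, 1 non-specific changes, 2 significant ST deviation
        age_points   -- 0 if <45 years, 1 if 45-64, 2 if >=65
        risk_factors -- 0 none, 1 one or two, 2 three or more (or known atherosclerosis)
        troponin     -- 0 normal, 1 one to three times normal, 2 over three times normal
        """
        components = (history, ecg, age_points, risk_factors, troponin)
        if any(c not in (0, 1, 2) for c in components):
            raise ValueError("each HEART component is scored 0, 1, or 2")
        return sum(components)

    # Example: a 70-year-old with a moderately suspicious history, normal ECG,
    # two risk factors, and normal troponin scores 1+0+2+1+0 = 4.
    print(heart_score(history=1, ecg=0, age_points=2, risk_factors=1, troponin=0))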

METHODS:

This retrospective study analyzed data from 150 patients who visited the ED over a one-week period. Both AI models and two human physicians were tasked with calculating the NIH Stroke Scale, Canadian Syncope Risk Score, Alvarado Score for Acute Appendicitis, Canadian CT Head Rule, and HEART Score for each patient. Cohen's Kappa statistic was used to assess inter-rater agreement, and AUC values to assess predictive performance.
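A minimal sketch of the two statistics, assuming scikit-learn and made-up stand-in data (the arrays below are illustrative, not the study's data):

    # Inter-rater agreement (Cohen's Kappa) and predictive performance (AUC)
    # for one score rated by a physician and by a GPT model; data are invented.
    from sklearn.metrics import cohen_kappa_score, roc_auc_score

    physician = [4, 6, 2, 7, 3, 5, 1, 6]   # physician-calculated scores
    gpt_model = [4, 5, 2, 7, 3, 6, 1, 6]   # model scores for the same patients
    outcome   = [0, 1, 0, 1, 0, 1, 0, 1]   # binary clinical outcome per patient

    kappa = cohen_kappa_score(physician, gpt_model)    # chance-corrected agreement
    auc_physician = roc_auc_score(outcome, physician)  # how well the score ranks the outcome
    auc_gpt = roc_auc_score(outcome, gpt_model)

    print(f"kappa={kappa:.3f}  AUC(physician)={auc_physician:.3f}  AUC(GPT)={auc_gpt:.3f}")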

RESULTS:

The highest level of agreement was observed between the two human physicians (Kappa = 0.681), while GPT-4 showed moderate to substantial agreement with them (Kappa values of 0.473 and 0.576). GPT-3.5 had the lowest agreement with the human scorers. Human physicians achieved a higher ROC-AUC on three of the five scores, although none of the differences were statistically significant, suggesting that human expertise currently retains a modest predictive edge over these automated systems.
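The verbal labels attached to these Kappa values correspond to the commonly cited Landis and Koch (1977) agreement bands; the helper below is a hypothetical illustration of that mapping, not part of the study:

    # Hypothetical mapping of a Kappa value to the Landis & Koch (1977) bands.
    def kappa_band(kappa):
        if kappa <= 0:
            return "poor"
        for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                             (0.80, "substantial"), (1.00, "almost perfect")]:
            if kappa <= upper:
                return label
        return "almost perfect"

    print(kappa_band(0.681))  # physician vs. physician -> "substantial"
    print(kappa_band(0.473))  # GPT-4 vs. physician     -> "moderate"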

CONCLUSIONS:

While AI models demonstrated some level of concordance with human expertise, they fell short in emulating the complex clinical judgments that physicians make. The study suggests that current AI models may serve as supplementary tools but are not ready to replace human expertise in high-stakes settings like the ED. Further research is needed to explore the capabilities and limitations of AI in emergency medicine.
Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Physicians / Artificial Intelligence Limits: Humans Country/Region as subject: North America Language: English Journal: Am J Emerg Med Year: 2024 Document type: Article
