Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia.
Nguyen, Teresa P; Carvalho, Brendan; Sukhdeo, Hannah; Joudi, Kareem; Guo, Nan; Chen, Marianne; Wolpaw, Jed T; Kiefer, Jesse J; Byrne, Melissa; Jamroz, Tatiana; Mootz, Allison A; Reale, Sharon C; Zou, James; Sultan, Pervez.
Affiliation
  • Nguyen TP; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Carvalho B; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Sukhdeo H; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Joudi K; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Guo N; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Chen M; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
  • Wolpaw JT; Department of Anesthesiology and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Kiefer JJ; Department of Anesthesiology and Critical Care Medicine, University of Pennsylvania School of Medicine, Philadelphia, PA, USA.
  • Byrne M; Department of Anesthesiology, Perioperative and Pain Medicine, University of Michigan Ann Arbor School of Medicine, Ann Arbor, MI, USA.
  • Jamroz T; Department of Anesthesiology, Perioperative and Pain Medicine, Cleveland Clinic Foundation and Hospitals, Cleveland, OH, USA.
  • Mootz AA; Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Harvard School of Medicine, Boston, MA, USA.
  • Reale SC; Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Harvard School of Medicine, Boston, MA, USA.
  • Zou J; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
  • Sultan P; Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.
BJA Open ; 10: 100280, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38764485
ABSTRACT

Background:

Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.

Methods:

Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were each scored from 1 (worst) to 5 (best).
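As a rough illustration of the rating scheme described above (not the authors' study instrument or analysis code), the Python sketch below records hypothetical 1-5 expert scores per chatbot and category; the data structure and helper function are assumptions introduced only for illustration.

# Hypothetical sketch of the 1-5 rating scheme described in the Methods;
# the data structure and helper below are illustrative assumptions, not study code.
from collections import defaultdict

CHATBOTS = ["ChatGPT4", "Bard", "Bing Chat"]
CATEGORIES = ["accuracy", "comprehensiveness", "safety",          # medical content quality
              "understandability", "empathy/respect", "ethics"]   # communication quality

# ratings[chatbot][category] -> list of 1-5 scores (one per expert per question)
ratings = defaultdict(lambda: defaultdict(list))

def record_score(chatbot, category, score):
    """Store one expert's 1-5 rating of a chatbot answer in a given category."""
    assert chatbot in CHATBOTS and category in CATEGORIES and 1 <= score <= 5
    ratings[chatbot][category].append(score)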

Results:

ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score ≥4 from 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 from 42%, 30%, and 12% of experts, respectively), and safety (score ≥4 from 50%, 40%, and 28% of experts, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4 3 [2-4] vs Bing Chat 2 [2-3], P<0.001; Bard 3 [2-4] vs Bing Chat 2 [2-3], P=0.002). All large language model chatbots performed well, with no statistical difference, in understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465).
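A minimal sketch, under stated assumptions, of how summary statistics in the form "median [inter-quartile range]" and between-chatbot comparisons like those reported above could be computed from 1-5 ratings; the use of SciPy's Kruskal-Wallis and Mann-Whitney U tests is an assumption, as the record does not state which tests the authors applied, and the example ratings are invented.

# Hedged sketch: summarise 1-5 ratings and compare chatbots for one category.
# Kruskal-Wallis / Mann-Whitney U are assumptions; the record does not specify
# the authors' statistical methods, and the demo ratings below are invented.
import numpy as np
from scipy import stats

def median_iqr(scores):
    """Format ratings as 'median [q1-q3]', the form used in the Results."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return f"{med:g} [{q1:g}-{q3:g}]"

def compare_chatbots(scores_by_chatbot):
    """Return an omnibus Kruskal-Wallis P-value and pairwise Mann-Whitney U P-values."""
    names = list(scores_by_chatbot)
    groups = [scores_by_chatbot[n] for n in names]
    _, p_omnibus = stats.kruskal(*groups)
    pairwise = {(names[i], names[j]): stats.mannwhitneyu(groups[i], groups[j]).pvalue
                for i in range(len(names)) for j in range(i + 1, len(names))}
    return p_omnibus, pairwise

# Invented demo ratings, not the study data:
demo = {"ChatGPT4": [4, 3, 4, 5, 3], "Bard": [4, 3, 4, 3, 4], "Bing Chat": [3, 2, 4, 2, 3]}
print({bot: median_iqr(s) for bot, s in demo.items()})
print(compare_chatbots(demo))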

Conclusions:

In answering patients' frequently asked questions about anaesthesia, the chatbots performed well on communication metrics but suboptimally on medical content metrics. Overall, ChatGPT4 and Bard were comparable with each other, and both outperformed Bing Chat.
Full text: 1 Database: MEDLINE Language: English Journal: BJA Open Year: 2024 Document type: Article Country of affiliation: United States