Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments.
Beaulieu-Jones, Brendin R; Berrigan, Margaret T; Shah, Sahaj; Marwaha, Jayson S; Lai, Shuo-Lun; Brat, Gabriel A.
Affiliation
  • Beaulieu-Jones BR; Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. Electronic address: https://twitter.com/bratogram.
  • Berrigan MT; Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA.
  • Shah S; Geisinger Commonwealth School of Medicine, Scranton, PA.
  • Marwaha JS; Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan.
  • Lai SL; Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan.
  • Brat GA; Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. Electronic address: bbeaulieujones@gmail.com.
Surgery; 175(4): 936-942, 2024 Apr.
Article in En | MEDLINE | ID: mdl-38246839
ABSTRACT

BACKGROUND:

Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.

METHODS:

We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to here as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized the reasons for model errors and the stability of performance on repeat queries.
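Although the study entered questions manually through the ChatGPT interface, a repeat-query stability protocol like the one described can be sketched programmatically. The snippet below is a minimal illustration, not the authors' method: it uses the OpenAI Python client rather than the web interface, and the model name, prompt wording, repeat count, and example item are assumptions.

```python
# Minimal sketch of a repeat-query stability check.
# NOT the authors' protocol: the study used the ChatGPT web interface;
# the model name, prompt text, repeat count, and example item are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_repeatedly(question: str, n_repeats: int = 2) -> list[str]:
    """Submit the same question n_repeats times and collect the answers."""
    answers = []
    for _ in range(n_repeats):
        response = client.chat.completions.create(
            model="gpt-4",  # assumed stand-in for "ChatGPT-4"
            messages=[{"role": "user", "content": question}],
        )
        answers.append(response.choices[0].message.content.strip())
    return answers


# Multiple-choice format: question stem plus lettered options.
# (Hypothetical placeholder item, not drawn from either question bank.)
mc_item = (
    "Which electrolyte abnormality is most associated with prolonged "
    "vomiting?\nA) Hyperkalemia\nB) Hypochloremic metabolic alkalosis\n"
    "C) Hypercalcemia\nD) Metabolic acidosis"
)

answers = ask_repeatedly(mc_item)
print("Stable across repeats:", len(set(answers)) == 1)
```

In practice, comparing raw free-text outputs is crude; answer extraction would need a parser to map each response to a choice letter before judging whether the selected answer varied across repeats.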

RESULTS:

A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions.
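As a quick consistency check on these figures: the error-category counts (16, 11, 6 at 36.4%, 25.0%, 13.6%) imply a denominator of roughly 44 incorrect first-query responses, which also matches the 36.4% of incorrect answers that varied on repeat query (16/44) and the 6/16 whose accuracy changed. Note that the total of 44 is an inference from the percentages, not a number stated in the abstract.

```python
# Arithmetic check of the repeat-query statistics.
# The denominator of 44 is inferred (16 / 0.364 ≈ 44), not stated in the abstract.
incorrect_first_query = 44
varied_on_repeat = 16
accuracy_changed = 6

print(f"varied on repeat: {varied_on_repeat / incorrect_first_query:.1%}")  # 36.4%
print(f"accuracy changed: {accuracy_changed / varied_on_repeat:.1%}")       # 37.5%
```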

CONCLUSION:

Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than on open-ended questions, raising questions about its readiness for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants further investigation, including efforts to train large language models to provide the safe and consistent responses required for clinical use. Despite near or above human-level performance on question banks, these observations leave it unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
Subjects

Full text: 1 Database: MEDLINE Main subject: Artificial Intelligence / Surgeons Study type: Prognostic_studies Limit: Humans Language: En Journal: Surgery Publication year: 2024 Document type: Article
