Performance of ChatGPT on American Board of Surgery In-Training Examination Preparation Questions.
Tran, Catherine G; Chang, Jeremy; Sherman, Scott K; De Andrade, James P.
Affiliation
  • Tran CG; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • Chang J; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • Sherman SK; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • De Andrade JP; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa. Electronic address: james-deandrade@uiowa.edu.
J Surg Res; 299: 329-335, 2024 Jul.
Article in En | MEDLINE | ID: mdl-38788470
ABSTRACT

INTRODUCTION:

Chat Generative Pretrained Transformer (ChatGPT) is a large language model capable of generating human-like text. This study sought to evaluate ChatGPT's performance on Surgical Council on Resident Education (SCORE) self-assessment questions.

METHODS:

General surgery multiple-choice questions were randomly selected from the SCORE question bank. Questions were presented to ChatGPT (GPT-3.5, April-May 2023), and its responses were recorded.
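The abstract does not specify whether the authors used the ChatGPT web interface or the API, nor the exact prompt format. The sketch below is a minimal illustration, under those assumptions, of how one might present a multiple-choice question to GPT-3.5 through the OpenAI Python client and record the free-text response for grading; the example question is hypothetical and not from the SCORE bank.

```python
# Minimal sketch (assumption): the study's actual interface and prompt wording
# are not stated in the abstract. This only illustrates the general protocol of
# presenting one multiple-choice question to GPT-3.5 and recording its answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_question(stem: str, choices: dict[str, str]) -> str:
    """Send one multiple-choice question and return the model's full answer text."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following general surgery multiple-choice question. "
        "State the single best answer choice and explain your reasoning.\n\n"
        f"{stem}\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Hypothetical example question (not from the SCORE question bank):
answer = ask_question(
    "Which acid-base disturbance is most associated with prolonged vomiting?",
    {
        "A": "Hyperchloremic metabolic acidosis",
        "B": "Hypochloremic metabolic alkalosis",
        "C": "Respiratory acidosis",
        "D": "Respiratory alkalosis",
    },
)
print(answer)  # the recorded response would then be graded against the answer key
```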

RESULTS:

ChatGPT correctly answered 123 of 200 questions (62%). ChatGPT scored lowest on biliary (2/8 questions correct, 25%), surgical critical care (3/10, 30%), general abdomen (1/3, 33%), and pancreas (1/3, 33%) topics. ChatGPT scored higher on biostatistics (4/4 correct, 100%), fluid/electrolytes/acid-base (4/4, 100%), and small intestine (8/9, 89%) questions. ChatGPT supported its answers with thorough, structured explanations. It scored 56% on ethics questions and provided coherent explanations regarding end-of-life discussions, communication with coworkers and patients, and informed consent. For many questions answered incorrectly, ChatGPT provided cogent yet factually incorrect descriptions, including of anatomy and operative steps. In two instances, it gave a correct explanation but chose the wrong answer. It did not answer two questions, stating that it needed additional information to determine the next best step in treatment.

CONCLUSIONS:

ChatGPT answered 62% of SCORE questions correctly. It performed better on questions requiring standard recall but struggled with higher-level questions requiring complex clinical decision making, despite providing detailed rationales for its responses. Given its mediocre performance on this question set and its sometimes confidently worded yet factually inaccurate responses, caution should be used when interpreting ChatGPT's answers to general surgery questions.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: General Surgery / Internship and Residency Limits: Humans Country/Region as subject: North America Language: En Journal: J Surg Res Year: 2024 Document type: Article