Performance of ChatGPT on American Board of Surgery In-Training Examination Preparation Questions.
Tran, Catherine G; Chang, Jeremy; Sherman, Scott K; De Andrade, James P.
Affiliation
  • Tran CG; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • Chang J; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • Sherman SK; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa.
  • De Andrade JP; Department of Surgery, University of Iowa Hospitals & Clinics, Iowa City, Iowa. Electronic address: james-deandrade@uiowa.edu.
J Surg Res; 299: 329-335, 2024 Jul.
Article in En | MEDLINE | ID: mdl-38788470
ABSTRACT

INTRODUCTION:

Chat Generative Pretrained Transformer (ChatGPT) is a large language model capable of generating human-like text. This study sought to evaluate ChatGPT's performance on Surgical Council on Resident Education (SCORE) self-assessment questions.

METHODS:

General surgery multiple-choice questions were randomly selected from the SCORE question bank. Questions were presented to ChatGPT (GPT-3.5, April-May 2023), and its responses were recorded.
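The abstract does not specify whether the authors used the ChatGPT web interface or the API, nor the exact prompt format. The sketch below is a minimal illustration, under those assumptions, of how one might present a multiple-choice question to GPT-3.5 through the OpenAI Python client and record the free-text response for grading; the example question is hypothetical and not from the SCORE bank.

```python
# Minimal sketch (assumption): the study's actual interface and prompt wording
# are not stated in the abstract. This only illustrates the general protocol of
# presenting one multiple-choice question to GPT-3.5 and recording its answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_question(stem: str, choices: dict[str, str]) -> str:
    """Send one multiple-choice question and return the model's full answer text."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following general surgery multiple-choice question. "
        "State the single best answer choice and explain your reasoning.\n\n"
        f"{stem}\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Hypothetical example question (not from the SCORE question bank):
answer = ask_question(
    "Which acid-base disturbance is most associated with prolonged vomiting?",
    {
        "A": "Hyperchloremic metabolic acidosis",
        "B": "Hypochloremic metabolic alkalosis",
        "C": "Respiratory acidosis",
        "D": "Respiratory alkalosis",
    },
)
print(answer)  # the recorded response would then be graded against the answer key
```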

RESULTS:

ChatGPT correctly answered 123 of 200 questions (62%). ChatGPT scored lowest on biliary (2/8 questions correct, 25%), surgical critical care (3/10, 30%), general abdomen (1/3, 33%), and pancreas (1/3, 33%) topics. ChatGPT scored higher on biostatistics (4/4 correct, 100%), fluid/electrolytes/acid-base (4/4, 100%), and small intestine (8/9, 89%) questions. ChatGPT supported its answers with thorough, structured explanations. It scored 56% on ethics questions and provided coherent explanations regarding end-of-life discussions, communication with coworkers and patients, and informed consent. For many questions answered incorrectly, ChatGPT provided cogent yet factually incorrect descriptions, including of anatomy and operative steps. In two instances, it gave a correct explanation but chose the wrong answer. It did not answer two questions, stating that it needed additional information to determine the next best step in treatment.

CONCLUSIONS:

ChatGPT answered 62% of SCORE questions correctly. It performed better on questions requiring standard recall but struggled with higher-level questions requiring complex clinical decision making, despite providing detailed rationales for its responses. Given its mediocre performance on this question set and its sometimes confidently worded yet factually inaccurate responses, caution should be used when interpreting ChatGPT's answers to general surgery questions.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: General Surgery / Internship and Residency Limits: Humans Country/Region as subject: North America Language: En Journal: J Surg Res Year: 2024 Document type: Article