Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023.
Khalpey, Zain; Kumar, Ujjawal; King, Nicholas; Abraham, Alyssa; Khalpey, Amina H.
Affiliation
  • Khalpey Z; Khalpey AI Lab, Department of Cardiothoracic Surgery, HonorHealth, Scottsdale, USA.
  • Kumar U; Department of Research, Applied & Translational AI Research Institute (ATARI), Scottsdale, USA.
  • King N; Khalpey AI Lab, Department of Cardiothoracic Surgery, HonorHealth, Scottsdale, USA.
  • Abraham A; School of Clinical Medicine, University of Cambridge, Cambridge, GBR.
  • Khalpey AH; Department of Research, DataKinetic, Austin, USA.
Cureus; 16(7): e65083, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39171020
ABSTRACT
Objectives: Large language models (LLMs) such as ChatGPT have performed exceptionally well in various fields. Notably, their success in answering postgraduate medical examination questions has been reported previously, suggesting possible utility in surgical education and training. This study evaluated the performance of four LLMs on the American Board of Thoracic Surgery's (ABTS) Self-Education and Self-Assessment in Thoracic Surgery (SESATS) XIII question bank to investigate their potential applications in the education and training of future surgeons.
Methods: The dataset comprised 400 best-of-four questions from the SESATS XIII exam: 220 adult cardiac surgery, 140 general thoracic surgery, 20 congenital cardiac surgery, and 20 cardiothoracic critical care questions. GPT-3.5 (OpenAI, San Francisco, CA) and GPT-4 (OpenAI) were evaluated, as were Med-PaLM 2 (Google Inc., Mountain View, CA) and Claude 2 (Anthropic Inc., San Francisco, CA), and their respective performances were compared across the four subspecialties. Questions requiring visual information, such as clinical images or radiology, were excluded.
Results: GPT-4 demonstrated a significant improvement over GPT-3.5 overall (87.0% vs. 51.8% of questions answered correctly, p < 0.0001). GPT-4 also performed consistently better across all subspecialties, with accuracy ranging from 70.0% to 90.0%, compared to 35.0% to 60.0% for GPT-3.5. With the GPT-4 model, ChatGPT performed significantly better on the adult cardiac and general thoracic subspecialties (p < 0.0001).
Conclusions: Large language models, such as ChatGPT with the GPT-4 model, demonstrate impressive skill in interpreting complex cardiothoracic surgical clinical information, achieving an overall accuracy of nearly 90.0% on the SESATS question bank. Our study shows significant improvement between successive GPT iterations. As LLM technology continues to evolve, its potential use in surgical education, training, and continuing medical education is anticipated to enhance patient outcomes and safety.
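The abstract does not state which significance test produced the p < 0.0001 comparison. As a minimal sketch, the overall GPT-4 vs. GPT-3.5 comparison can be reproduced as a two-proportion chi-square test on the reported accuracies; the exact denominator after excluding image-based questions is not given, so n = 400 below is an assumption for illustration, not the authors' method.

```python
# Hedged sketch: two-proportion chi-square test on the reported overall
# accuracies (87.0% for GPT-4, 51.8% for GPT-3.5). The true number of
# text-only questions after image exclusions is not reported; n = 400 is
# an assumed denominator for illustration only.
from scipy.stats import chi2_contingency

n = 400                           # assumed question count (see caveat above)
gpt4_correct = round(0.870 * n)   # 87.0% reported for GPT-4
gpt35_correct = round(0.518 * n)  # 51.8% reported for GPT-3.5

# 2x2 contingency table: rows = model, columns = (correct, incorrect)
table = [
    [gpt4_correct, n - gpt4_correct],
    [gpt35_correct, n - gpt35_correct],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

Under these assumptions the test yields p far below 0.0001, consistent with the significance level reported in the abstract.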
Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: Cureus Year: 2024 Document type: Article Country of affiliation: United States Country of publication: United States