ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam.
Fiedler, Benjamin; Azua, Eric N; Phillips, Todd; Ahmed, Adil Shahzad.
Affiliation
  • Fiedler B; Baylor College of Medicine, Joseph Barnhart Department of Orthopedic Surgery, Houston, TX, USA. Electronic address: Benjamin.Fiedler@bcm.edu.
  • Azua EN; Baylor College of Medicine, Joseph Barnhart Department of Orthopedic Surgery, Houston, TX, USA.
  • Phillips T; Baylor College of Medicine, Joseph Barnhart Department of Orthopedic Surgery, Houston, TX, USA.
  • Ahmed AS; Baylor College of Medicine, Joseph Barnhart Department of Orthopedic Surgery, Houston, TX, USA.
J Shoulder Elbow Surg ; 33(9): 1888-1893, 2024 Sep.
Article in En | MEDLINE | ID: mdl-38580067
ABSTRACT

BACKGROUND:

While multiple studies have tested the ability of large language models (LLMs), such as ChatGPT, to pass standardized medical exams at different levels of training, LLMs have never been tested on surgical sub-specialty examinations, such as the American Shoulder and Elbow Surgeons (ASES) Maintenance of Certification (MOC). The purpose of this study was to compare results of ChatGPT 3.5, GPT-4, and fellowship-trained surgeons on the 2023 ASES MOC self-assessment exam.

METHODS:

ChatGPT 3.5 and GPT-4 were subjected to the same set of text-only questions from the ASES MOC exam, and GPT-4 was additionally subjected to image-based MOC exam questions. Question responses from both models were compared against the correct answers. Performance of both models was compared to the corresponding average human performance on the same question subsets. One-sided proportion z-tests were utilized to analyze the data.
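
For reference, below is a minimal sketch of a one-sided two-proportion z-test of the kind described above, written in Python with SciPy. The question counts in the example are hypothetical placeholders for illustration, not figures reported in the study.

```python
# Minimal sketch of a one-sided (upper-tail) two-proportion z-test.
# All counts below are hypothetical placeholders, not data from the study.
from math import sqrt

from scipy.stats import norm


def one_sided_two_proportion_ztest(correct_a, n_a, correct_b, n_b):
    """Test H1: group A's proportion correct exceeds group B's, using a pooled z-test."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)           # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error under H0
    z = (p_a - p_b) / se
    return z, norm.sf(z)                                     # upper-tail (one-sided) p-value


# Hypothetical usage: average human vs. model score on an assumed 60-question subset.
z, p = one_sided_two_proportion_ztest(correct_a=45, n_a=60, correct_b=36, n_b=60)
print(f"z = {z:.3f}, one-sided P = {p:.3f}")
```
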

RESULTS:

Humans performed significantly better than ChatGPT 3.5 on exclusively text-based questions (76.4% vs. 60.8%, P = .044). Humans also performed significantly better than GPT-4 on image-based questions (73.9% vs. 53.2%, P = .019). There was no significant difference between humans and GPT-4 on text-based questions (76.4% vs. 66.7%, P = .136). Accounting for all questions, humans significantly outperformed GPT-4 (75.3% vs. 60.2%, P = .012). GPT-4 did not perform significantly better than ChatGPT 3.5 on text-only questions (66.7% vs. 60.8%, P = .268).

DISCUSSION:

Although human performance was superior overall, ChatGPT demonstrated the capacity to analyze orthopedic information and answer specialty-specific questions on the ASES MOC exam for both text- and image-based questions. With continued advancements in deep learning, LLMs may someday rival the exam performance of fellowship-trained surgeons.

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Certification / Clinical Competence / Educational Measurement Limits: Humans Country/Region as subject: North America Language: En Journal: J Shoulder Elbow Surg Journal subject: ORTHOPEDICS Year: 2024 Document type: Article