ChatGPT efficacy for answering musculoskeletal anatomy questions: a study evaluating quality and consistency between raters and timepoints.

Mantzou, Nikolaos; Ediaroglou, Vasileios; Drakonaki, Elena; Syggelos, Spyros A; Karageorgos, Filippos F; Totlis, Trifon

Affiliation

Mantzou N; School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
Ediaroglou V; School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
Drakonaki E; Department of Anatomy, Clinical Radiologist University of Crete, Crete, Greece.
Syggelos SA; Department of Anatomy-Histology-Embryology, School of Medicine, University of Patras, Patras, Greece.
Karageorgos FF; School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.
Totlis T; Department of Anatomy and Surgical Anatomy, School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece. totlis@auth.gr.

Surg Radiol Anat; 2024 Sep 12.

Article in English | MEDLINE | ID: mdl-39264461

ABSTRACT

PURPOSE:

There is increasing interest in the use of digital platforms such as ChatGPT for anatomy education. This study aims to evaluate the efficacy of ChatGPT in providing accurate and consistent responses to questions focusing on musculoskeletal anatomy across various time points (hours and days).

METHODS:

Six anatomy-related questions were posed to ChatGPT 3.5 at 4 different timepoints. All answers were rated blindly by 3 expert raters for quality on a 5-point Likert scale. A difference of 0 or 1 points in Likert scores between raters was considered agreement; the same difference between timepoints was considered consistent, indicating good reproducibility.
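
The 0-or-1-point criterion described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the score matrix, and the choice to compare timepoints by their mean score across raters are all assumptions made for illustration, since the abstract does not specify how timepoints were compared.

```python
def within_one_point(scores, tolerance=1):
    """True if the largest and smallest score differ by at most `tolerance`."""
    return max(scores) - min(scores) <= tolerance


def raters_agree(scores_by_timepoint):
    """Check rater agreement separately at each timepoint."""
    return [within_one_point(row) for row in scores_by_timepoint]


def timepoints_consistent(scores_by_timepoint):
    """Check consistency across timepoints.

    Assumption: each timepoint is summarized by its mean score across
    raters; the abstract does not state the summary that was used.
    """
    means = [sum(row) / len(row) for row in scores_by_timepoint]
    return within_one_point(means)


# Hypothetical Likert scores for ONE question (not data from the study):
# rows are the 4 timepoints, columns are the 3 raters.
hypothetical_scores = [
    [4, 5, 4],
    [4, 4, 5],
    [3, 4, 4],
    [4, 4, 3],
]

print(raters_agree(hypothetical_scores))
print(timepoints_consistent(hypothetical_scores))
```

With these made-up scores, every timepoint shows rater agreement and the question counts as consistent, because no pair of compared values differs by more than 1 point.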

RESULTS:

The quality of the answers varied widely, ranging from extremely good to very poor. Consistency between timepoints also varied. Answers were rated as good quality (≥ 3 on the Likert scale) in 50% of cases (3/6) and as consistent in 66.6% of cases (4/6). The low-quality answers contained significant mistakes, conflicting data, or missing information.

CONCLUSION:

As of the time of this article, the quality and consistency of ChatGPT v3.5 answers are variable, which limits its utility as an independent and reliable resource for learning musculoskeletal anatomy. Validating information against the anatomical literature is highly recommended.

Keywords

Anatomy; Artificial intelligence; ChatGPT; Large language models

Full text: 1 Database: MEDLINE Language: En Journal: Surg Radiol Anat Journal subject: ANATOMY / RADIOLOGY Year: 2024 Document type: Article