ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis.

Ahmed, Wasil; Saturno, Michael; Rajjoub, Rami; Duey, Akiro H; Zaidat, Bashar; Hoang, Timothy; Restrepo Mejia, Mateo; Gallate, Zachary S; Shrestha, Nancy; Tang, Justin; Zapolsky, Ivan; Kim, Jun S; Cho, Samuel K

Affiliation

Ahmed W; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Saturno M; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Rajjoub R; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Duey AH; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Zaidat B; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Hoang T; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Restrepo Mejia M; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Gallate ZS; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Shrestha N; Chicago Medical School at Rosalind Franklin University, North Chicago, IL, USA.
Tang J; Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Zapolsky I; Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA.
Kim JS; Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA.
Cho SK; Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA. Samuel.Cho@mountsinai.org.

Eur Spine J; 2024 Mar 15.

Article in English | MEDLINE | ID: mdl-38489044

ABSTRACT

BACKGROUND CONTEXT:

Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision-making. Recent advances in large language models and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, known as ChatGPT, can quickly synthesize information and generate responses grounded in medical literature, which may prove useful in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision-making with regard to degenerative spondylolisthesis.

PURPOSE:

The study aimed to compare ChatGPT's concordance with the recommendations set forth by The North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and assess ChatGPT's accuracy within the context of the most recent literature.

METHODS:

ChatGPT-3.5 and 4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points made in the NASS recommendation. Responses graded "nonconcordant" were further stratified into two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses from ChatGPT-3.5 and 4.0 were compared using Chi-squared tests.

RESULTS:

ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). The categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% (7/19) on clinical questions for which NASS did not provide a clear recommendation. A further breakdown of ChatGPT-3.5's nonconcordance with the guidelines revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held up at 68.4% (13/19, P = 0.104) on clinical questions for which NASS did not provide a clear recommendation.
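
The model comparison reported above can be checked with a short calculation. The sketch below is not the authors' analysis code: it assumes a 2x2 Chi-squared test with Yates continuity correction, which the abstract does not state but which appears consistent with the reported P values. The function name `yates_chi2_2x2` is hypothetical, and the counts are taken from the RESULTS paragraph (concordant versus nonconcordant responses for each model).

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Chi-squared test with Yates continuity correction for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    Assumes every row and column total is nonzero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity-corrected difference, floored at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival
    # function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: ChatGPT-3.5, ChatGPT-4.0. Columns: concordant, nonconcordant.
# All 28 questions: 13/28 versus 19/28 concordant.
stat_all, p_all = yates_chi2_2x2(13, 15, 19, 9)

# Questions without a clear NASS recommendation: 7/19 versus 13/19.
stat_unclear, p_unclear = yates_chi2_2x2(7, 12, 13, 6)

print(f"All questions:           chi2 = {stat_all:.3f}, P = {p_all:.3f}")
print(f"No clear recommendation: chi2 = {stat_unclear:.3f}, P = {p_unclear:.3f}")
```

Under this assumption the two P values round to 0.177 and 0.104, matching the figures in the abstract. Without the continuity correction the same tables give smaller P values, so the choice of correction matters at these sample sizes.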

CONCLUSIONS:

This study sheds light on the duality of LLM applications within clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions for which NASS offered recommendations. However, for questions on which NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and even fabricated data and citations. Thus, clinicians should exercise extreme caution when consulting ChatGPT for clinical recommendations, taking care to verify its reliability within the context of recent literature.

Keywords

Artificial intelligence; Clinical guidelines; Degenerative spondylolisthesis; Large language models; Spine

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Language: English | Journal: Eur Spine J | Journal subject: ORTHOPEDICS | Year: 2024 | Document type: Article | Country of affiliation: United States
