Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does Not Yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis.
Yang, JaeWon; Ardavanis, Kyle S; Slack, Katherine E; Fernando, Navin D; Della Valle, Craig J; Hernandez, Nicholas M.
Affiliation
  • Yang J; Department of Orthopaedic Surgery, University of Washington, Seattle, Washington.
  • Ardavanis KS; Department of Orthopaedic Surgery, Madigan Medical Center, Tacoma, Washington.
  • Slack KE; Elson S. Floyd College of Medicine, Washington State University, Spokane, Washington.
  • Fernando ND; Department of Orthopaedic Surgery, University of Washington, Seattle, Washington.
  • Della Valle CJ; Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois.
  • Hernandez NM; Department of Orthopaedic Surgery, University of Washington, Seattle, Washington.
J Arthroplasty; 39(5): 1184-1190, 2024 May.
Article in En | MEDLINE | ID: mdl-38237878
ABSTRACT

BACKGROUND:

Advancements in artificial intelligence (AI) have led to the creation of large language models (LLMs), such as Chat Generative Pretrained Transformer (ChatGPT) and Bard, that analyze online resources to synthesize responses to user queries. Despite their popularity, the accuracy of LLM responses to medical questions remains unknown. This study aimed to compare the responses of ChatGPT and Bard regarding treatments for hip and knee osteoarthritis against the recommendations of the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs).

METHODS:

ChatGPT (OpenAI) and Bard (Google) were each queried regarding 20 treatments (10 for hip and 10 for knee osteoarthritis) from the AAOS CPGs. Responses were classified by 2 reviewers as being in "Concordance," "Discordance," or "No Concordance" with the AAOS CPGs. Cohen's kappa coefficient was used to assess inter-rater reliability, and chi-squared analyses were used to compare responses between the LLMs.
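
For readers who want to reproduce this kind of analysis, the sketch below shows how the inter-rater statistic could be computed in Python. The ratings are illustrative placeholders (the reviewers' actual classifications are not given in the abstract); only the 3-category coding scheme comes from the paper, and the sketch assumes scikit-learn is available.

```python
# Illustrative sketch, not the authors' actual code. The ratings below are
# made-up placeholders; only the coding scheme ("Concordance" /
# "Discordance" / "No Concordance") is taken from the paper's methods.
from sklearn.metrics import cohen_kappa_score

# Hypothetical classifications of the 20 treatment responses by 2 reviewers.
reviewer_1 = ["Concordance"] * 15 + ["Discordance"] * 4 + ["No Concordance"]
reviewer_2 = ["Concordance"] * 14 + ["Discordance"] * 5 + ["No Concordance"]

# Cohen's kappa: agreement between the reviewers beyond chance.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Inter-rater reliability (Cohen's kappa): {kappa:.2f}")
```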

RESULTS:

Overall, ChatGPT and Bard provided responses concordant with the AAOS CPGs for 16 (80%) and 12 (60%) of the 20 treatments, respectively. Notably, ChatGPT and Bard encouraged the use of non-recommended treatments in 30% and 60% of queries, respectively. There were no differences in performance when evaluating by joint or by recommended versus non-recommended treatments. Studies were referenced in 6 (30%) of the Bard responses and in none (0%) of the ChatGPT responses. Of the 6 Bard responses that referenced studies, the cited study could be identified for only 1 (16.7%). Of the remaining responses, 2 (33.3%) cited studies in journals that do not exist, 2 (33.3%) cited studies that could not be found with the information given, and 1 (16.7%) provided links to unrelated studies.
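
As a quick illustrative check of the headline concordance rates (16/20 for ChatGPT vs. 12/20 for Bard), a chi-squared test on the corresponding 2x2 table can be run as below. The abstract does not report this particular comparison or its p-value, so the output is a demonstration of the method rather than a result from the paper; the sketch assumes SciPy is available.

```python
from scipy.stats import chi2_contingency

# 2x2 table built from the abstract's counts:
# rows = LLM, columns = (concordant with AAOS CPGs, not concordant).
table = [[16, 4],   # ChatGPT: 16 of 20 concordant
         [12, 8]]   # Bard: 12 of 20 concordant
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction applied by default for 2x2
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")   # p > 0.05 here: difference not significant
```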

CONCLUSIONS:

Neither ChatGPT nor Bard consistently provides responses that align with the AAOS CPGs. Consequently, physicians and patients should temper their expectations of the guidance that AI platforms can currently provide.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Osteoarthritis, Hip / Osteoarthritis, Knee Type of study: Guideline Limits: Humans Language: En Journal: J Arthroplasty Year: 2024 Document type: Article
