Accuracy and Reliability of Chatbot Responses to Physician Questions.
Goodman, Rachel S; Patrinely, J Randall; Stone, Cosby A; Zimmerman, Eli; Donald, Rebecca R; Chang, Sam S; Berkowitz, Sean T; Finn, Avni P; Jahangir, Eiman; Scoville, Elizabeth A; Reese, Tyler S; Friedman, Debra L; Bastarache, Julie A; van der Heijden, Yuri F; Wright, Jordan J; Ye, Fei; Carter, Nicholas; Alexander, Matthew R; Choe, Jennifer H; Chastain, Cody A; Zic, John A; Horst, Sara N; Turker, Isik; Agarwal, Rajiv; Osmundson, Evan; Idrees, Kamran; Kiernan, Colleen M; Padmanabhan, Chandrasekhar; Bailey, Christina E; Schlegel, Cameron E; Chambless, Lola B; Gibson, Michael K; Osterman, Travis J; Wheless, Lee E; Johnson, Douglas B.
Affiliation
  • Goodman RS; Vanderbilt University School of Medicine, Nashville, Tennessee.
  • Patrinely JR; Department of Dermatology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Stone CA; Department of Allergy, Pulmonology, and Critical Care, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Zimmerman E; Department of Neurology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Donald RR; Department of Anesthesiology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Chang SS; Department of Urology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Berkowitz ST; Vanderbilt Eye Institute, Department of Ophthalmology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Finn AP; Vanderbilt Eye Institute, Department of Ophthalmology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Jahangir E; Department of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Scoville EA; Department of Gastroenterology, Hepatology, and Nutrition, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Reese TS; Department of Rheumatology and Immunology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Friedman DL; Department of Pediatric Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Bastarache JA; Department of Allergy, Pulmonology, and Critical Care, Vanderbilt University Medical Center, Nashville, Tennessee.
  • van der Heijden YF; Department of Infectious Disease, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Wright JJ; Department of Diabetes, Endocrinology, and Metabolism, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Ye F; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Carter N; Division of Trauma and Surgical Critical Care, University of Miami Miller School of Medicine, Miami, Florida.
  • Alexander MR; Department of Cardiovascular Medicine and Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Choe JH; Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Chastain CA; Department of Infectious Disease, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Zic JA; Department of Dermatology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Horst SN; Department of Gastroenterology, Hepatology, and Nutrition, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Turker I; Department of Cardiology, Washington University School of Medicine in St Louis, St Louis, Missouri.
  • Agarwal R; Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Osmundson E; Department of Radiation Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Idrees K; Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Kiernan CM; Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Padmanabhan C; Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Bailey CE; Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Schlegel CE; Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Chambless LB; Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Gibson MK; Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Osterman TJ; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Wheless LE; Department of Dermatology, Vanderbilt University Medical Center, Nashville, Tennessee.
  • Johnson DB; Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.
JAMA Netw Open. 2023 Oct 2;6(10):e2336483.
Article in En | MEDLINE | ID: mdl-37782499
ABSTRACT
Importance:

Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency.

Objective:

To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information.

Design, Setting, and Participants:

Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard, with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023.
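As a rough illustration of the kind of analysis described above (not the study's actual code or data), the following Python sketch summarizes hypothetical Likert-scale accuracy scores and applies a Kruskal-Wallis test across the three difficulty groups and a Mann-Whitney U test between binary and descriptive questions; all values are invented for demonstration.

```python
# Minimal sketch, assuming hypothetical 6-point accuracy scores.
# Shows descriptive summaries plus the two nonparametric tests named
# in the Methods; this is not the study's analysis code or data.
import numpy as np
from scipy import stats

# Hypothetical accuracy scores grouped by physician-rated difficulty.
easy   = np.array([6, 5, 6, 6, 4, 5, 6, 5])
medium = np.array([5, 6, 4, 5, 6, 5, 3, 6])
hard   = np.array([4, 5, 6, 3, 5, 4, 6, 2])

def summarize(scores):
    """Return median, IQR bounds, mean, and SD, the summaries reported in the abstract."""
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    return median, (q1, q3), scores.mean(), scores.std(ddof=1)

for name, group in [("easy", easy), ("medium", medium), ("hard", hard)]:
    med, (q1, q3), mean, sd = summarize(group)
    print(f"{name}: median {med:.1f} (IQR, {q1:.1f}-{q3:.1f}); mean (SD), {mean:.1f} ({sd:.1f})")

# Compare accuracy across the three difficulty groups.
h_stat, p_difficulty = stats.kruskal(easy, medium, hard)
print(f"Kruskal-Wallis across difficulty levels: H = {h_stat:.2f}, P = {p_difficulty:.3f}")

# Compare accuracy between binary (yes/no) and descriptive questions.
binary      = np.array([6, 6, 5, 4, 6, 5, 6, 3])
descriptive = np.array([5, 4, 6, 5, 3, 6, 5, 4])
u_stat, p_type = stats.mannwhitneyu(binary, descriptive, alternative="two-sided")
print(f"Mann-Whitney U, binary vs descriptive: U = {u_stat:.1f}, P = {p_type:.3f}")
```

In the study itself, comparisons of this type were applied to the full set of 284 physician-graded questions.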

Main Outcomes and Measures:

Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses.

Results:

Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct), with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive), with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores, 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later, with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), was regenerated and rescored using version 4, with improvement (mean [SD] accuracy score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002).

Conclusions and Relevance:

In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Physicians / Artificial Intelligence Type of study: Observational_studies / Prevalence_studies / Prognostic_studies / Risk_factors_studies Limits: Humans Language: En Journal: JAMA Netw Open Year: 2023 Document type: Article
