|

Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study.

Goh, Ethan; Gallo, Robert; Hom, Jason; Strong, Eric; Weng, Yingjie; Kerman, Hannah; Cool, Josephine; Kanjee, Zahir; Parsons, Andrew S; Ahuja, Neera; Horvitz, Eric; Yang, Daniel; Milstein, Arnold; Olson, Andrew P J; Rodman, Adam; Chen, Jonathan H.

medRxiv ; 2024 Mar 14.

Article En | MEDLINE | ID: mdl-38559045

Importance: Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning. Objective: To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources. Design: Multi-center, randomized clinical vignette study. Setting: The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions. Participants: Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine. Interventions: Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams. Main Outcomes and Measures: The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis. Results: 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group. Conclusions and Relevance: In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations.

Strong, Eric; DiGiammarino, Alicia; Weng, Yingjie; Kumar, Andre; Hosamani, Poonam; Hom, Jason; Chen, Jonathan H.

JAMA Intern Med ; 183(9): 1028-1030, 2023 09 01.

Article En | MEDLINE | ID: mdl-37459090

This study compares performance on free-response clinical reasoning examinations of first- and second-year medical students vs 2 models of a popular chatbot.

Students, Medical , Humans , Educational Measurement/methods , Physical Examination , Software , Clinical Reasoning

Performance of ChatGPT on free-response, clinical reasoning exams.

Strong, Eric; DiGiammarino, Alicia; Weng, Yingjie; Basaviah, Preetha; Hosamani, Poonam; Kumar, Andre; Nevins, Andrew; Kugler, John; Hom, Jason; Chen, Jonathan H.

medRxiv ; 2023 Mar 29.

Article En | MEDLINE | ID: mdl-37034742

Importance: Studies show that ChatGPT, a general purpose large language model chatbot, could pass the multiple-choice US Medical Licensing Exams, but the model's performance on open-ended clinical reasoning is unknown. Objective: To determine if ChatGPT is capable of consistently meeting the passing threshold on free-response, case-based clinical reasoning assessments. Design: Fourteen multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students between 2019 and 2022. For each case, the questions were run through ChatGPT twice and responses were recorded. Two clinician educators independently graded each run according to a standardized grading rubric. To further assess the degree of variation in ChatGPT's performance, we repeated the analysis on a single high-complexity case 20 times. Setting: A single US medical school. Participants: ChatGPT. Main Outcomes and Measures: Passing rate of ChatGPT's scored responses and the range in model performance across multiple run throughs of a single case. Results: 12 out of the 28 ChatGPT exam responses achieved a passing score (43%) with a mean score of 69% (95% CI: 65% to 73%) compared to the established passing threshold of 70%. When given the same case 20 separate times, ChatGPT's performance on that case varied with scores ranging from 56% to 81%. Conclusions and Relevance: ChatGPT's ability to achieve a passing performance in nearly half of the cases analyzed demonstrates the need to revise clinical reasoning assessments and incorporate artificial intelligence (AI)-related topics into medical curricula and practice.

New Daily Persistent Headache in a Pediatric Population.

Strong, Eric; Pierce, Emily Linda; Langdon, Raquel; Strelzik, Jeffery; McClintock, William; Cameron, Mark; Furda, Mary; DiSabella, Marc.

J Child Neurol ; 36(10): 888-893, 2021 09.

Article En | MEDLINE | ID: mdl-34048280

INTRODUCTION: New daily persistent headache (NDPH) is a primary headache disorder characterized by an intractable, daily, and unremitting headache lasting for at least 3 months. Currently, there are limited studies in the pediatric population describing the characteristics of NDPH. OBJECTIVE: The objective of the current study is to describe the characteristics of NDPH in pediatric patients presenting to a headache program at a tertiary referral center. METHODS: The participants in the current study were pediatric patients who attended the Headache Clinic at Children's National Hospital between 2016 and 2018. All patients seen in the Headache Clinic were enrolled in an institutional review board-approved patient registry. RESULTS: Between 2016 and 2018, NDPH was diagnosed in 245 patients, representing 14% of the total headache population. NDPH patients were predominantly female (78%) and white (72%). The median age was 14.8 years. The median pain intensity was 6 of 10 (standard deviation = 1.52). Most patients reported experiencing migrainous features, namely, photophobia (85%), phonophobia (85%), and a reduced activity level (88%). Overall, 33% of patients had failed at least 1 preventive medication, and 56% had failed at least 1 abortive medication. Furthermore, 36% of patients were additionally diagnosed with medication overuse headache. CONCLUSION: NDPH is a relatively frequent disorder among pediatric chronic headache patients. The vast majority of these patients experience migrainous headache characteristics and associated symptoms and are highly refractory to treatment-as evidenced by a strong predisposition to medication overuse headache and high rates of failed preventive management.

Headache Disorders/epidemiology , Headache Disorders/physiopathology , Adolescent , Child , District of Columbia/epidemiology , Female , Humans , Male

YouTube as an educational resource for learning ECGs.

Strong, Eric.

J Electrocardiol ; 47(5): 758-9, 2014.

Article En | MEDLINE | ID: mdl-24973138

Arrhythmias, Cardiac/diagnosis , Computer-Assisted Instruction/statistics & numerical data , Educational Measurement/statistics & numerical data , Electrocardiography/statistics & numerical data , Internet/statistics & numerical data , Software , User-Computer Interface , Humans