
Accuracy and Reliability of Chatbot Responses to Physician Questions.

Goodman, Rachel S; Patrinely, J Randall; Stone, Cosby A; Zimmerman, Eli; Donald, Rebecca R; Chang, Sam S; Berkowitz, Sean T; Finn, Avni P; Jahangir, Eiman; Scoville, Elizabeth A; Reese, Tyler S; Friedman, Debra L; Bastarache, Julie A; van der Heijden, Yuri F; Wright, Jordan J; Ye, Fei; Carter, Nicholas; Alexander, Matthew R; Choe, Jennifer H; Chastain, Cody A; Zic, John A; Horst, Sara N; Turker, Isik; Agarwal, Rajiv; Osmundson, Evan; Idrees, Kamran; Kiernan, Colleen M; Padmanabhan, Chandrasekhar; Bailey, Christina E; Schlegel, Cameron E; Chambless, Lola B; Gibson, Michael K; Osterman, Travis J; Wheless, Lee E; Johnson, Douglas B.

JAMA Netw Open ; 6(10): e2336483, 2023 10 02.

Article in English | MEDLINE | ID: mdl-37782499

ABSTRACT

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency.

Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information.

Design, Setting, and Participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023.

Main Outcomes and Measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses.

Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), were regenerated and rescored using version 4 with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002).

Conclusions and Relevance: In this cross-sectional study, the chatbot generated largely accurate responses to diverse medical queries as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.
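The abstract reports Likert scores as median (IQR) and compares groups with the Mann-Whitney U test. The following is a minimal sketch of those two computations using only the Python standard library. The function names and the example scores are hypothetical and are not the study's data or code; the quartile convention (`method="inclusive"`) is an assumption, and the authors may have used a different one.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, Q1, Q3) for a list of Likert scores."""
    # Assumption: 'inclusive' quartiles; other conventions give slightly different IQRs.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


def midranks(values):
    """Rank values from 1, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return the U statistics (U_a, U_b) for two independent samples."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical accuracy scores (1-6 scale) for two question types.
binary_scores = [6, 6, 5, 4, 6, 3]
descriptive_scores = [5, 4, 6, 2, 5, 3]

print(median_iqr(binary_scores))
print(mann_whitney_u(binary_scores, descriptive_scores))
```

This sketch stops at the U statistic. Obtaining a P value additionally requires the null distribution of U (exact or a tie-corrected normal approximation), which a statistics package would normally provide.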

Subject(s)

Artificial Intelligence , Physicians , Humans , Cross-Sectional Studies , Reproducibility of Results , Software

Predicted expression of genes involved in the thiopurine metabolic pathway and azathioprine discontinuation due to myelotoxicity.

Daniel, Laura L; Dickson, Alyson L; Zanussi, Jacy T; Miller-Fleming, Tyne W; Straub, Peter S; Wei, Wei-Qi; Plummer, W Dale; Dupont, William D; Liu, Ge; Anandi, Prathima; Reese, Tyler S; Birdwell, Kelly A; Kawai, Vivian K; Hung, Adriana M; Cox, Nancy J; Feng, QiPing; Stein, C Michael; Chung, Cecilia P.

Clin Transl Sci ; 15(4): 859-865, 2022 04.

Article in English | MEDLINE | ID: mdl-35118815

ABSTRACT

TPMT and NUDT15 variants explain less than 25% of azathioprine-associated myelotoxicity. There are 25 additional genes in the thiopurine pathway that could also contribute to azathioprine myelotoxicity. We hypothesized that among TPMT and NUDT15 normal metabolizers, a score combining the genetically predicted expression of other proteins in the thiopurine pathway would be associated with a higher risk for azathioprine discontinuation due to myelotoxicity. We conducted a retrospective cohort study of new users of azathioprine who were normal TPMT and NUDT15 metabolizers. In 1201 White patients receiving azathioprine for an inflammatory disease, we used relaxed Least Absolute Shrinkage and Selection Operator (LASSO) regression to select genes that built a score for discontinuing azathioprine due to myelotoxicity. The score incorporated the predicted expression of AOX1 and NME1. Patients in the highest score tertile had a higher risk of discontinuing azathioprine compared to those in the lowest tertile (hazard ratio [HR] = 2.15, 95% confidence interval [CI] = 1.11-4.19, p = 0.024). Results remained significant after adjusting for a propensity score, including sex, tertile of calendar year at initial dose, initial dose, age at baseline, indication, prior TPMT testing, and the first 10 principal components of the genetic data (HR = 2.11, 95% CI = 1.08-4.13, p = 0.030). We validated the results in a cohort of all other patients receiving azathioprine, namely non-White patients and those receiving azathioprine to prevent transplant rejection (N = 517; HR = 2.00, 95% CI = 1.09-3.65, p = 0.024). In conclusion, among patients who were TPMT and NUDT15 normal metabolizers, a score combining the predicted expression of AOX1 and NME1 was associated with an increased risk for discontinuing azathioprine due to myelotoxicity.
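As a reading aid, the reported hazard ratios, confidence intervals, and p values can be checked against each other. The sketch below assumes the 95% CI is a symmetric Wald interval on the log-hazard scale, recovers the standard error from the interval width, and derives an approximate two-sided p value. The function name is hypothetical, and the abstract does not state which test produced the reported p values, so small differences from them are expected.

```python
import math


def wald_p_from_hr(hr, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z, and two-sided p from an HR and its 95% CI.

    Assumption: the CI is exp(log(HR) +/- z_crit * SE), i.e. a Wald interval.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Values as reported in the abstract: (HR, CI lower bound, CI upper bound).
reported = {
    "unadjusted": (2.15, 1.11, 4.19),
    "adjusted": (2.11, 1.08, 4.13),
    "validation": (2.00, 1.09, 3.65),
}

for label, (hr, low, high) in reported.items():
    se, z, p = wald_p_from_hr(hr, low, high)
    print(f"{label}: SE={se:.3f}, z={z:.2f}, p={p:.3f}")
```

Under this assumption the recovered p values land close to the reported ones, which suggests the intervals and p values in the abstract are mutually consistent.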

Subject(s)

Azathioprine , Pyrophosphatases , Azathioprine/adverse effects , Humans , Metabolic Networks and Pathways/genetics , Methyltransferases/genetics , Pyrophosphatases/genetics , Retrospective Studies
