1.
Ann Fam Med; 22(2): 113-120, 2024.
Article in English | MEDLINE | ID: mdl-38527823

ABSTRACT

PURPOSE: Worldwide clinical knowledge is expanding rapidly, but physicians have sparse time to review scientific literature. Large language models (eg, Chat Generative Pretrained Transformer [ChatGPT]) might help summarize and prioritize research articles to review. However, large language models sometimes "hallucinate" incorrect information.

METHODS: We evaluated ChatGPT's ability to summarize 140 peer-reviewed abstracts from 14 journals. Physicians rated the quality, accuracy, and bias of the ChatGPT summaries. We also compared human ratings of relevance to various areas of medicine with ChatGPT relevance ratings.

RESULTS: ChatGPT produced summaries that were 70% shorter (mean abstract length decreased from 2,438 to 739 characters). Summaries were nevertheless rated as high quality (median score 90, interquartile range [IQR] 87.0-92.5; scale 0-100), high accuracy (median 92.5, IQR 89.0-95.0), and low bias (median 0, IQR 0-7.5). Serious inaccuracies and hallucinations were uncommon. Classification of the relevance of entire journals to various fields of medicine closely mirrored physician classifications (nonlinear standard error of the regression [SER] 8.6 on a scale of 0-100). However, relevance classification for individual articles was much more modest (SER 22.3).

CONCLUSIONS: Summaries generated by ChatGPT were 70% shorter than the mean abstract and were characterized by high quality, high accuracy, and low bias. Conversely, ChatGPT had only modest ability to classify the relevance of articles to medical specialties. We suggest that ChatGPT can help family physicians accelerate review of the scientific literature and have developed software (pyJournalWatch) to support this application. Life-critical medical decisions should remain based on full, critical, and thoughtful evaluation of the full text of research articles in context with clinical guidelines.


Subject(s)
Medicine; Humans; Physicians, Family
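
The record above evaluates ChatGPT-generated summaries of journal abstracts. For readers who want to experiment with the general approach, the sketch below shows one way to ask a large language model to summarize an abstract and rate its relevance to a specialty. It is a minimal sketch, not the authors' pyJournalWatch tool; the OpenAI Python client, the model name, and the prompt wording are all assumptions.

```python
# Minimal sketch (not the authors' pyJournalWatch software) of asking an LLM to
# summarize a journal abstract and rate its relevance to a specialty.
# Assumes the OpenAI Python client (openai>=1.0) and a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_abstract(abstract: str, specialty: str = "family medicine") -> dict:
    prompt = (
        "Summarize the following peer-reviewed abstract in 2-3 sentences, "
        f"then rate its relevance to {specialty} on a 0-100 scale.\n\n"
        f"Abstract:\n{abstract}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the study used ChatGPT-era models
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic output for easier review
    )
    summary = response.choices[0].message.content
    # Report compression, mirroring the ~70% reduction in character count reported above.
    compression = 1 - len(summary) / len(abstract)
    return {"summary": summary, "compression": compression}
```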
2.
Ann Fam Med; 21(Suppl 1), 2023 Jan 1.
Article in English | MEDLINE | ID: mdl-36972528

ABSTRACT

Context: Antibiotics for suspected urinary tract infection (UTI) are appropriate only when an infection is present. Urine culture is definitive but takes >1 day to return results. A machine learning urine culture predictor was recently devised for Emergency Department (ED) patients but requires urine microscopy (the "NeedMicro" predictor), which is not routinely available in primary care (PC).

Objective: To adapt this predictor to use only features available in primary care and to determine whether predictive accuracy generalizes to the primary care setting. We call this the "NoMicro" predictor.

Study Design and Analysis: Multicenter, retrospective, observational, cross-sectional analysis. Machine learning predictors were trained using extreme gradient boosting, artificial neural networks, and random forests. Models were trained on the ED dataset and evaluated on both the ED dataset (internal validation) and the PC dataset (external validation).

Setting: Emergency department and family medicine clinic at United States (US) academic medical centers.

Population Studied: 80,387 (ED, previously described) and 472 (PC, newly curated) US adults.

Instrument: Physicians performed retrospective chart review. The primary outcome extracted was pathogenic urine culture growing ≥100,000 colony forming units. Predictor variables included age; gender; dipstick urinalysis nitrites, leukocytes, clarity, glucose, protein, and blood; dysuria; abdominal pain; and history of UTI.

Outcome Measures: Predictor overall discriminative performance (receiver operating characteristic area under the curve, ROC-AUC), performance statistics (e.g., sensitivity and negative predictive value), and calibration.

Results: The NoMicro model performed similarly to the NeedMicro model in internal validation on the ED dataset: NoMicro ROC-AUC 0.862 (95% CI: 0.856-0.869) vs. NeedMicro 0.877 (95% CI: 0.871-0.884). External validation on the primary care dataset also yielded high performance (NoMicro ROC-AUC 0.850 [95% CI: 0.808-0.889]), despite the model being trained on Emergency Department data. Simulation of a hypothetical, retrospective clinical trial suggests the NoMicro model could be used to avoid antibiotic overuse by safely withholding antibiotics from low-risk patients.

Conclusions: The hypothesis that the NoMicro predictor generalizes to both PC and ED contexts is supported. Prospective trials to determine the real-world impact of using the NoMicro model to reduce antibiotic overuse are appropriate.


Subject(s)
Urinalysis; Urinary Tract Infections; Adult; Humans; Urinary Tract Infections/diagnosis; Urinary Tract Infections/drug therapy; Retrospective Studies; Prospective Studies; Cross-Sectional Studies; Microscopy; Anti-Bacterial Agents/therapeutic use; Machine Learning; Emergency Service, Hospital; Primary Health Care; Urine
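
As a companion to the study design described above (train on ED data, validate internally on held-out ED data and externally on the PC data set), the following is a minimal, hypothetical sketch using scikit-learn and XGBoost. File names, column names, and hyperparameters are illustrative assumptions, not the authors' code; it also assumes all features are already numerically encoded (eg, gender as 0/1).

```python
# Hypothetical sketch of the internal/external validation design described above:
# train on ED data, evaluate on a held-out ED split and on the external PC data set.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

FEATURES = ["age", "gender", "nitrites", "leukocytes", "clarity", "glucose",
            "protein", "blood", "dysuria", "abdominal_pain", "history_of_uti"]
OUTCOME = "positive_culture"  # pathogenic culture growing >= 100,000 CFU

ed = pd.read_csv("ed_dataset.csv")  # hypothetical file names
pc = pd.read_csv("pc_dataset.csv")

# Hold out part of the ED data for internal validation.
X_train, X_test, y_train, y_test = train_test_split(
    ed[FEATURES], ed[OUTCOME], test_size=0.2, stratify=ed[OUTCOME], random_state=0)

models = {
    "xgboost": XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss"),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    internal = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    external = roc_auc_score(pc[OUTCOME], model.predict_proba(pc[FEATURES])[:, 1])
    print(f"{name}: internal (ED) ROC-AUC={internal:.3f}, "
          f"external (PC) ROC-AUC={external:.3f}")
```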
3.
Ann Fam Med; 21(1): 11-18, 2023.
Article in English | MEDLINE | ID: mdl-36690486

ABSTRACT

BACKGROUND: Urinary tract infection (UTI) symptoms are common in primary care, but antibiotics are appropriate only when an infection is present. Urine culture is the reference standard test for infection, but results take >1 day. A machine learning predictor of urine cultures showed high accuracy for an emergency department (ED) population but required urine microscopy features that are not routinely available in primary care (the NeedMicro classifier).

METHODS: We redesigned a classifier (NoMicro) that does not depend on urine microscopy and retrospectively validated it internally (ED data set) and externally (on a newly curated primary care [PC] data set) using a multicenter approach including 80,387 (ED) and 472 (PC) adults. We constructed machine learning models using extreme gradient boosting (XGBoost), artificial neural networks, and random forests (RFs). The primary outcome was pathogenic urine culture growing ≥100,000 colony forming units. Predictor variables included age; gender; dipstick urinalysis nitrites, leukocytes, clarity, glucose, protein, and blood; dysuria; abdominal pain; and history of UTI.

RESULTS: Removal of microscopy features did not severely compromise performance under internal validation: NoMicro/XGBoost receiver operating characteristic area under the curve (ROC-AUC) 0.86 (95% CI, 0.86-0.87) vs NeedMicro 0.88 (95% CI, 0.87-0.88). Excellent performance in external (PC) validation was also observed: NoMicro/RF ROC-AUC 0.85 (95% CI, 0.81-0.89). Retrospective simulation suggested that NoMicro/RF can be used to safely withhold antibiotics for low-risk patients, thereby avoiding antibiotic overuse.

CONCLUSIONS: The NoMicro classifier appears appropriate for PC. Prospective trials to adjudicate the balance of benefits and harms of using the NoMicro classifier are appropriate.


Subject(s)
Urinalysis; Urinary Tract Infections; Adult; Humans; Retrospective Studies; Prospective Studies; Microscopy; Urinary Tract Infections/diagnosis; Anti-Bacterial Agents; Machine Learning; Primary Health Care/methods
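
The retrospective simulation of safely withholding antibiotics from low-risk patients, mentioned in the record above, can be illustrated with a short threshold analysis. The sketch below is a hedged illustration of that kind of analysis, not the authors' protocol: the risk threshold and variable names are hypothetical assumptions.

```python
# Illustrative threshold analysis: withhold antibiotics when predicted risk is
# below a chosen threshold, then summarize safety and antibiotic-sparing metrics.
import numpy as np
from sklearn.metrics import confusion_matrix


def simulate_withholding(y_true: np.ndarray, risk: np.ndarray, threshold: float = 0.1) -> dict:
    """Treat patients with predicted risk >= threshold; withhold otherwise."""
    treat = risk >= threshold
    tn, fp, fn, tp = confusion_matrix(y_true, treat).ravel()
    return {
        "sensitivity": tp / (tp + fn),            # true infections still treated
        "npv": tn / (tn + fn),                    # withheld patients who were truly negative
        "antibiotics_withheld": (~treat).mean(),  # share of cohort spared antibiotics
        "missed_infections": fn / len(y_true),    # infections falling in the withheld group
    }

# Example usage with the random-forest model from the previous sketch (hypothetical):
# stats = simulate_withholding(pc[OUTCOME].to_numpy(),
#                              models["random_forest"].predict_proba(pc[FEATURES])[:, 1])
```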