Results 1 - 7 of 7
1.
BMC Med Inform Decis Mak ; 24(1): 51, 2024 Feb 14.
Article in English | MEDLINE | ID: mdl-38355486

ABSTRACT

BACKGROUND: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for seven selected outcomes at three institutions. Secondary objectives were to describe the agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. METHODS: This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate and severe) based on test result and one diagnosis-based label. The proportion of admissions with a positive label was presented for each outcome, stratified by cohort. Using lab-based labels as the gold standard, agreement (Cohen's Kappa), sensitivity and specificity were calculated for each lab-based severity level. RESULTS: The number of admissions included was: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratios (99.9% confidence interval) for the abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's Kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds.
CONCLUSIONS: Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.
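The evaluation described above (diagnosis-based labels scored against lab-based labels as the gold standard) can be sketched as follows; the data here are hypothetical, and this is a minimal illustration of the metrics, not the study's code:

```python
# Agreement of a diagnosis-based label against a lab-based gold-standard
# label: Cohen's Kappa, sensitivity, and specificity (toy example).
def kappa_sensitivity_specificity(lab, dx):
    """lab, dx: equal-length lists of 0/1 labels; lab is the gold standard."""
    tp = sum(1 for l, d in zip(lab, dx) if l == 1 and d == 1)
    tn = sum(1 for l, d in zip(lab, dx) if l == 0 and d == 0)
    fp = sum(1 for l, d in zip(lab, dx) if l == 0 and d == 1)
    fn = sum(1 for l, d in zip(lab, dx) if l == 1 and d == 0)
    n = tp + tn + fp + fn
    po = (tp + tn) / n  # observed agreement
    # chance agreement from marginal label frequencies
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe) if pe != 1 else 1.0
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return kappa, sens, spec
```

In the study, this comparison was run separately for each outcome and each lab-based severity level (abnormal, mild, moderate, severe).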


Subject(s)
Hyperkalemia , Neutropenia , Humans , Delivery of Health Care , Machine Learning , Sensitivity and Specificity
2.
J Am Med Inform Assoc ; 30(12): 2004-2011, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37639620

ABSTRACT

OBJECTIVE: Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. MATERIALS AND METHODS: This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older, while pediatric inpatients were older than 28 days and younger than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. The primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. The primary outcome was mean area-under-the-receiver-operating-characteristic-curve (AUROC) for 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. RESULTS: When evaluated in pediatric inpatients, mean AUROC of the self-supervised model trained in adult inpatients (0.902) was noninferior to count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI = 0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). CONCLUSIONS: Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.
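The count-based baseline named above turns each admission's coded events into a vector of code counts before fitting logistic regression. A minimal sketch of that featurization step, with hypothetical codes and a hypothetical vocabulary (the study's actual feature pipeline is richer):

```python
from collections import Counter

def count_features(admissions, vocab):
    """Bag-of-codes featurization for a count-based baseline.

    admissions: list of lists of medical codes (one list per admission).
    vocab: ordered list of codes defining the feature columns.
    Returns one count vector per admission, aligned to vocab.
    """
    index = {code: i for i, code in enumerate(vocab)}
    rows = []
    for codes in admissions:
        vec = [0] * len(vocab)
        for code, n in Counter(codes).items():
            if code in index:  # codes outside the vocabulary are dropped
                vec[index[code]] = n
        rows.append(vec)
    return rows
```

These fixed-length vectors can then be fed to any off-the-shelf logistic regression implementation; the self-supervised comparator instead learns its representation from raw event sequences.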


Subject(s)
Inpatients , Machine Learning , Humans , Adult , Child , Retrospective Studies , Supervised Machine Learning , Electronic Health Records
3.
Sci Rep ; 13(1): 3767, 2023 03 07.
Article in English | MEDLINE | ID: mdl-36882576

ABSTRACT

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer- and recurrent unit-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for the transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
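The headline comparison above rests on AUROC measured in ID versus OOD year groups and the relative decay between them. A small self-contained sketch of both quantities (rank-based AUROC; data hypothetical, not from the study):

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count half); equivalent to the Mann-Whitney statistic."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def relative_decay(id_auroc, ood_auroc):
    """Fractional drop in discrimination from ID to OOD evaluation,
    e.g. 0.03 corresponds to the 3% decay reported for the transformer."""
    return (id_auroc - ood_auroc) / id_auroc
```

Comparing `relative_decay` across models (foundation-model representations vs. count-LR) on the same OOD year group is the kind of robustness contrast the abstract summarizes.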


Subject(s)
Electric Power Supplies , Electronic Health Records , Humans , Hospital Mortality , Hospitalization
4.
Methods Inf Med ; 62(1-02): 60-70, 2023 05.
Article in English | MEDLINE | ID: mdl-36812932

ABSTRACT

BACKGROUND: Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS: Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS: The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. 
CONCLUSIONS: While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.
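The design above hinges on assigning admissions to pre-determined year groups and splitting them into ID (2008-2010) and OOD sets. An illustrative sketch of that bookkeeping, with hypothetical admission records (the MIMIC-IV pipeline itself is more involved):

```python
from datetime import date

# Year groups from the study design; the ID group is the earliest one.
YEAR_GROUPS = [(2008, 2010), (2011, 2013), (2014, 2016), (2017, 2019)]

def assign_year_group(admit_date):
    """Return the (lo, hi) year group containing admit_date, or None."""
    for lo, hi in YEAR_GROUPS:
        if lo <= admit_date.year <= hi:
            return (lo, hi)
    return None

def split_id_ood(admissions, id_group=(2008, 2010)):
    """Partition (admit_date, record) pairs into ID and OOD sets;
    admissions outside every year group are discarded."""
    id_set, ood_set = [], []
    for admit_date, record in admissions:
        group = assign_year_group(admit_date)
        if group == id_group:
            id_set.append(record)
        elif group is not None:
            ood_set.append(record)
    return id_set, ood_set
```

Feature selection and baseline training then use only the ID set, while each later year group serves as a separate OOD evaluation (or oracle retraining) set.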


Subject(s)
Clinical Medicine , Sepsis , Female , Pregnancy , Humans , Hospital Mortality , Length of Stay , Machine Learning
5.
Jt Comm J Qual Patient Saf ; 48(3): 131-138, 2022 03.
Article in English | MEDLINE | ID: mdl-34866024

ABSTRACT

BACKGROUND: Hospital-acquired pressure injuries (HAPIs) cause patient harm and increase health care costs. We sought to evaluate how the performance of the Braden QD Scale changed alongside changes in HAPI incidence. METHODS: Using electronic health records data from a quaternary children's hospital, we evaluated the association between Braden QD scores and patient risk of HAPI. We analyzed how this relationship changed during a hospital-wide HAPI reduction quality improvement initiative. RESULTS: Of 23,532 unique patients, 108 (0.46%, 95% confidence interval [CI] = 0.38%-0.55%) experienced a HAPI. Every 1-point increase in the Braden QD score was associated with a 41% increase in the patient's odds of developing a HAPI (odds ratio [OR] = 1.41, 95% CI = 1.36-1.46, p < 0.001). HAPI incidence declined significantly following implementation of the HAPI reduction initiative (β = -0.09, 95% CI = -0.11 to -0.07, p < 0.001), as did Braden QD positive predictive value (β = -0.29, 95% CI = -0.44 to -0.14, p < 0.001) and specificity (β = -0.28, 95% CI = -0.43 to -0.14, p < 0.001), while sensitivity (β = 0.93, 95% CI = 0.30-1.75, p = 0.01) and the concordance statistic (β = 0.18, 95% CI = 0.15-0.21, p < 0.001) increased significantly. CONCLUSION: Decreases in HAPI incidence following a quality improvement initiative were associated with (1) significant deterioration in threshold-dependent performance measures such as specificity and precision and (2) significant improvements in threshold-independent performance measures such as the concordance statistic. The performance of the Braden QD Scale is more stable as a tool that continuously measures risk than as a prediction tool.
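The reported OR of 1.41 per 1-point score increase compounds multiplicatively, which is easy to misread as additive. A purely arithmetic illustration (the numbers plugged in below are the abstract's point estimate; the functions themselves are generic):

```python
def odds_multiplier(or_per_point, points):
    """Odds ratio implied by a `points`-point score difference when the
    per-point OR is constant, as in a logistic model: OR ** points."""
    return or_per_point ** points

def prob_from_odds(odds):
    """Convert odds to probability: odds / (1 + odds)."""
    return odds / (1 + odds)
```

For example, `odds_multiplier(1.41, 3)` gives roughly a 2.8-fold increase in odds for a 3-point difference, not 3 x 41%.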


Subject(s)
Pressure Ulcer , Child , Humans , Incidence , Pressure Ulcer/epidemiology , Pressure Ulcer/prevention & control , Quality Improvement , Retrospective Studies , Risk Assessment , Risk Factors
6.
Appl Clin Inform ; 12(4): 808-815, 2021 08.
Article in English | MEDLINE | ID: mdl-34470057

ABSTRACT

OBJECTIVE: The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS: Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS: Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n = 11) than discrimination deterioration (n = 3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n = 15) were more common than feature-level approaches (n = 2), with the most common approaches being model refitting (n = 12), probability calibration (n = 7), model updating (n = 6), and model selection (n = 6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION: There was limited research in preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.
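Probability calibration was among the most common mitigation strategies the review found. As a hedged sketch only (not any reviewed study's method), the simplest variant refits a single intercept shift on recent data so that predicted probabilities match observed outcome rates again:

```python
import math

def recalibrate_intercept(probs, labels, lr=0.1, steps=2000):
    """Fit a scalar delta by gradient descent on log-loss so that
    sigmoid(logit(p) + delta) matches the observed labels.
    probs: predicted probabilities in (0, 1); labels: 0/1 outcomes."""
    def logit(p):
        return math.log(p / (1 - p))
    delta = 0.0
    for _ in range(steps):
        # mean gradient of log-loss w.r.t. the intercept shift
        grad = sum(1 / (1 + math.exp(-(logit(p) + delta))) - y
                   for p, y in zip(probs, labels)) / len(probs)
        delta -= lr * grad
    return delta
```

This adjusts calibration without touching the model's ranking of patients, which is consistent with the review's observation that mitigation strategies preserved calibration more reliably than discrimination.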


Subject(s)
Clinical Medicine , Machine Learning , Clinical Decision-Making , Cognition
7.
J Med Internet Res ; 21(4): e13822, 2019 04 24.
Article in English | MEDLINE | ID: mdl-31017583

ABSTRACT

BACKGROUND: Autism spectrum disorder (ASD) is currently diagnosed using qualitative methods that measure between 20 and 100 behaviors, can span multiple appointments with trained clinicians, and take several hours to complete. In our previous work, we demonstrated the efficacy of machine learning classifiers to accelerate the process by collecting home videos of US-based children, identifying a reduced subset of behavioral features scored by untrained raters, and using a machine learning classifier to determine children's "risk scores" for autism. We achieved an accuracy of 92% (95% CI 88%-97%) on US videos using a classifier built on five features. OBJECTIVE: Using videos of Bangladeshi children collected from Dhaka Shishu Children's Hospital, we aim to scale our pipeline to another culture and other developmental delays, including speech and language conditions. METHODS: Although our previously published and validated pipeline and set of classifiers perform reasonably well on Bangladeshi videos (75% accuracy, 95% CI 71%-78%), this work improves on that accuracy through the development and application of a powerful new technique for adaptive aggregation of crowdsourced labels. We enhance both the utility and performance of our model by building two classification layers: the first layer distinguishes between typical and atypical behavior, and the second layer distinguishes between ASD and non-ASD. In each of the layers, we use a unique rater weighting scheme to aggregate classification scores from different raters based on their expertise. We also determine Shapley values for the most important features in the classifier to understand how the classifiers' decisions align with clinical intuition.
RESULTS: Using these techniques, we achieved an accuracy (area under the curve [AUC]) of 76% (SD 3%) and sensitivity of 76% (SD 4%) for identifying atypical children from among developmentally delayed children, and an accuracy (AUC) of 85% (SD 5%) and sensitivity of 76% (SD 6%) for identifying children with ASD from those predicted to have other developmental delays. CONCLUSIONS: These results show promise for using a mobile video-based and machine learning-directed approach for early and remote detection of autism in Bangladeshi children. This strategy could provide important resources for developmental health in developing countries with few clinical resources for diagnosis, helping children get access to care at an early age. Future research aimed at extending the application of this approach to identify a range of other conditions and determine the population-level burden of developmental disabilities and impairments will be of high value.
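The two-layer design and expertise-weighted aggregation described above can be sketched schematically; the thresholds and the simple weighted mean below are illustrative placeholders, not the paper's actual weighting scheme:

```python
def weighted_score(scores, weights):
    """Aggregate per-rater classification scores, weighting each rater
    by an expertise weight (higher weight = more trusted rater)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def two_layer_decision(p_atypical, p_asd, t1=0.5, t2=0.5):
    """Layer 1: typical vs. atypical behavior.
    Layer 2 (reached only if atypical): ASD vs. other developmental delay."""
    if p_atypical < t1:
        return "typical"
    return "ASD" if p_asd >= t2 else "other developmental delay"
```

In this arrangement the layer-2 classifier only ever sees children the first layer flagged as atypical, matching the abstract's description of identifying ASD from among those predicted to have developmental delays.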


Subject(s)
Autism Spectrum Disorder/diagnosis , Developmental Disabilities/diagnosis , Machine Learning/standards , Video Recording/methods , Bangladesh , Child , Child, Preschool , Female , Humans , Male , Validation Studies as Topic