Search | VHL Regional Portal

1.

A multi-center study on the adaptability of a shared foundation model for electronic health records.

Guo, Lin Lawrence; Fries, Jason; Steinberg, Ethan; Fleming, Scott Lanyon; Morse, Keith; Aftandilian, Catherine; Posada, Jose; Shah, Nigam; Sung, Lillian.

NPJ Digit Med ; 7(1): 171, 2024 Jun 27.

Article in English | MEDLINE | ID: mdl-38937550

ABSTRACT

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable for various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across hospitals and their performance in local tasks. This multi-center study examined the adaptability of a publicly accessible structured EHR foundation model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data from The Hospital for Sick Children (SickKids) and Medical Information Mart for Intensive Care (MIMIC-IV). We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of locally training models from scratch, including a local foundation model. Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSM matched the performance of gradient boosting machines (GBM) locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. Continued pretraining on local data showed FMSM required fewer than 1% of training examples to match the fully trained GBM's performance, and was 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.

2.

Exploring the Potential of Large Language Models in Neurology, Using Neurologic Localization as an Example.

Chiang, Chia-Chun; Fries, Jason A.

Neurol Clin Pract ; 14(3): e200311, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38601858

3.

Scalable Approach to Consumer Wearable Postmarket Surveillance: Development and Validation Study.

Yoo, Richard M; Viggiano, Ben T; Pundi, Krishna N; Fries, Jason A; Zahedivash, Aydin; Podchiyska, Tanya; Din, Natasha; Shah, Nigam H.

JMIR Med Inform ; 12: e51171, 2024 Apr 04.

Article in English | MEDLINE | ID: mdl-38596848

ABSTRACT

Background: With the capability to render prediagnoses, consumer wearables have the potential to affect subsequent diagnoses and the level of care in the health care delivery setting. Despite this, postmarket surveillance of consumer wearables has been hindered by the lack of codified terms in electronic health records (EHRs) to capture wearable use. Objective: We sought to develop a weak supervision-based approach to demonstrate the feasibility and efficacy of EHR-based postmarket surveillance on consumer wearables that render atrial fibrillation (AF) prediagnoses. Methods: We applied data programming, where labeling heuristics are expressed as code-based labeling functions, to detect incidents of AF prediagnoses. A labeler model was then derived from the predictions of the labeling functions using the Snorkel framework. The labeler model was applied to clinical notes to probabilistically label them, and the labeled notes were then used as a training set to fine-tune a classifier called Clinical-Longformer. The resulting classifier identified patients with an AF prediagnosis. A retrospective cohort study was conducted, where the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against those who did not receive a prediagnosis. Results: The labeler model derived from the labeling functions showed high accuracy (0.92; F1-score=0.77) on the training set. The classifier trained on the probabilistically labeled notes accurately identified patients with an AF prediagnosis (0.95; F1-score=0.83). The cohort study conducted using the constructed system carried enough statistical power to verify the key findings of the Apple Heart Study, which enrolled a much larger number of participants, where patients who received a prediagnosis tended to be older, male, and White with higher CHA2DS2-VASc (congestive heart failure, hypertension, age ≥75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category) scores (P<.001). We also made a novel discovery that patients with a prediagnosis were more likely to use anticoagulants (525/1037, 50.63% vs 5936/16,560, 35.85%) and have an eventual AF diagnosis (305/1037, 29.41% vs 262/16,560, 1.58%). At the index diagnosis, the existence of a prediagnosis did not distinguish patients based on clinical characteristics, but did correlate with anticoagulant prescription (P=.004 for apixaban and P=.01 for rivaroxaban). Conclusions: Our work establishes the feasibility and efficacy of an EHR-based surveillance system for consumer wearables that render AF prediagnoses. Further work is necessary to generalize these findings for patient populations at other sites.

4.

Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare.

Guo, Lin Lawrence; Morse, Keith E; Aftandilian, Catherine; Steinberg, Ethan; Fries, Jason; Posada, Jose; Fleming, Scott Lanyon; Lemmon, Joshua; Jessa, Karim; Shah, Nigam; Sung, Lillian.

BMC Med Inform Decis Mak ; 24(1): 51, 2024 Feb 14.

Article in English | MEDLINE | ID: mdl-38355486

ABSTRACT

BACKGROUND: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. METHODS: This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate and severe) based on test result and one diagnosis-based label. Proportion of admissions with a positive label were presented for each outcome stratified by cohort. Using lab-based labels as the gold standard, agreement using Cohen's Kappa, sensitivity and specificity were calculated for each lab-based severity level. RESULTS: The number of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratio (99.9% confidence interval) for abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's Kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds. CONCLUSIONS: Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.

Subject(s)

Hyperkalemia , Neutropenia , Humans , Delivery of Health Care , Machine Learning , Sensitivity and Specificity

5.

Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks.

Lemmon, Joshua; Guo, Lin Lawrence; Steinberg, Ethan; Morse, Keith E; Fleming, Scott Lanyon; Aftandilian, Catherine; Pfohl, Stephen R; Posada, Jose D; Shah, Nigam; Fries, Jason; Sung, Lillian.

J Am Med Inform Assoc ; 30(12): 2004-2011, 2023 11 17.

Article in English | MEDLINE | ID: mdl-37639620

ABSTRACT

OBJECTIVE: Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. MATERIALS AND METHODS: This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older while pediatric inpatients were more than 28 days and less than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. Primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. Primary outcome was mean area-under-the-receiver-operating-characteristic-curve (AUROC) for 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. RESULTS: When evaluated in pediatric inpatients, mean AUROC of self-supervised model trained in adult inpatients (0.902) was noninferior to count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI=0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). CONCLUSIONS: Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.

Subject(s)

Inpatients , Machine Learning , Humans , Adult , Child , Retrospective Studies , Supervised Machine Learning , Electronic Health Records

6.

The Stanford Medicine data science ecosystem for clinical and translational research.

Callahan, Alison; Ashley, Euan; Datta, Somalee; Desai, Priyamvada; Ferris, Todd A; Fries, Jason A; Halaas, Michael; Langlotz, Curtis P; Mackey, Sean; Posada, José D; Pfeffer, Michael A; Shah, Nigam H.

JAMIA Open ; 6(3): ooad054, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37545984

ABSTRACT

Objective: To describe the infrastructure, tools, and services developed at Stanford Medicine to maintain its data science ecosystem and research patient data repository for clinical and translational research. Materials and Methods: The data science ecosystem, dubbed the Stanford Data Science Resources (SDSR), includes infrastructure and tools to create, search, retrieve, and analyze patient data, as well as services for data deidentification, linkage, and processing to extract high-value information from healthcare IT systems. Data are made available via self-service and concierge access, on HIPAA compliant secure computing infrastructure supported by in-depth user training. Results: The Stanford Medicine Research Data Repository (STARR) functions as the SDSR data integration point, and includes electronic medical records, clinical images, text, bedside monitoring data and HL7 messages. SDSR tools include tools for electronic phenotyping, cohort building, and a search engine for patient timelines. The SDSR supports patient data collection, reproducible research, and teaching using healthcare data, and facilitates industry collaborations and large-scale observational studies. Discussion: Research patient data repositories and their underlying data science infrastructure are essential to realizing a learning health system and advancing the mission of academic medical centers. Challenges to maintaining the SDSR include ensuring sufficient financial support while providing researchers and clinicians with maximal access to data and digital infrastructure, balancing tool development with user training, and supporting the diverse needs of users. Conclusion: Our experience maintaining the SDSR offers a case study for academic medical centers developing data science and research informatics infrastructure.

7.

The shaky foundations of large language models and foundation models for electronic health records.

Wornow, Michael; Xu, Yizhe; Thapa, Rahul; Patel, Birju; Steinberg, Ethan; Fleming, Scott; Pfeffer, Michael A; Fries, Jason; Shah, Nigam H.

NPJ Digit Med ; 6(1): 135, 2023 Jul 29.

Article in English | MEDLINE | ID: mdl-37516790

ABSTRACT

The success of foundation models such as ChatGPT and AlphaFold has spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. In this narrative review, we examine 84 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. Considering these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.

8.

EHR foundation models improve robustness in the presence of temporal distribution shift.

Guo, Lin Lawrence; Steinberg, Ethan; Fleming, Scott Lanyon; Posada, Jose; Lemmon, Joshua; Pfohl, Stephen R; Shah, Nigam; Fries, Jason; Sung, Lillian.

Sci Rep ; 13(1): 3767, 2023 03 07.

Article in English | MEDLINE | ID: mdl-36882576

ABSTRACT

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.

Subject(s)

Electric Power Supplies , Electronic Health Records , Humans , Hospital Mortality , Hospitalization

9.

Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine.

Lemmon, Joshua; Guo, Lin Lawrence; Posada, Jose; Pfohl, Stephen R; Fries, Jason; Fleming, Scott Lanyon; Aftandilian, Catherine; Shah, Nigam; Sung, Lillian.

Methods Inf Med ; 62(1-02): 60-70, 2023 05.

Article in English | MEDLINE | ID: mdl-36812932

ABSTRACT

BACKGROUND: Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS: Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS: The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. CONCLUSIONS: While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.

Subject(s)

Clinical Medicine , Sepsis , Female , Pregnancy , Humans , Hospital Mortality , Length of Stay , Machine Learning

10.

Investigating real-world consequences of biases in commonly used clinical calculators.

Yoo, Richard M; Dash, Dev; Lu, Jonathan H; Genkins, Julian Z; Rabbani, Naveed; Fries, Jason A; Shah, Nigam H.

Am J Manag Care ; 29(1): e1-e7, 2023 01 01.

Article in English | MEDLINE | ID: mdl-36716157

ABSTRACT

OBJECTIVES: To evaluate whether one summary metric of calculator performance sufficiently conveys equity across different demographic subgroups, as well as to evaluate how calculator predictive performance affects downstream health outcomes. STUDY DESIGN: We evaluate 3 commonly used clinical calculators-Model for End-Stage Liver Disease (MELD), CHA2DS2-VASc, and simplified Pulmonary Embolism Severity Index (sPESI)-on the cohort extracted from the Stanford Medicine Research Data Repository, following the cohort selection process as described in respective calculator derivation papers. METHODS: We quantified the predictive performance of the 3 clinical calculators across sex and race. Then, using the clinical guidelines that guide care based on these calculators' output, we quantified potential disparities in subsequent health outcomes. RESULTS: Across the examined subgroups, the MELD calculator exhibited worse performance for female and White populations, CHA2DS2-VASc calculator for the male population, and sPESI for the Black population. The extent to which such performance differences translated into differential health outcomes depended on the distribution of the calculators' scores around the thresholds used to trigger a care action via the corresponding guidelines. In particular, under the old guideline for CHA2DS2-VASc, among those who would not have been offered anticoagulant therapy, the Hispanic subgroup exhibited the highest rate of stroke. CONCLUSIONS: Clinical calculators, even when they do not include variables such as sex and race as inputs, can have very different care consequences across those subgroups. These differences in health care outcomes across subgroups can be explained by examining the distribution of scores and their calibration around the thresholds encoded in the accompanying care guidelines.

Subject(s)

Atrial Fibrillation , End Stage Liver Disease , Stroke , Humans , Male , Female , Risk Assessment , Severity of Illness Index , Anticoagulants/therapeutic use , Bias , Risk Factors , Atrial Fibrillation/complications , Atrial Fibrillation/drug therapy

11.

Perspective Toward Machine Learning Implementation in Pediatric Medicine: Mixed Methods Study.

Alexander, Natasha; Aftandilian, Catherine; Guo, Lin Lawrence; Plenert, Erin; Posada, Jose; Fries, Jason; Fleming, Scott; Johnson, Alistair; Shah, Nigam; Sung, Lillian.

JMIR Med Inform ; 10(11): e40039, 2022 Nov 17.

Article in English | MEDLINE | ID: mdl-36394938

ABSTRACT

BACKGROUND: Given the costs of machine learning implementation, a systematic approach to prioritizing which models to implement into clinical practice may be valuable. OBJECTIVE: The primary objective was to determine the health care attributes respondents at 2 pediatric institutions rate as important when prioritizing machine learning model implementation. The secondary objective was to describe their perspectives on implementation using a qualitative approach. METHODS: In this mixed methods study, we distributed a survey to health system leaders, physicians, and data scientists at 2 pediatric institutions. We asked respondents to rank the following 5 attributes in terms of implementation usefulness: the clinical problem was common, the clinical problem caused substantial morbidity and mortality, risk stratification led to different actions that could reasonably improve patient outcomes, reducing physician workload, and saving money. Important attributes were those ranked as first or second most important. Individual qualitative interviews were conducted with a subsample of respondents. RESULTS: Among 613 eligible respondents, 275 (44.9%) responded. Qualitative interviews were conducted with 17 respondents. The most common important attributes were risk stratification leading to different actions (205/275, 74.5%) and clinical problem causing substantial morbidity or mortality (177/275, 64.4%). The attributes considered least important were reducing physician workload and saving money. Qualitative interviews consistently prioritized implementations that improved patient outcomes. CONCLUSIONS: Respondents prioritized machine learning model implementation where risk stratification would lead to different actions and clinical problems that caused substantial morbidity and mortality. Implementations that improved patient outcomes were prioritized. These results can help provide a framework for machine learning model implementation.

12.

Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine.

Guo, Lin Lawrence; Pfohl, Stephen R; Fries, Jason; Johnson, Alistair E W; Posada, Jose; Aftandilian, Catherine; Shah, Nigam; Sung, Lillian.

Sci Rep ; 12(1): 2726, 2022 02 17.

Article in English | MEDLINE | ID: mdl-35177653

ABSTRACT

Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and benchmark DG and UDA algorithms on improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year groups (2008-2010, 2011-2013, 2014-2016 and 2017-2019). Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008-2010 (ERM[08-10]) and evaluated them on subsequent year groups. DG experiment trained models using algorithms that estimated invariant properties using 2008-2016 and evaluated them on 2017-2019. UDA experiment leveraged unlabelled samples from 2017 to 2019 for unsupervised distribution matching. DG and UDA models were compared to ERM[08-16] models trained using 2008-2016. Main performance measures were area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve and absolute calibration error. Threshold-based metrics including false-positives and false-negatives were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080-0.101). Considering a scenario of 100 consecutively admitted patients showed that ERM[08-10] applied to 2017-2019 was associated with one additional false-negative among 11 patients with sepsis, when compared to the model applied to 2008-2010. When compared with ERM[08-16], DG and UDA experiments failed to produce more robust models (range of AUROC difference, - 0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.

Subject(s)

Databases, Factual , Intensive Care Units , Length of Stay , Models, Biological , Neural Networks, Computer , Sepsis , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Sepsis/mortality , Sepsis/therapy

13.

A computational approach to measure the linguistic characteristics of psychotherapy timing, responsiveness, and consistency.

Miner, Adam S; Fleming, Scott L; Haque, Albert; Fries, Jason A; Althoff, Tim; Wilfley, Denise E; Agras, W Stewart; Milstein, Arnold; Hancock, Jeff; Asch, Steven M; Stirman, Shannon Wiltsey; Arnow, Bruce A; Shah, Nigam H.

Npj Ment Health Res ; 1(1): 19, 2022 Dec 02.

Article in English | MEDLINE | ID: mdl-38609510

ABSTRACT

Although individual psychotherapy is generally effective for a range of mental health conditions, little is known about the moment-to-moment language use of effective therapists. Increased access to computational power, coupled with a rise in computer-mediated communication (telehealth), makes feasible the large-scale analyses of language use during psychotherapy. Transparent methodological approaches are lacking, however. Here we present novel methods to increase the efficiency of efforts to examine language use in psychotherapy. We evaluate three important aspects of therapist language use - timing, responsiveness, and consistency - across five clinically relevant language domains: pronouns, time orientation, emotional polarity, therapist tactics, and paralinguistic style. We find therapist language is dynamic within sessions, responds to patient language, and relates to patient symptom diagnosis but not symptom severity. Our results demonstrate that analyzing therapist language at scale is feasible and may help answer longstanding questions about specific behaviors of effective therapists.

14.

Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine.

Guo, Lin Lawrence; Pfohl, Stephen R; Fries, Jason; Posada, Jose; Fleming, Scott Lanyon; Aftandilian, Catherine; Shah, Nigam; Sung, Lillian.

Appl Clin Inform ; 12(4): 808-815, 2021 08.

Article in English | MEDLINE | ID: mdl-34470057

ABSTRACT

OBJECTIVE: The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS: Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS: Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n = 11) than discrimination deterioration (n = 3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n = 15) were more common than feature-level approaches (n = 2), with the most common approaches being model refitting (n = 12), probability calibration (n = 7), model updating (n = 6), and model selection (n = 6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION: There was limited research in preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.

Subject(s)

Clinical Medicine , Machine Learning , Clinical Decision-Making , Cognition

15.

Ontology-driven weak supervision for clinical entity classification in electronic health records.

Fries, Jason A; Steinberg, Ethan; Khattar, Saelig; Fleming, Scott L; Posada, Jose; Callahan, Alison; Shah, Nigam H.

Nat Commun ; 12(1): 2017, 2021 04 01.

Article in English | MEDLINE | ID: mdl-33795682

ABSTRACT

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.

Subject(s)

COVID-19 , Data Curation/methods , Expert Systems , Machine Learning , Datasets as Topic , Electronic Health Records , Humans , Natural Language Processing , SARS-CoV-2

16.

Assessment of Extractability and Accuracy of Electronic Health Record Data for Joint Implant Registries.

Giori, Nicholas J; Radin, John; Callahan, Alison; Fries, Jason A; Halilaj, Eni; Ré, Christopher; Delp, Scott L; Shah, Nigam H; Harris, Alex H S.

JAMA Netw Open ; 4(3): e211728, 2021 03 01.

Article in English | MEDLINE | ID: mdl-33720372

ABSTRACT

Importance: Implant registries provide valuable information on the performance of implants in a real-world setting, yet they have traditionally been expensive to establish and maintain. Electronic health records (EHRs) are widely used and may include the information needed to generate clinically meaningful reports similar to a formal implant registry. Objectives: To quantify the extractability and accuracy of registry-relevant data from the EHR and to assess the ability of these data to track trends in implant use and the durability of implants (hereafter referred to as implant survivorship), using data stored since 2000 in the EHR of the largest integrated health care system in the United States. Design, Setting, and Participants: Retrospective cohort study of a large EHR of veterans who had 45â¯351 total hip arthroplasty procedures in Veterans Health Administration hospitals from 2000 to 2017. Data analysis was performed from January 1, 2000, to December 31, 2017. Exposures: Total hip arthroplasty. Main Outcomes and Measures: Number of total hip arthroplasty procedures extracted from the EHR, trends in implant use, and relative survivorship of implants. Results: A total of 45â¯351 total hip arthroplasty procedures were identified from 2000 to 2017 with 192â¯805 implant parts. Data completeness improved over the time. After 2014, 85% of prosthetic heads, 91% of shells, 81% of stems, and 85% of liners used in the Veterans Health Administration health care system were identified by part number. Revision burden and trends in metal vs ceramic prosthetic femoral head use were found to reflect data from the American Joint Replacement Registry. Recalled implants were obvious negative outliers in implant survivorship using Kaplan-Meier curves. Conclusions and Relevance: Although loss to follow-up remains a challenge that requires additional attention to improve the quantitative nature of calculated implant survivorship, we conclude that data collected during routine clinical care and stored in the EHR of a large health system over 18 years were sufficient to provide clinically meaningful data on trends in implant use and to identify poor implants that were subsequently recalled. This automated approach was low cost and had no reporting burden. This low-cost, low-overhead method to assess implant use and performance within a large health care setting may be useful to internal quality assurance programs and, on a larger scale, to postmarket surveillance of implant performance.

Subject(s)

Arthroplasty, Replacement, Hip/statistics & numerical data , Electronic Health Records/statistics & numerical data , Adult , Aged , Aged, 80 and over , Cohort Studies , Female , Humans , Male , Middle Aged , Registries , Reproducibility of Results , Retrospective Studies , Young Adult

17.

Language models are an effective representation learning technique for electronic health record data.

Steinberg, Ethan; Jung, Ken; Fries, Jason A; Corbin, Conor K; Pfohl, Stephen R; Shah, Nigam H.

J Biomed Inform ; 113: 103637, 2021 01.

Article in English | MEDLINE | ID: mdl-33290879

ABSTRACT

Widespread adoption of electronic health records (EHRs) has fueled the development of using machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired from techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.

Subject(s)

Electronic Health Records , Models, Statistical , Humans , Machine Learning , Natural Language Processing , Prognosis

18.

Cardiac Imaging of Aortic Valve Area From 34 287 UK Biobank Participants Reveals Novel Genetic Associations and Shared Genetic Comorbidity With Multiple Disease Phenotypes.

Córdova-Palomera, Aldo; Tcheandjieu, Catherine; Fries, Jason A; Varma, Paroma; Chen, Vincent S; Fiterau, Madalina; Xiao, Ke; Tejeda, Heliodoro; Keavney, Bernard D; Cordell, Heather J; Tanigawa, Yosuke; Venkataraman, Guhan; Rivas, Manuel A; Ré, Christopher; Ashley, Euan; Priest, James R.

Circ Genom Precis Med ; 13(6): e003014, 2020 12.

Article in English | MEDLINE | ID: mdl-33125279

ABSTRACT

BACKGROUND: The aortic valve is an important determinant of cardiovascular physiology and anatomic location of common human diseases. METHODS: From a sample of 34 287 white British ancestry participants, we estimated functional aortic valve area by planimetry from prospectively obtained cardiac magnetic resonance imaging sequences of the aortic valve. Aortic valve area measurements were submitted to genome-wide association testing, followed by polygenic risk scoring and phenome-wide screening, to identify genetic comorbidities. RESULTS: A genome-wide association study of aortic valve area in these UK Biobank participants showed 3 significant associations, indexed by rs71190365 (chr13:50764607, DLEU1, P=1.8×10-9), rs35991305 (chr12:94191968, CRADD, P=3.4×10-8), and chr17:45013271:C:T (GOSR2, P=5.6×10-8). Replication on an independent set of 8145 unrelated European ancestry participants showed consistent effect sizes in all 3 loci, although rs35991305 did not meet nominal significance. We constructed a polygenic risk score for aortic valve area, which in a separate cohort of 311 728 individuals without imaging demonstrated that smaller aortic valve area is predictive of increased risk for aortic valve disease (odds ratio, 1.14; P=2.3×10-6). After excluding subjects with a medical diagnosis of aortic valve stenosis (remaining n=308 683 individuals), phenome-wide association of >10 000 traits showed multiple links between the polygenic score for aortic valve disease and key health-related comorbidities involving the cardiovascular system and autoimmune disease. Genetic correlation analysis supports a shared genetic etiology with between aortic valve area and birth weight along with other cardiovascular conditions. CONCLUSIONS: These results illustrate the use of automated phenotyping of cardiac imaging data from the general population to investigate the genetic etiology of aortic valve disease, perform clinical prediction, and uncover new clinical and genetic correlates of cardiac anatomy.

Subject(s)

Aortic Valve/diagnostic imaging , Biological Specimen Banks , Cardiovascular Diseases/diagnostic imaging , Cardiovascular Diseases/genetics , Genome-Wide Association Study , Magnetic Resonance Imaging , Adult , Aged , Aortic Valve/pathology , Aortic Valve Stenosis/diagnostic imaging , Aortic Valve Stenosis/genetics , Comorbidity , Female , Genome, Human , Humans , Male , Middle Aged , Multifactorial Inheritance/genetics , Phenomics , Phenotype , Survival Analysis , United Kingdom

19.

Ontology-driven weak supervision for clinical entity classification in electronic health records.

Fries, Jason A; Steinberg, Ethan; Khattar, Saelig; Fleming, Scott L; Posada, Jose; Callahan, Alison; Shah, Nigam H.

ArXiv ; 2020 Aug 05.

Article in English | MEDLINE | ID: mdl-32793768

ABSTRACT

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.

20.

Estimating the efficacy of symptom-based screening for COVID-19.

Callahan, Alison; Steinberg, Ethan; Fries, Jason A; Gombar, Saurabh; Patel, Birju; Corbin, Conor K; Shah, Nigam H.

NPJ Digit Med ; 3: 95, 2020.

Article in English | MEDLINE | ID: mdl-32695885

ABSTRACT

There is substantial interest in using presenting symptoms to prioritize testing for COVID-19 and establish symptom-based surveillance. However, little is currently known about the specificity of COVID-19 symptoms. To assess the feasibility of symptom-based screening for COVID-19, we used data from tests for common respiratory viruses and SARS-CoV-2 in our health system to measure the ability to correctly classify virus test results based on presenting symptoms. Based on these results, symptom-based screening may not be an effective strategy to identify individuals who should be tested for SARS-CoV-2 infection or to obtain a leading indicator of new COVID-19 cases.

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL