Results 1 - 20 of 284
1.
Biostatistics; 24(3): 760-775, 2023 Jul 14.
Article in English | MEDLINE | ID: mdl-35166342

ABSTRACT

Leveraging large-scale electronic health record (EHR) data to estimate survival curves for clinical events can enable more powerful risk estimation and comparative effectiveness research. However, use of EHR data is hindered by a lack of direct event time observations. Occurrence times of relevant diagnostic codes or target disease mentions in clinical notes are at best a good approximation of the true disease onset time. On the other hand, extracting precise information on the exact event time requires laborious manual chart review and is sometimes altogether infeasible due to a lack of detailed documentation. Current status labels (binary indicators of phenotype status during follow-up) are significantly more efficient and feasible to compile, enabling more precise survival curve estimation given limited resources. Existing survival analysis methods using current status labels focus almost entirely on supervised estimation, and naive incorporation of unlabeled data into these methods may lead to biased estimates. In this article, we propose Semisupervised Calibration of Risk with Noisy Event Times (SCORNET), which yields a consistent and efficient survival function estimator by leveraging a small set of current status labels and a large set of informative features. In addition to providing theoretical justification of SCORNET, we demonstrate in both simulation and real-world EHR settings that SCORNET achieves efficiency akin to that of the parametric Weibull regression model while also exhibiting semi-nonparametric flexibility and relatively low empirical bias in a variety of generative settings.
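
As a concrete point of reference for the supervised building block this abstract starts from: the nonparametric maximum likelihood estimator of the event-time distribution from current-status data is simply the isotonic regression of the status indicators on the monitoring times. Below is a minimal Python sketch of that baseline on simulated data; it is not SCORNET itself (which additionally calibrates against a large set of unlabeled EHR features), and all data and values are synthetic:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Current-status data: each subject i has a single monitoring time c[i]
# and an indicator delta[i] = 1 if the event occurred by time c[i].
rng = np.random.default_rng(0)
n = 500
event_times = rng.weibull(1.5, n) * 5.0   # latent onset times (never observed)
c = rng.uniform(0.5, 8.0, n)              # follow-up (monitoring) times
delta = (event_times <= c).astype(float)  # current status at follow-up

# The NPMLE of the CDF F(t) from current-status data equals the isotonic
# regression of the status indicators on the monitoring times.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
F_hat = iso.fit(c, delta)

# Estimated survival curve S(t) = 1 - F(t) on a grid of times
grid = np.linspace(0.5, 8.0, 16)
S_hat = 1.0 - F_hat.predict(grid)
print(np.round(S_hat, 3))
```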


Subject(s)
Electronic Health Records , Humans , Calibration , Bias , Computer Simulation
2.
Bioinformatics; 39(2), 2023 Feb 03.
Article in English | MEDLINE | ID: mdl-36805623

ABSTRACT

MOTIVATION: Predicting molecule-disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule-molecule, molecule-disease and disease-disease semantic dependencies can potentially improve prediction performance. METHODS: We introduce a Multi-Modal REpresentation Mapping Approach to Predicting molecular-disease relations (M2REMAP) that incorporates clinical semantics learned from the electronic health records (EHR) of 12.6 million patients. Specifically, M2REMAP first learns a multimodal molecule representation that synthesizes chemical-property and clinical semantic information by mapping a molecule's chemical structure, via a deep neural network, onto the clinical semantic embedding space shared by drugs, diseases, and other common clinical concepts. To infer molecule-disease relations, M2REMAP combines the multimodal molecule representation with disease semantic embeddings to jointly infer indications and side effects. RESULTS: We extensively evaluate M2REMAP on molecule indications, side effects, and interactions. Results show that incorporating EHR embeddings improves performance significantly: for example, it improves on the baseline models by 23.6% in PRC-AUC for indications and by 23.9% for side effects. Further, M2REMAP overcomes a limitation of existing methods and effectively predicts drugs for novel diseases and emerging pathogens. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/celehs/M2REMAP, and prediction results are provided at https://shiny.parse-health.org/drugs-diseases-dev/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug-Related Side Effects and Adverse Reactions , Humans , Drug Development , Electronic Health Records , Neural Networks, Computer , Pharmacovigilance
3.
Med Care; 62(2): 102-108, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38079232

ABSTRACT

BACKGROUND: There is tremendous interest in evaluating surrogate markers given their potential to decrease study time, costs, and patient burden. OBJECTIVES: The purpose of this statistical workshop article is to describe and illustrate how to evaluate a surrogate marker of interest using the proportion of treatment effect (PTE) explained as a measure of the quality of the surrogate marker for: (1) a setting with a general, fully observed primary outcome (eg, biopsy score); and (2) a setting with a time-to-event primary outcome that may be censored due to study termination or early dropout (eg, time to diabetes). METHODS: The methods are motivated by 2 randomized trials, one among children with nonalcoholic fatty liver disease where the primary outcome was a change in biopsy score (general outcome) and another among adults at high risk for Type 2 diabetes where the primary outcome was time to diabetes (time-to-event outcome). The methods are illustrated using the Rsurrogate package, with detailed R code provided. RESULTS: In the biopsy score outcome setting, the estimated PTE of the examined surrogate marker was 0.182 (95% confidence interval [CI]: 0.121, 0.240), that is, the surrogate explained only 18.2% of the treatment effect on the biopsy score. In the diabetes setting, the estimated PTE of the surrogate marker was 0.596 (95% CI: 0.404, 0.760), that is, the surrogate explained 59.6% of the treatment effect on diabetes incidence. CONCLUSIONS: This statistical workshop provides tools that will support future researchers in the evaluation of surrogate markers.
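
For readers who want the arithmetic behind PTE before opening the Rsurrogate documentation, the sketch below computes Freedman's classic regression-based version, PTE = 1 - beta_adjusted/beta_unadjusted, on simulated data. This is a simpler, model-based cousin of the robust estimators used in the workshop, shown in Python purely for illustration; all names and values are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# Toy data: A = treatment, S = surrogate, Y = primary outcome.
rng = np.random.default_rng(1)
n = 400
A = rng.integers(0, 2, n)
S = 0.5 * A + rng.normal(size=n)          # surrogate partially captures A's effect
Y = 1.0 * A + 1.2 * S + rng.normal(size=n)

# Unadjusted treatment effect on Y
beta_total = sm.OLS(Y, sm.add_constant(A)).fit().params[1]

# Treatment effect on Y after adjusting for the surrogate
X = sm.add_constant(np.column_stack([A, S]))
beta_residual = sm.OLS(Y, X).fit().params[1]

# Freedman's proportion of treatment effect explained
pte = 1.0 - beta_residual / beta_total
print(f"PTE estimate: {pte:.3f}")
```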


Subject(s)
Diabetes Mellitus, Type 2 , Child , Humans , Treatment Outcome , Biomarkers
4.
Biometrics; 80(1), 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38386359

ABSTRACT

In clinical studies of chronic diseases, the effectiveness of an intervention is often assessed using "high cost" outcomes that require long-term patient follow-up and/or are invasive to obtain. While much progress has been made in the development of statistical methods to identify surrogate markers, that is, measurements that could replace such costly outcomes, these methods are generally not applicable to studies with a small sample size: they either rely on nonparametric smoothing, which requires a relatively large sample size, or on strict model assumptions that are unlikely to hold in practice and are empirically difficult to verify with a small sample size. In this paper, we develop a novel rank-based nonparametric approach to evaluate a surrogate marker in a small sample size setting. The method is motivated by a small study of children with nonalcoholic fatty liver disease (NAFLD), a diagnosis covering a range of liver conditions in individuals without a significant history of alcohol intake. Specifically, we examine whether change in alanine aminotransferase (ALT; measured in blood) is a surrogate marker for change in NAFLD activity score (obtained by biopsy) in a trial that compared Vitamin E ($n=50$) with placebo ($n=46$) among children with NAFLD.


Subject(s)
Non-alcoholic Fatty Liver Disease , Child , Humans , Non-alcoholic Fatty Liver Disease/diagnosis , Biomarkers , Biopsy , Sample Size
5.
Biometrics; 80(1), 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465982

ABSTRACT

In many modern machine learning applications, changes in covariate distributions and difficulty in acquiring outcome information have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt a model to unlabeled target populations using existing labeled data from a source population. However, there is a paucity of literature on transferring performance metrics, especially receiver operating characteristic (ROC) parameters, of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier on an unlabeled target population based on ROC analysis. We propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM), an efficient three-step estimation procedure that (1) employs double-index modeling to construct calibrated density ratio weights, (2) uses robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency, and (3) applies cross-validation to correct for potential overfitting bias in the estimators in finite samples. We establish the consistency and asymptotic normality of the proposed estimator under correct specification of either the density ratio model or the outcome model. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method by evaluating the prediction performance of a phenotyping model for rheumatoid arthritis (RA) on a temporally evolving EHR cohort.
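
The density-ratio weighting step that STEAM builds on can be illustrated compactly: train a classifier to distinguish source from target covariates, convert its probabilities into importance weights, and compute a weighted AUC on the labeled source sample. The Python sketch below shows only this weighting step on simulated data, without STEAM's calibration, imputation, or cross-validation corrections; all values are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Labeled source data and unlabeled target data with covariate shift
X_src = rng.normal(0.0, 1.0, (1000, 3))
y_src = (X_src[:, 0] + rng.normal(size=1000) > 0).astype(int)
X_tgt = rng.normal(0.5, 1.0, (1000, 3))   # shifted covariates, labels unobserved

# Step 1: density-ratio weights via a source-vs-target classifier
X_all = np.vstack([X_src, X_tgt])
d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
dr_model = LogisticRegression().fit(X_all, d)
p_tgt = dr_model.predict_proba(X_src)[:, 1]
w = p_tgt / (1.0 - p_tgt)   # approximates dP_target/dP_source at source points

# Step 2: weighted AUC of a trained classifier, re-weighting source
# observations to mimic the target covariate distribution
clf = LogisticRegression().fit(X_src, y_src)
scores = clf.predict_proba(X_src)[:, 1]
auc_target = roc_auc_score(y_src, scores, sample_weight=w)
print(f"Estimated target-population AUC: {auc_target:.3f}")
```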


Subject(s)
Machine Learning , Supervised Machine Learning , Humans , ROC Curve , Research Design , Bias
6.
Stat Med; 43(17): 3184-3209, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-38812276

ABSTRACT

Determining whether a surrogate marker can be used to replace a primary outcome in a clinical study is complex. While many statistical methods have been developed to formally evaluate a surrogate marker, they generally do not provide a way to examine heterogeneity in the utility of a surrogate marker. Similar to treatment effect heterogeneity, where the effect of a treatment varies based on a patient characteristic, heterogeneity in surrogacy means that the strength or utility of the surrogate marker varies based on a patient characteristic. The few methods recently developed to examine such heterogeneity cannot accommodate censored data. Studies with a censored outcome are typically the studies that could most benefit from a surrogate because the follow-up time is often long. In this paper, we develop a robust nonparametric approach to assess heterogeneity in the utility of a surrogate marker with respect to a baseline variable in a censored time-to-event outcome setting. In addition, we propose and evaluate a testing procedure to formally test for heterogeneity at a single time point or across multiple time points simultaneously. Finite sample performance of our estimation and testing procedure is examined in a simulation study. We use our proposed method to investigate the complex relationship between change in fasting plasma glucose, diabetes, and sex hormones using data from the Diabetes Prevention Program study.


Subject(s)
Biomarkers , Blood Glucose , Computer Simulation , Humans , Biomarkers/blood , Blood Glucose/analysis , Female , Models, Statistical , Male , Gonadal Steroid Hormones/blood , Gonadal Steroid Hormones/therapeutic use , Statistics, Nonparametric , Data Interpretation, Statistical , Diabetes Mellitus
7.
J Biomed Inform; 157: 104685, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39004109

ABSTRACT

BACKGROUND: Risk prediction plays a crucial role in planning for prevention, monitoring, and treatment. Electronic Health Records (EHRs) offer an expansive repository of temporal medical data encompassing both the risk factors and the outcome indicators essential for effective risk prediction. However, challenges emerge due to the lack of readily available gold-standard outcomes and the complex effects of various risk factors. Compounding these challenges are false positives in diagnosis codes and the formidable task of pinpointing onset timing in annotations. OBJECTIVE: We develop a Semi-supervised Double Deep Learning Temporal Risk Prediction (SeDDLeR) algorithm based on extensive unlabeled longitudinal EHR data augmented by a limited set of gold-standard labels on binary status information indicating whether the clinical event of interest occurred during the follow-up period. METHODS: The SeDDLeR algorithm calculates an individualized risk of developing future clinical events over time from each patient's baseline EHR features via the following steps: (1) construction of an initial EHR-derived surrogate as a proxy for onset status; (2) deep learning calibration of the surrogate against gold-standard onset status; and (3) semi-supervised deep learning for risk prediction combining calibrated surrogates and gold-standard onset status. To account for missing onset times and heterogeneous follow-up, we introduce temporal kernel weighting. We devise a Gated Recurrent Units (GRU) module to capture temporal characteristics. We then assess the proposed SeDDLeR method in simulation studies and apply it to the Mass General Brigham (MGB) Biobank to predict type 2 diabetes (T2D) risk. RESULTS: SeDDLeR outperforms benchmark risk prediction methods, including the Semi-parametric Transformation Model (STM) and DeepHit, with consistently the best accuracy across experiments. SeDDLeR achieved the best C-statistic (0.815, SE 0.023; vs STM +0.084, SE 0.030, P-value .004; vs DeepHit +0.055, SE 0.027, P-value .024) and the best average time-specific AUC (0.778, SE 0.022; vs STM +0.059, SE 0.039, P-value .067; vs DeepHit +0.168, SE 0.032, P-value <.001) in the MGB T2D study. CONCLUSION: SeDDLeR can train robust risk prediction models in both real-world EHR and synthetic datasets with minimal requirements for labeling event times. It holds the potential to be incorporated into future clinical trial recruitment or clinical decision-making.
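
As a hedged illustration of the GRU module mentioned in the methods, the PyTorch sketch below maps a sequence of per-visit EHR feature vectors to a single risk score. It is a generic building block, not the SeDDLeR pipeline (which adds surrogate construction, calibration, and temporal kernel weighting); all dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class GRURiskModel(nn.Module):
    """Map a sequence of per-visit EHR feature vectors to an event risk."""
    def __init__(self, n_features: int, hidden_dim: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, n_visits, n_features)
        _, h = self.gru(x)                 # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h.squeeze(0)))  # risk in (0, 1)

# Toy batch: 8 patients, 12 visits, 20 codified EHR features per visit
model = GRURiskModel(n_features=20)
x = torch.randn(8, 12, 20)
risk = model(x)
print(risk.shape)  # torch.Size([8, 1])
```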


Subject(s)
Algorithms , Deep Learning , Electronic Health Records , Humans , Risk Assessment/methods , Risk Factors , Supervised Machine Learning
8.
Biostatistics; 23(2): 397-411, 2022 Apr 13.
Article in English | MEDLINE | ID: mdl-32909599

ABSTRACT

Divide-and-conquer (DAC) is a commonly used strategy to overcome the challenges of extraordinarily large data: first break the dataset into a series of data blocks, then combine results from the individual blocks to obtain a final estimate. Various DAC algorithms have been proposed to fit sparse predictive regression models in the $L_1$ regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedures provide inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only the selected covariates and perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) procedure that utilizes the by-products of SOLID, substantially reducing the computational burden. Compared with existing DAC methods, the MCV procedure is the first to make inference on accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full-sample estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.
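
To make the divide-and-conquer idea concrete, the Python sketch below implements the simplest DAC variant: one-shot averaging of block-wise logistic regression fits. SOLID's screening, one-step linearization, and penalized estimation refinements are not shown; this is only the basic strategy on simulated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dac_logistic(X, y, n_blocks=10):
    """Naive divide-and-conquer logistic regression via one-shot averaging:
    fit each block separately, then average the coefficient estimates."""
    idx = np.array_split(np.random.permutation(len(y)), n_blocks)
    coefs = []
    for block in idx:
        # C=1e6 makes the fit effectively unpenalized
        m = LogisticRegression(C=1e6, max_iter=1000)
        m.fit(X[block], y[block])
        coefs.append(np.r_[m.intercept_, m.coef_.ravel()])
    return np.mean(coefs, axis=0)

rng = np.random.default_rng(3)
X = rng.normal(size=(20000, 5))
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.8])
y = (X @ beta + rng.logistic(size=20000) > 0).astype(int)
print(np.round(dac_logistic(X, y), 2))  # roughly recovers [intercept, beta]
```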


Subject(s)
Algorithms , Research Design , Computer Simulation , Humans , Logistic Models
9.
Biometrics; 79(2): 799-810, 2023 Jun.
Article in English | MEDLINE | ID: mdl-34874550

ABSTRACT

In studies that require long-term and/or costly follow-up of participants to evaluate a treatment, there is often interest in identifying and using a surrogate marker to evaluate the treatment effect. While several statistical methods have been proposed to evaluate potential surrogate markers, available methods generally do not account for or address the potential for a surrogate to vary in utility or strength by patient characteristics. Previous work examining surrogate markers has indicated that such heterogeneity may exist, that is, that a surrogate marker may be useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups but not for others. This heterogeneity is important to understand, particularly if the surrogate is to be used in a future trial to replace the primary outcome. In this paper, we propose an approach and estimation procedures to measure surrogate strength as a function of a baseline covariate W and thus examine potential heterogeneity in the utility of the surrogate marker with respect to W. Within a potential outcomes framework, we quantify surrogate strength/utility using the proportion of the treatment effect on the primary outcome that is explained by the treatment effect on the surrogate. We propose procedures to test for evidence of heterogeneity, examine the finite sample performance of these methods via simulation, and illustrate the methods using AIDS clinical trial data.
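
A rough way to visualize heterogeneity in surrogate strength is to localize a simple PTE estimate around values of the baseline covariate using kernel weights. The Python sketch below does exactly that with a Freedman-style PTE inside weighted least squares; it is an informal illustration under strong modeling assumptions, not the potential-outcomes estimator developed in the paper, and all data are simulated:

```python
import numpy as np
import statsmodels.api as sm

def pte_at(w0, W, A, S, Y, bandwidth=0.5):
    """Kernel-weighted Freedman-style PTE near covariate value w0:
    a crude look at how surrogate strength varies with W."""
    k = np.exp(-0.5 * ((W - w0) / bandwidth) ** 2)   # Gaussian kernel weights
    b_tot = sm.WLS(Y, sm.add_constant(A), weights=k).fit().params[1]
    X = sm.add_constant(np.column_stack([A, S]))
    b_res = sm.WLS(Y, X, weights=k).fit().params[1]
    return 1.0 - b_res / b_tot

rng = np.random.default_rng(4)
n = 2000
W = rng.uniform(0, 1, n)
A = rng.integers(0, 2, n)
S = A * W + rng.normal(size=n)            # surrogate is stronger for larger W
Y = A + 1.5 * S + rng.normal(size=n)
for w0 in (0.2, 0.5, 0.8):
    print(w0, round(pte_at(w0, W, A, S, Y), 2))   # PTE rises with W
```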


Subject(s)
Biomarkers , Humans , Computer Simulation
10.
Biometrics; 79(1): 190-202, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34747010

ABSTRACT

Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses are based on poor proxies. Due to the lack of detailed documentation and the labor intensiveness of manual annotation, it is often feasible to ascertain, for only a small subset of patients, the current status of the disease by a follow-up time rather than the exact onset time. In this paper, we aim to develop risk prediction models for the onset time that efficiently leverage both a small number of labels on current status and a large number of unlabeled observations with imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for the proxy onset time, we propose a semisupervised risk prediction method that efficiently combines information from the proxies and the limited labels. Starting from an initial estimator based solely on the labeled subset, we perform a one-step correction with the full data, augmenting against a mean-zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semisupervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from the Mass General Brigham Healthcare Biobank.


Subject(s)
Algorithms , Electronic Health Records , Computer Simulation , Risk Factors
11.
Biometrics; 79(2): 788-798, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35426444

ABSTRACT

Identifying effective and valid surrogate markers to make inference about a treatment effect on long-term outcomes is an important step in improving the efficiency of clinical trials. Replacing a long-term outcome with short-term and/or cheaper surrogate markers can potentially shorten study duration and reduce trial costs. There is a sizable statistical literature on methods to quantify the effectiveness of a single surrogate marker, and both parametric and nonparametric approaches have been well developed for different outcome types. However, when multiple markers are available, methods for combining markers to construct a composite marker with improved surrogacy remain limited. In this paper, building on the optimal transformation framework of Wang et al. (2020), we propose a novel calibrated model fusion approach to optimally combine multiple markers to improve surrogacy. Specifically, we obtain two initial estimates of optimal composite scores of the markers based on two sets of models, one approximating the underlying data distribution and the other directly approximating the optimal transformation function. We then estimate an optimal calibrated combination of the two estimated scores, which ensures both the validity of the final combined score and its optimality with respect to the proportion of treatment effect explained. This approach is unique in that it identifies an optimal combination of the multiple surrogates without strictly relying on parametric assumptions, while borrowing modeling strategies to avoid fully nonparametric estimation, which is subject to the curse of dimensionality. Our identified optimal transformation can also be used to directly quantify the surrogacy of the combined score. Theoretical properties of the proposed estimators are derived, and the finite sample performance of the proposed method is evaluated through simulation studies. We further illustrate the proposed method using data from the Diabetes Prevention Program study.


Subject(s)
Models, Statistical , Computer Simulation , Biomarkers
12.
Stat Med; 42(1): 68-88, 2023 Jan 15.
Article in English | MEDLINE | ID: mdl-36372072

ABSTRACT

The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with a shorter follow-up time or at lower cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate to test for a treatment effect while ignoring this heterogeneity may lead to inaccurate conclusions about the treatment effect, particularly when the patient population in the new study has a different mix of characteristics than the study used to evaluate the utility of the surrogate marker. In this article, we develop a novel test for a treatment effect using surrogate marker information that accounts for heterogeneity in the utility of the surrogate. We compare our testing procedure to a test that uses primary outcome information (the gold standard) and a test that uses surrogate marker information but ignores heterogeneity. We demonstrate the validity of our approach and derive the asymptotic properties of our estimator and variance estimates. Simulation studies examine the finite sample properties of our testing procedure and demonstrate when our proposed approach can outperform the testing approach that ignores heterogeneity. We illustrate our methods using data from an AIDS clinical trial to test for a treatment effect using CD4 count as a surrogate marker for RNA.


Subject(s)
Computer Simulation , Humans , Biomarkers , CD4 Lymphocyte Count
13.
J Biomed Inform; 143: 104415, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37276949

ABSTRACT

Disease knowledge graphs have emerged as a powerful tool for artificial intelligence to connect, organize, and access diverse information about diseases. Relations between disease concepts are often distributed across multiple datasets, including unstructured plain text datasets and incomplete disease knowledge graphs. Extracting disease relations from multimodal data sources is thus crucial for constructing accurate and comprehensive disease knowledge graphs. We introduce REMAP, a multimodal approach for disease relation extraction. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, aligning the multimodal embeddings for optimal disease relation extraction. Additionally, REMAP utilizes a decoupled model structure to enable inference in single-modal data, which can be applied under missing modality scenarios. We apply the REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves language-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with language information. Furthermore, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). REMAP is a flexible multimodal approach for extracting disease relations by fusing structured knowledge and language information. This approach provides a powerful model to easily find, access, and evaluate relations between disease concepts.


Subject(s)
Artificial Intelligence , Machine Learning , Humans , Unified Medical Language System , Language , Natural Language Processing
14.
J Biomed Inform; 144: 104425, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37331495

ABSTRACT

OBJECTIVE: Electronic health records (EHR), containing detailed longitudinal clinical information on large numbers of patients and covering broad patient populations, open opportunities for comprehensive predictive modeling of disease progression and treatment response. However, since EHRs were originally constructed for administrative purposes, not for research, it is often not feasible in EHR-linked studies to capture reliable information for analytical variables, especially in the survival setting, where both accurate event status and event times are needed for model building. For example, progression-free survival (PFS), a commonly used survival outcome for cancer patients, often involves complex information embedded in free-text clinical notes and cannot be extracted reliably. Proxies of the PFS time, such as the time of the first mention of progression in the notes, are at best good approximations to the true event time. This makes it difficult to efficiently estimate event rates for an EHR patient cohort. Estimating survival rates based on error-prone outcome definitions can lead to biased results and hamper the power of downstream analyses. On the other hand, extracting accurate event time information via manual annotation is time and resource intensive. The objective of this study is to develop a calibrated survival rate estimator using noisy outcomes from EHR data. MATERIALS AND METHODS: We propose a two-stage semi-supervised calibration of noisy event rate (SCANER) estimator that effectively overcomes censoring-induced dependency and attains more robust performance (i.e., it is not sensitive to misspecification of the imputation model) by fully utilizing both a small labeled set of gold-standard survival outcomes annotated via manual chart review and a set of proxy features automatically captured via the EHR in the unlabeled set. We validate the SCANER estimator by estimating PFS rates for a virtual cohort of lung cancer patients from one large tertiary care center and ICU-free survival rates for COVID patients from two large tertiary care centers. RESULTS: In terms of survival rate estimates, SCANER produced point estimates very similar to those of the complete-case Kaplan-Meier estimator. In contrast, the benchmark methods used for comparison, which fail to account for the induced dependency between the event time and the censoring time conditional on surrogate outcomes, produced biased results across all three case studies. In terms of standard errors, the SCANER estimator was more efficient than the KM estimator, with up to a 50% efficiency gain. CONCLUSION: The SCANER estimator achieves more efficient, robust, and accurate survival rate estimates than existing approaches. This promising new approach can also improve resolution (i.e., the granularity of event times) by using labels conditional on multiple surrogates, particularly among less common or poorly coded conditions.
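
For orientation, the complete-case benchmark SCANER is compared against is an ordinary Kaplan-Meier fit on the small chart-reviewed subset alone. Below is a minimal sketch using the lifelines package on simulated labels; SCANER's gain comes from additionally exploiting proxy features on the unlabeled patients, which is not shown, and all numbers are synthetic:

```python
import numpy as np
from lifelines import KaplanMeierFitter

# Complete-case benchmark: Kaplan-Meier on the gold-standard labeled subset.
rng = np.random.default_rng(5)
n_labeled = 200
true_t = rng.exponential(24.0, n_labeled)       # months to progression (chart-reviewed)
censor_t = rng.uniform(6.0, 36.0, n_labeled)    # administrative censoring
durations = np.minimum(true_t, censor_t)
observed = (true_t <= censor_t).astype(int)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.predict(12.0))   # estimated 12-month progression-free survival
```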


Subject(s)
COVID-19 , Lung Neoplasms , Humans , Electronic Health Records , Calibration , Survival Analysis
15.
J Biomed Inform; 139: 104306, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36738870

ABSTRACT

BACKGROUND: In electronic health records, patterns of missing laboratory test results can capture a patient's course of disease as well as reflect clinicians' concerns about possible conditions. These patterns are often understudied and overlooked. This study aims to identify informative patterns of missingness among laboratory data collected across 15 healthcare system sites in three countries for COVID-19 inpatients. METHODS: We collected and analyzed demographic, diagnosis, and laboratory data for 69,939 patients with positive COVID-19 PCR tests across three countries from 1 January 2020 through 30 September 2021. We analyzed missing laboratory measurements across sites, missingness stratified by demographic variables, temporal trends in missingness, correlations between labs based on missingness indicators over time, and clustering of groups of labs based on their missingness/ordering patterns. RESULTS: With these analyses, we identified mapping issues in seven of the 15 sites. We also identified nuances in data collection and variable definition across the sites. Temporal trend analyses may support the use of laboratory test missingness patterns in identifying severe COVID-19 patients. Lastly, using missingness patterns, we determined relationships between various labs that reflect clinical behaviors. CONCLUSION: In this work, we use computational approaches to relate missingness patterns to hospital treatment capacity and highlight the heterogeneity of studying COVID-19 over time and at multiple sites, where phases, policies, and practices may differ. Changes in missingness could suggest a change in a patient's condition, and patterns of missingness among laboratory measurements could potentially identify clinical outcomes. This allows sites to treat missing data as informative in analyses and helps researchers identify which sites are better poised to study particular questions.
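
The first analysis step described here, turning a lab table into missingness indicators and examining their correlations, is straightforward to reproduce. Below is a small pandas sketch on synthetic patient-day data; the lab names and missingness rates are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy lab table: one row per patient-day, NaN = test not ordered/resulted.
rng = np.random.default_rng(6)
n = 1000
labs = pd.DataFrame({
    "crp": np.where(rng.random(n) < 0.4, np.nan, rng.normal(50, 20, n)),
    "ferritin": np.where(rng.random(n) < 0.5, np.nan, rng.normal(600, 200, n)),
    "troponin": np.where(rng.random(n) < 0.8, np.nan, rng.normal(0.02, 0.01, n)),
})

# Missingness indicators: 1 if the lab is absent on that patient-day
miss = labs.isna().astype(int)

# Pairwise correlation of missingness reveals labs that are co-ordered
# (or co-omitted), i.e., informative ordering patterns
print(miss.corr().round(2))

# Per-lab missingness rates: the starting point for cross-site comparisons
print(miss.mean().round(2))
```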


Subject(s)
COVID-19 , Electronic Health Records , Humans , Data Collection , Records , Cluster Analysis
16.
J Med Internet Res; 25: e45662, 2023 May 25.
Article in English | MEDLINE | ID: mdl-37227772

ABSTRACT

Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.


Subject(s)
Colonic Neoplasms , Electronic Health Records , Humans , Algorithms , Informatics , Research Design
17.
J Infect Dis; 226(12): 2113-2117, 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-35512327

ABSTRACT

In this retrospective cohort study of 94,595 severe acute respiratory syndrome coronavirus 2-positive cases, we developed and validated an algorithm to assess the association between coronavirus disease 2019 (COVID-19) severity and long-term complications (stroke, myocardial infarction, pulmonary embolism/deep vein thrombosis, heart failure, and mortality). COVID-19 severity was associated with a greater risk of experiencing a long-term complication 31-120 days postinfection. Most incident events occurred 31-60 days postinfection and diminished after day 91, except heart failure for severe patients and death for moderate patients, which peaked on days 91-120. Understanding the differential impact of COVID-19 severity on long-term events provides insight into possible intervention modalities and critical prevention strategies.


Subject(s)
COVID-19 , Heart Failure , Veterans , Humans , United States/epidemiology , Retrospective Studies
18.
Clin Gastroenterol Hepatol; 20(10): 2366-2372.e6, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35066137

ABSTRACT

BACKGROUND & AIMS: The comparative safety of therapies is important to inform their relative positioning within the therapeutic algorithm. Tumor necrosis factor α antagonists (anti-TNF) are associated with an increased risk of infections. Whether there is a similar increase with ustekinumab (UST) or tofacitinib has not been established. METHODS: We identified patients with Crohn's disease or ulcerative colitis from a national commercial health insurance plan in the United States between 2008 and 2019. Infectious outcomes were ascertained for patients newly initiating anti-TNF, UST, or tofacitinib therapy. Cox proportional hazards models were fit in propensity score-weighted cohorts to compare rates between patients treated with UST or tofacitinib and anti-TNF therapy. RESULTS: Our study included 19,096, 2420, and 305 patients with inflammatory bowel disease initiating anti-TNF, UST, and tofacitinib therapy, respectively. During on-treatment follow-up, 7% of anti-TNF patients had infection-related hospitalizations and 44% developed infections, compared with 4% and 32% of UST patients and 6% and 41% of tofacitinib patients, respectively. In the weighted Cox analysis, UST was associated with a significantly lower risk of infection (hazard ratio [HR], 0.93; 95% confidence interval [CI], 0.86-0.99) compared with anti-TNF therapy, and there was a trend toward a reduction in infection-related hospitalizations (HR, 0.84; 95% CI, 0.66-1.03). The risk of infections (HR, 0.97; 95% CI, 0.75-1.24) and of infection-related hospitalizations (HR, 0.59; 95% CI, 0.27-1.05) was similar between patients on tofacitinib and anti-TNF therapy. CONCLUSIONS: UST is associated with a reduced risk of infections compared with anti-TNF biologics in inflammatory bowel disease, whereas no difference was observed between tofacitinib and anti-TNF therapy.


Subject(s)
Biological Products , Inflammatory Bowel Diseases , Biological Products/therapeutic use , Humans , Inflammatory Bowel Diseases/drug therapy , Piperidines , Pyrimidines , Tumor Necrosis Factor Inhibitors , Tumor Necrosis Factor-alpha , Ustekinumab/adverse effects
19.
Biostatistics; 22(2): 381-401, 2021 Apr 10.
Article in English | MEDLINE | ID: mdl-31545341

ABSTRACT

We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0\gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least squares approximation to the partial likelihood (PL). These sequences of linearization enable us to maximize the PL with only a small subset of the data and to perform penalized estimation via a fast approximation to the PL. The algorithm is applicable to the analysis of both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm substantially outperforms the full-sample estimators and the existing DAC algorithm in computational speed, while achieving statistical efficiency similar to that of the full-sample estimators. The proposed algorithm was applied to extraordinarily large survival datasets for the prediction of heart failure-specific readmission within 30 days among Medicare heart failure patients.


Subject(s)
Algorithms , Medicare , Aged , Computer Simulation , Humans , Least-Squares Analysis , Proportional Hazards Models , United States
20.
Am J Gastroenterol; 117(11): 1845-1850, 2022 Nov 01.
Article in English | MEDLINE | ID: mdl-35854436

ABSTRACT

INTRODUCTION: There are limited data on the comparative risk of infections with various biologic agents in older adults with inflammatory bowel diseases (IBDs). We aimed to assess the comparative safety of biologic agents in older IBD patients with varying comorbidity burden. METHODS: We used data from a large, national commercial insurance plan in the United States to identify patients 60 years and older with IBD who newly initiated tumor necrosis factor-α antagonists (anti-TNF), vedolizumab, or ustekinumab. Comorbidity was defined using the Charlson Comorbidity Index (CCI). Our primary outcome was infection-related hospitalizations. Cox proportional hazards models were fitted in propensity score-weighted cohorts to compare the risk of infections between the different therapeutic classes. RESULTS: The anti-TNF, vedolizumab, and ustekinumab cohorts included 2,369, 972, and 352 patients, respectively, with a mean age of 67 years. The overall rate of infection-related hospitalizations among patients initiating vedolizumab (hazard ratio [HR] 0.94, 95% confidence interval [CI] 0.84-1.04) and ustekinumab (HR 0.92, 95% CI 0.74-1.16) was similar to that with anti-TNF agents. Among patients with a CCI of >1, both ustekinumab (HR: 0.66, 95% CI: 0.46-0.91, p-interaction <0.01) and vedolizumab (HR: 0.78, 95% CI: 0.65-0.94, p-interaction: 0.02) were associated with a significantly lower rate of infection-related hospitalizations compared with anti-TNFs. No difference was found among patients with a CCI of ≤1. DISCUSSION: Among adults 60 years and older with IBD initiating biologic therapy, both vedolizumab and ustekinumab were associated with lower rates of infection-related hospitalizations than anti-TNF therapy for those with a high comorbidity burden.


Subject(s)
Biological Therapy , Infections , Inflammatory Bowel Diseases , Ustekinumab , Aged , Humans , Biological Therapy/adverse effects , Comorbidity , Gastrointestinal Agents/therapeutic use , Inflammatory Bowel Diseases/complications , Inflammatory Bowel Diseases/drug therapy , Inflammatory Bowel Diseases/epidemiology , Retrospective Studies , Treatment Outcome , Tumor Necrosis Factor Inhibitors/therapeutic use , Ustekinumab/therapeutic use , Infections/etiology