ABSTRACT
Electronic health records (EHR) are not designed for population-based research, but they provide easy and quick access to longitudinal health information for a large number of individuals. Many statistical methods have been proposed to account for selection bias, missing data, phenotyping errors, or other problems that arise in EHR data analysis. However, addressing multiple sources of bias simultaneously is challenging. We developed a methodological framework (R package, SAMBA) for jointly handling both selection bias and phenotype misclassification in the EHR setting that leverages external data sources. These methods assume that factors related to selection and misclassification are fully observed, but in practice these factors may be poorly understood and only partially observed. As a follow-up to the methodological work, we demonstrate how to apply these methods in two real-world case studies, and we evaluate their performance. In both examples, we use individual patient-level data collected through the University of Michigan Health System and various external population-based data sources. In case study (a), we explore the impact of these methods on estimated associations between gender and cancer diagnosis. In case study (b), we compare corrected associations between previously identified genetic loci and age-related macular degeneration with gold-standard external summary estimates. These case studies illustrate how to utilize diverse auxiliary information to achieve less biased inference in EHR-based research.
Subjects
Electronic Health Records, Information Storage and Retrieval, Selection Bias, Bias, Phenotype
ABSTRACT
BACKGROUND: Traditional methods for disease risk prediction and assessment, such as diagnostic tests using serum, urine, blood, saliva or imaging biomarkers, have been important for identifying high-risk individuals for many diseases, leading to early detection and improved survival. For pancreatic cancer, traditional methods for screening have been largely unsuccessful in identifying high-risk individuals in advance of disease progression, leading to high mortality and poor survival. Electronic health records (EHR) linked to genetic profiles provide an opportunity to integrate multiple sources of patient information for risk prediction and stratification. We leverage a constellation of temporally associated diagnoses available in the EHR to construct a summary risk score, called a phenotype risk score (PheRS), for identifying individuals at high risk of having pancreatic cancer. The proposed PheRS approach incorporates the time with respect to disease onset into the prediction framework. We combine and contrast the PheRS with more well-known measures of inherited susceptibility, namely, the polygenic risk scores (PRS) for prediction of pancreatic cancer. METHODOLOGY: We first calculated pairwise, unadjusted associations between pancreatic cancer diagnosis and all possible other diagnoses across the medical phenome. We call these pairwise associations co-occurrences. After accounting for cross-phenotype correlations, the multivariable association estimates from a subset of relatively independent diagnoses were used to create a weighted sum PheRS. We constructed time-restricted risk scores using data from 38,359 participants in the Michigan Genomics Initiative (MGI) based on the diagnoses contained in the EHR at 0, 1, 2, and 5 years prior to the target pancreatic cancer diagnosis. The predictive performance of the PheRS was then assessed in the UK Biobank (UKB).
We tested the relative contribution of PheRS when added to a model containing a summary measure of inherited genetic susceptibility (PRS) plus other covariates like age, sex, smoking status, drinking status, and body mass index (BMI). RESULTS: Our exploration of co-occurrence patterns identified expected associations while also revealing unexpected relationships that may warrant closer attention. Using only the pancreatic cancer PheRS at 5 years before the target diagnosis yielded an AUC of 0.60 (95% CI = [0.58, 0.62]) in UKB. A larger predictive model including PheRS, PRS, and the covariates at the 5-year threshold achieved an AUC of 0.74 (95% CI = [0.72, 0.76]) in UKB. We note that PheRS does contribute independently in the joint model. Finally, scores at the top percentiles of the PheRS distribution demonstrated promise in terms of risk stratification. Scores in the top 2% were 10.20 (95% CI = [9.34, 12.99]) times more likely to identify cases than those in the bottom 98% in UKB at the 5-year threshold prior to pancreatic cancer diagnosis. CONCLUSIONS: We developed a framework for creating a time-restricted PheRS from EHR data for pancreatic cancer using the rich information content of a medical phenome. In addition to identifying hypothesis-generating associations for future research, this PheRS demonstrates a potentially important contribution in identifying high-risk individuals, even after adjusting for PRS for pancreatic cancer and other traditional epidemiologic covariates. The methods are generalizable to other phenotypic traits.
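The weighted-sum construction of a time-restricted PheRS can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the three example diagnoses, their weights, and the days-before-diagnosis matrix are all hypothetical, and the step that selects relatively independent diagnoses and estimates their multivariable weights is taken as given.

```python
import numpy as np

DAYS_PER_YEAR = 365.25

def time_restricted_indicators(days_before_target, threshold_years):
    """Return 1.0 where a diagnosis was recorded at least `threshold_years`
    before the target diagnosis, else 0.0. NaN means 'never recorded'."""
    days = np.asarray(days_before_target, dtype=float)
    cutoff = threshold_years * DAYS_PER_YEAR
    recorded = ~np.isnan(days)
    # Replace NaN with -1 so the comparison is well defined, then mask it out.
    early_enough = np.where(recorded, days, -1.0) >= cutoff
    return (recorded & early_enough).astype(float)

def phenotype_risk_score(indicators, weights):
    """Weighted sum of diagnosis indicators, one score per person (row)."""
    return np.asarray(indicators, dtype=float) @ np.asarray(weights, dtype=float)

# Hypothetical data: rows are people, columns are three diagnoses.
# Each entry is the number of days between the first record of that
# diagnosis and the target diagnosis date.
days = np.array([
    [2000.0, 100.0, np.nan],
    [np.nan, np.nan, np.nan],
    [400.0, 1900.0, 3000.0],
])

# Hypothetical multivariable association estimates used as weights.
weights = np.array([0.5, 1.2, -0.3])

# One score vector per time threshold (0, 1, 2, and 5 years before diagnosis).
scores_by_threshold = {
    years: phenotype_risk_score(time_restricted_indicators(days, years), weights)
    for years in (0, 1, 2, 5)
}

for years, scores in scores_by_threshold.items():
    print(years, np.round(scores, 2))
```

In this toy example, moving the threshold further back removes recent diagnoses from the score, which is why a person's PheRS can shrink as the threshold grows.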
Subjects
Electronic Health Records, Pancreatic Neoplasms, Biological Specimen Banks, Genome-Wide Association Study, Humans, Michigan, Pancreatic Neoplasms/genetics, Phenotype, Risk Factors
ABSTRACT
Biobanks linked to electronic health records provide rich resources for health-related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large-scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis-generating studies of disease-treatment, disease-exposure, and disease-gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank-based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some open problems. This work complements and extends recent reviews about biobank-based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.
Subjects
Biological Specimen Banks, Electronic Health Records, Genomics, Michigan, Research Design
ABSTRACT
Using administrative patient-care data such as Electronic Health Records (EHR) and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Among these sources of bias, this paper focuses on selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias, with accompanying variance formulae. Through a simulation study, we demonstrate the conditions under which these approaches reduce bias in analyses of real-world data. We compare these methods using a data example where our goal is to estimate the well-known association of cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
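To make the idea of weighting concrete, the sketch below applies generic inverse probability of selection weighting to a 2x2 table of a binary outcome and a binary exposure. It is an illustration under strong assumptions rather than any of the four approaches the paper evaluates: the cell counts and selection probabilities are invented, and the selection probabilities are treated as known, whereas in practice they have to be estimated.

```python
from collections import defaultdict

def weighted_odds_ratio(records):
    """Odds ratio from (outcome, exposure, weight) records.

    Each record adds its weight to the matching cell of the 2x2 table;
    a weight of 1.0 for every record gives the usual unweighted estimate.
    """
    cells = defaultdict(float)
    for outcome, exposure, weight in records:
        cells[(outcome, exposure)] += weight
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])

# Hypothetical selected sample.
# Key: (outcome, exposure). Value: (number selected, selection probability).
# Selection depends on both outcome and exposure, which induces bias.
selected = {
    (1, 1): (160, 0.8),
    (1, 0): (40, 0.4),
    (0, 1): (160, 0.2),
    (0, 0): (180, 0.2),
}

# Unweighted analysis: every selected person counts once.
unweighted = [
    (d, x, 1.0) for (d, x), (n, p) in selected.items() for _ in range(n)
]

# Weighted analysis: each person counts 1 / P(selected | outcome, exposure).
weighted = [
    (d, x, 1.0 / p) for (d, x), (n, p) in selected.items() for _ in range(n)
]

naive_or = weighted_odds_ratio(unweighted)
ipw_or = weighted_odds_ratio(weighted)

print(f"unweighted odds ratio: {naive_or:.2f}")
print(f"weighted odds ratio:   {ipw_or:.2f}")
```

The invented counts correspond to a source population with cell sizes 200, 100, 800, and 900, so the weighted estimate recovers the population odds ratio while the unweighted one is inflated. Valid standard errors for weighted estimates need more than this point estimate, which is the role of the variance formulae mentioned in the abstract.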
ABSTRACT
Background & Aims: Alpha-1 antitrypsin deficiency is caused by mutations in SERPINA1, most commonly homozygosity for the Pi∗Z variant, and can present as liver disease. While heterozygosity for Pi∗Z (Pi∗MZ) is linked to increased risk of cirrhosis, whether the Pi∗MZ genotype is associated with an increased rate of decompensation among patients who already have compensated cirrhosis is not known. Methods: This was a retrospective study of Michigan Genomics Initiative participants with baseline compensated cirrhosis. The primary predictors were Pi∗MZ or Pi∗MS genotype (vs. Pi∗MM). The primary outcomes were hepatic decompensation with ascites, hepatic encephalopathy, or variceal bleeding, or the combined endpoint of liver-related death or liver transplant, both modeled with Fine-Gray competing risk models. Results: We included 576 patients with baseline compensated cirrhosis who had undergone genotyping, of whom 474 had Pi∗MM, 49 had Pi∗MZ, and 52 had Pi∗MS genotypes. Compared to Pi∗MM genotype, Pi∗MZ was associated with increased rates of hepatic decompensation (hazard ratio 1.81; 95% CI 1.22-2.69; p = 0.003) and liver transplant or liver-related death (hazard ratio 2.07; 95% CI 1.21-3.52; p = 0.008). These associations remained significant after adjustment for severity of underlying liver disease, and were robust across subgroup analyses based on etiology, sex, obesity, and diabetes status. Pi∗MS was not associated with decompensation or death/transplantation. Conclusions: The SERPINA1 Pi∗MZ genotype is associated with an increased rate of hepatic decompensation and decreased transplant-free survival among patients with baseline compensated cirrhosis. Lay summary: There is a mutation in the gene SERPINA1 called Pi∗MZ which increases risk of liver scarring (cirrhosis); however, it is not known what effect Pi∗MZ has if someone already has cirrhosis.
In this study, we found that people who had cirrhosis and Pi∗MZ developed complications from cirrhosis faster than those who did not have the mutation.