ABSTRACT
Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR-based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood-based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient-varying factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting. Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies. For all methods proposed, we derive valid standard error estimators and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative, a longitudinal EHR-linked biorepository.
Subjects
Electronic Health Records, Bias, Humans, Likelihood Functions, Michigan, Selection Bias
ABSTRACT
OBJECTIVES: Because observational studies often use imperfect measurements, results are prone to misclassification errors. We used as a motivating example the possible teratogenic risks of antiemetic agents in pregnancy, since a large observational study recently showed that first-trimester exposure to doxylamine-pyridoxine was associated with significantly increased risk of congenital malformations as a whole, as well as central nervous system defects, whereas previous observational studies did not show such associations. A meta-analysis on this issue was carried out to illustrate how differential exposure and outcome misclassification may lead to uncertain conclusions. METHODS: Medline was searched up to October 2019 for full-text papers in English. Summary odds ratios (ORs) with confidence intervals (CIs) were calculated using random-effects models. Probabilistic sensitivity analyses were performed to evaluate the extent of differential misclassification required to account for the exposure-outcome association. RESULTS: Summary ORs were 1.02 (95% CI, 0.92-1.15), 0.99 (0.82-1.19), and 1.25 (1.08-1.44) for overall congenital, cardiocirculatory, and central nervous system malformations, respectively. Assuming exposure and outcome bias factors of 0.95 (i.e., newborns with congenital defects had exposure specificity 5% lower than healthy newborns) and 1.12 (i.e., exposed newborns had outcome sensitivity 12% higher than unexposed newborns), the summary OR for central nervous system defects became 1.13 (95% CI, 0.99-1.29) and 1.17 (95% CI, 0.99-1.38), respectively. CONCLUSION: Observational investigations and meta-analyses of observational studies need cautious interpretation. Their susceptibility to several, often subtle, sources of bias should be carefully evaluated.
Subjects
Drug-Induced Abnormalities/epidemiology, Antiemetics/adverse effects, Dicyclomine/adverse effects, Doxylamine/adverse effects, Nausea/drug therapy, Pyridoxine/adverse effects, Vomiting/drug therapy, Drug Combinations, Female, Humans, Nausea/epidemiology, Observational Studies as Topic, Odds Ratio, Pregnancy, Scientific Experimental Error, Uncertainty, Vomiting/epidemiology
ABSTRACT
Link prediction is a fundamental problem in network analysis. In a complex network, links can be unreported or fall below detection limits because of heterogeneous sources of noise and technical challenges during data collection. Incomplete network data can lead to inaccurate inference in network-based data analysis. We propose a parametric link prediction model and treat latent links as misclassified binary outcomes. We develop new algorithms to optimize model parameters and yield robust predictions of unobserved links. Theoretical properties of the predictive model are also discussed. We apply the new method to partially observed social network data and incomplete brain network data. The results demonstrate that our method outperforms existing latent-link prediction methods.
Subjects
Algorithms, Humans
ABSTRACT
Primary liver cancer is difficult to diagnose accurately at death, due to metastases from nearby organs and to concomitant diseases, such as chronic hepatitis and cirrhosis. Trends in diagnostic accuracy could affect radiation risk estimates for incident liver cancer by altering background rates or by impacting risk modification by sex and age. We quantified the potential impact of death-certificate inaccuracies on radiation risk estimates for liver cancer in the Life Span Study of atomic bomb survivors. True-positive and false-negative rates were obtained from a previous study that compared death-certificate causes of death with those based on pathological review, from 1958 to 1987. We assumed various scenarios for misclassification rates after 1987. We obtained estimated true positives and estimated false negatives by stratified sampling from binomial distributions with probabilities given by the true-positive and false-negative rates, respectively. Poisson regression methods were applied to highly stratified person-year tables of corrected case counts and accrued person years. During the study period (1958-2009), there were 1,885 cases of liver cancer, which included 383 death-certificate-only (DCO) cases; 1,283 cases with chronic liver disease as the underlying cause of death; and 150 DCO cases of pancreatic cancer among 105,444 study participants. Across the range of scenarios considered, radiation risk estimates based on corrected case counts were attenuated, on average, by 13-30%. Our results indicated that radiation risk estimates for liver cancer were potentially sensitive to death-certificate inaccuracies. Additional data are needed to inform misclassification rates in recent years.
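The resampling step described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the stratum fields (`certified`, `candidates`, `tp_rate`, `fn_rate`) are hypothetical names, and it assumes that the true-positive rate applies to deaths certified as liver cancer while the false-negative rate applies to deaths certified to other causes that may conceal liver cancer.

```python
import random


def binomial_draw(rng, n, p):
    """Draw from Binomial(n, p) as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def corrected_case_counts(strata, seed=0):
    """Return one simulated corrected case count per stratum.

    Each stratum is a dict with hypothetical keys:
      certified  - deaths certified as liver cancer
      candidates - deaths certified to other causes that may hide liver cancer
      tp_rate    - assumed probability that a certified case is a true case
      fn_rate    - assumed probability that a candidate death is a missed case
    """
    rng = random.Random(seed)
    corrected = []
    for s in strata:
        # Certified cases that survive pathological review.
        true_pos = binomial_draw(rng, s["certified"], s["tp_rate"])
        # Missed cases recovered from other certified causes.
        false_neg = binomial_draw(rng, s["candidates"], s["fn_rate"])
        corrected.append(true_pos + false_neg)
    return corrected
```

Repeating the draw with different seeds and refitting the regression to each set of corrected counts would give a distribution of risk estimates for a given misclassification scenario.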
Subjects
Atomic Bomb Survivors/statistics & numerical data, Liver Neoplasms/mortality, Radiation-Induced Neoplasms/mortality, Cause of Death, Humans, Incidence, Japan/epidemiology, Life Expectancy, Liver Diseases/diagnosis, Liver Diseases/mortality, Liver Diseases/pathology, Liver Neoplasms/diagnosis, Liver Neoplasms/pathology, Radiation-Induced Neoplasms/diagnosis, Radiation-Induced Neoplasms/pathology
ABSTRACT
Large-scale association analyses based on observational health care databases such as electronic health records have been a topic of increasing interest in the scientific community. However, challenges due to nonprobability sampling and phenotype misclassification associated with the use of these data sources are often ignored in standard analyses. The extent of the bias introduced by ignoring these factors is not well characterized. In this paper, we develop an analytic framework for characterizing the bias expected in disease-gene association studies based on electronic health records when disease status misclassification and the sampling mechanism are ignored. Through a sensitivity analysis approach, this framework can be used to obtain plausible values for parameters of interest given summary results from standard analysis. We develop an online tool for performing this sensitivity analysis. Simulations demonstrate promising properties of the proposed method. We apply our approach to study bias in disease-gene association studies using electronic health record data from the Michigan Genomics Initiative, a longitudinal biorepository effort within the University of Michigan health system.
Subjects
Electronic Health Records, Genome-Wide Association Study, Bias, Michigan, Phenotype, Single Nucleotide Polymorphism
ABSTRACT
PURPOSE: It is well documented that outcome misclassification can bias a point estimate. We aimed to understand current practice in addressing this bias in pharmacoepidemiology database studies and to develop an open-source application (app) from existing methodology to demonstrate the impact and mechanism of this bias on results. METHODS: Studies of an exposure and a clinical outcome were selected from all Pharmacoepidemiology and Drug Safety publications during 2017, and any reference to outcome misclassification was described. An app to correct risk ratio (RR) and cumulative incidence for outcome misclassification was developed from a published methodology and used to demonstrate the impact of correction on point estimates. RESULTS: Eight (19%) of 43 papers selected reported estimates of outcome ascertainment accuracy, with positive predictive value (PPV) the most commonly reported measure (7 of 8 studies). Three studies (7%) corrected for the bias, 1 by exposure strata, and 5 (12%) restricted analyses to confirmed cases. The app (http://apps.p-95.com/ISPE/) uses values of PPV and sensitivity (or a range of possible values) in each exposure stratum and returns corrected point estimates and confidence intervals. The app demonstrates that small differences between comparison groups in PPV or sensitivity can introduce bias even when accuracy estimates are high. CONCLUSIONS: Outcome misclassification is not usually corrected in pharmacoepidemiology database studies, although correction methods using routinely measured indices are available. Error indices are needed for each comparison group to correct RR estimates for these errors. The app should encourage understanding of this bias and increase adjustment.
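A short sketch may help show why stratum-specific PPV and sensitivity matter. It uses one common form of the correction (true cases = observed cases × PPV / sensitivity); whether the app applies exactly this formula is an assumption, the function names are illustrative, and confidence intervals are not covered.

```python
def corrected_risk(observed_cases, n, ppv, sensitivity):
    """Cumulative incidence after correcting the observed case count.

    observed_cases * ppv is the number of detected cases that are real;
    dividing by sensitivity scales that up to all true cases.
    """
    if not (0 < ppv <= 1 and 0 < sensitivity <= 1):
        raise ValueError("ppv and sensitivity must lie in (0, 1]")
    return observed_cases * ppv / sensitivity / n


def corrected_risk_ratio(exposed, unexposed):
    """Each argument is a tuple (observed_cases, n, ppv, sensitivity)."""
    return corrected_risk(*exposed) / corrected_risk(*unexposed)


# Hypothetical numbers: the observed RR is 1.20, and PPV differs
# by only 5 percentage points between the comparison groups.
exposed = (120, 10_000, 0.95, 0.80)
unexposed = (100, 10_000, 0.90, 0.80)
rr = corrected_risk_ratio(exposed, unexposed)
```

With equal PPV and sensitivity in both groups the factors cancel and the RR is unchanged, which is why the correction needs error indices for each comparison group.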
Subjects
Pharmacoepidemiology, Bias, Factual Databases, Humans, Incidence, Odds Ratio
ABSTRACT
The linkage between electronic health records (EHRs) and genotype data makes it possible to study the genetic susceptibility of a wide range of disease phenotypes. Although EHR-derived phenotype data are subject to misclassification, they have been shown to be useful for discovering susceptibility genes, particularly in the setting of phenome-wide association studies (PheWAS). It is essential to characterize discovered associations using gold standard phenotype data obtained by chart review. In this work, we propose a genotype-stratified case-control sampling strategy to select subjects for phenotype validation. We develop a closed-form maximum-likelihood estimator for the odds ratio parameters and a score statistic for testing genetic association using the combined validated and error-prone EHR-derived phenotype data, and assess the extent of power improvement provided by this approach. Compared with case-control sampling based only on EHR-derived phenotype data, our genotype-stratified strategy maintains nominal type I error rates and results in higher power for detecting associations. It also corrects the bias in the odds ratio parameter estimates and reduces the corresponding variance, especially when the minor allele frequency is small.
Subjects
Genome-Wide Association Study, Genetic Models, Electronic Health Records, Gene Frequency, Genotype, Humans, Odds Ratio, Phenotype, Single Nucleotide Polymorphism
ABSTRACT
In the presence of confounding, the consistency assumption required for identification of causal effects may be violated due to misclassification of the outcome variable. We introduce an inverse probability weighted approach to rebalance covariates across treatment groups while mitigating the influence of differential misclassification bias. First, using a simplified example taken from an administrative health care dataset, we introduce the approach for estimation of the marginal causal odds ratio in a simple setting with the use of internal validation information. We then extend this to the presence of additional covariates and use simulated data to investigate the finite sample properties of the proposed weighted estimators. Estimation of the weights is done using logistic regression with misclassified outcomes, and a bootstrap approach is used for variance estimation.
Subjects
Bias, Epidemiologic Confounding Factors, Likelihood Functions, Causality, Computer Simulation, Statistical Data Interpretation, Humans, Logistic Models, Odds Ratio, Probability
ABSTRACT
We examine the impact of nondifferential outcome misclassification on odds ratios estimated from pair-matched case-control studies and propose a Bayesian model to adjust these estimates for misclassification bias. The model relies on access to a validation subgroup with confirmed outcome status for all case-control pairs as well as prior knowledge about the positive and negative predictive value of the classification mechanism. We illustrate the model's performance on simulated data and apply it to a database study examining the presence of ten morbidities in the prodromal phase of multiple sclerosis.
Subjects
Bayes Theorem, Bias, Case-Control Studies, Factual Databases, British Columbia, Comorbidity, Computer Simulation, Statistical Data Interpretation, Humans, Multiple Sclerosis/complications, Odds Ratio
ABSTRACT
Joint misclassification of exposure and outcome variables can lead to considerable bias in epidemiological studies of causal exposure-outcome effects. In this paper, we present a new maximum likelihood based estimator for marginal causal effects that simultaneously adjusts for confounding and several forms of joint misclassification of the exposure and outcome variables. The proposed method relies on validation data for the construction of weights that account for both sources of bias. The weighting estimator, which is an extension of the outcome misclassification weighting estimator proposed by Gravel and Platt (Weighted estimation for confounded binary outcomes subject to misclassification. Stat Med 2018; 37: 425-436), is applied to reinfarction data. Simulation studies were carried out to study its finite sample properties and compare it with methods that do not account for confounding or misclassification. The new estimator showed favourable large sample properties in the simulations. Further research is needed to study the sensitivity of the proposed method and that of alternatives to violations of their assumptions. The implementation of the estimator is facilitated by a new R function (ipwm) in an existing R package (mecor).
Subjects
Statistical Models, Bias, Causality, Computer Simulation, Likelihood Functions
ABSTRACT
Background The bias implications of outcome misclassification arising from imperfect capture of mortality in claims-based studies are not well understood. Methods and Results We identified 2 cohorts of patients: (1) type 2 diabetes mellitus (n=8.6 million), and (2) heart failure (n=3.1 million), from Medicare claims (2012-2016). Within the 2 cohorts, mortality was identified from claims using the following approaches: (1) all-place all-cause mortality, (2) in-hospital all-cause mortality, (3) all-place cardiovascular mortality (based on diagnosis codes for a major cardiovascular event within 30 days of death date), or (4) in-hospital cardiovascular mortality, and compared against National Death Index identified mortality. Empirically identified sensitivity and specificity based on observed values in the 2 cohorts were used to conduct Monte Carlo simulations for treatment effect estimation under differential and nondifferential misclassification scenarios. From the National Death Index, 1 544 805 deaths (549 996 [35.6%] cardiovascular deaths) in the type 2 diabetes mellitus cohort and 1 175 202 deaths (523 430 [44.5%] cardiovascular deaths) in the heart failure cohort were included. Sensitivity was 99.997% and 99.207% for the all-place all-cause mortality approach, whereas it was 27.71% and 33.71% for the in-hospital all-cause mortality approach in the type 2 diabetes mellitus and heart failure cohorts, respectively, with perfect positive predictive values. For all-place cardiovascular mortality, sensitivity was 52.01% in the type 2 diabetes mellitus cohort and 53.83% in the heart failure cohort, with positive predictive values of 49.98% and 54.45%, respectively. Simulations suggested a possibility for substantial bias in treatment effects. Conclusions Approaches to identify mortality from claims had variable performance compared with the National Death Index. Investigators should anticipate the potential for bias from outcome misclassification when using administrative claims to capture mortality.
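To see how differential and nondifferential misclassification affect a risk ratio, the expected observed risks can be computed directly. The sketch below is an analytic shortcut, not the Monte Carlo procedure used in the study; the true risks are hypothetical, and assigning the two reported in-hospital sensitivities to treatment arms is purely illustrative.

```python
def observed_risk(true_risk, sensitivity, specificity=1.0):
    """Expected observed risk when the outcome is captured imperfectly."""
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


def observed_risk_ratio(risk_treated, risk_control,
                        se_treated, se_control,
                        sp_treated=1.0, sp_control=1.0):
    """Expected observed risk ratio (treated versus control)."""
    return (observed_risk(risk_treated, se_treated, sp_treated)
            / observed_risk(risk_control, se_control, sp_control))


# Hypothetical true risks giving a true risk ratio of 0.80.
true_rr = 0.10 / 0.125

# Nondifferential: same low sensitivity in both arms, perfect specificity.
rr_nondiff = observed_risk_ratio(0.10, 0.125, 0.2771, 0.2771)

# Differential: sensitivity differs between arms (illustrative values).
rr_diff = observed_risk_ratio(0.10, 0.125, 0.2771, 0.3371)
```

Under perfect specificity, equal sensitivity in both arms leaves the expected risk ratio at its true value even when most deaths are missed, whereas unequal sensitivity scales it by the ratio of the two sensitivities.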
Subjects
Healthcare Administrative Claims/statistics & numerical data, Cardiovascular Diseases/mortality, Hospital Mortality/trends, Medicare/statistics & numerical data, Aged, Aged 80 and over, Bias, Cardiovascular Diseases/diagnosis, Cardiovascular Diseases/therapy, Cause of Death/trends, Type 2 Diabetes Mellitus/mortality, Female, Heart Failure/mortality, Humans, Male, Predictive Value of Tests, Sensitivity and Specificity, Treatment Outcome, United States/epidemiology
ABSTRACT
OBJECTIVES: Mortality due to rare diseases, which are substantial sources of premature mortality, is underreported in mortality studies. The objective of this study was to determine the completeness of reporting systemic lupus erythematosus (SLE) as a cause of death. METHODS: In 2017, we linked data on a Swedish population-based cohort (the Swedish Lupus Linkage, 2001-2013) comprising people with SLE (n = 8560) and their matched general population comparators (n = 37 717) to data from the Cause of Death Register. We reviewed death records of deceased people from the cohort (n = 5110) and extracted data on patient demographic characteristics and causes of death. We estimated odds ratios (ORs) and 95% confidence intervals (CIs) for not reporting SLE as a cause of death by using multivariable-adjusted logistic regression models. RESULTS: Of 1802 deaths among SLE patients in the study, 1071 (59%) did not have SLE reported on their death records. Most SLE decedents were aged 75-84 at death (n = 584, 32%), female (n = 1462, 81%), and born in Nordic countries (n = 1730, 96%). Decedents aged ≥85 at death were more likely to have SLE not reported on their death records than were decedents aged <50 (OR = 2.34; 95% CI, 1.48-3.68). Having renal failure listed as a cause of death decreased the likelihood of SLE not being reported on the death record (OR = 0.54; 95% CI, 0.40-0.73), whereas having cancer listed as a cause of death increased this likelihood (OR = 2.39; 95% CI, 1.85-3.07). CONCLUSIONS: SLE was greatly underreported as a cause of mortality on death records of SLE patients, particularly in older decedents and those with cancer, thereby underestimating the true burden of this disease. Public health resources need to focus on improving the recording of rare diseases in order to enhance the epidemiological utility of mortality data.
Subjects
Death Certificates, Systemic Lupus Erythematosus/epidemiology, Systemic Lupus Erythematosus/mortality, Rare Diseases, Aged, Aged 80 and over, Cardiovascular Diseases/epidemiology, Cause of Death, Female, Humans, Male, Premature Mortality, Neoplasms, Sweden
ABSTRACT
The availability of data from electronic health records facilitates the development and evaluation of risk-prediction models, but estimation of prediction accuracy could be limited by outcome misclassification, which can arise if events are not captured. We evaluate the robustness of prediction accuracy summaries, obtained from receiver operating characteristic curves and risk-reclassification methods, if events are not captured (i.e., "false negatives"). We derive estimators for sensitivity and specificity if misclassification is independent of marker values. In simulation studies, we quantify the potential for bias in prediction accuracy summaries if misclassification depends on marker values. We compare the accuracy of alternative prognostic models for 30-day all-cause hospital readmission among 4548 patients discharged from the University of Pennsylvania Health System with a primary diagnosis of heart failure. Simulation studies indicate that if misclassification depends on marker values, then the estimated accuracy improvement is also biased, but the direction of the bias depends on the direction of the association between markers and the probability of misclassification. In our application, 29% of the 1143 readmitted patients were readmitted to a hospital elsewhere in Pennsylvania, which reduced prediction accuracy. Outcome misclassification can result in erroneous conclusions regarding the accuracy of risk-prediction models.
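The case where misclassification is independent of marker values can be illustrated with a small derivation. The sketch below is worked out under stated assumptions and is not necessarily the estimator derived in the paper: events are missed with a known probability `1 - capture_prob` that does not depend on the marker, and no non-event is recorded as an event. Under those assumptions, captured events are a random subset of all events, so observed sensitivity needs no correction, while observed specificity is contaminated by missed events counted as non-events.

```python
def corrected_specificity(spec_obs, sens, prev_obs, capture_prob):
    """Recover specificity when some events are recorded as non-events.

    spec_obs     - specificity computed among recorded non-events
    sens         - sensitivity computed among recorded events
    prev_obs     - proportion of subjects with a recorded event
    capture_prob - assumed probability that a true event is recorded
    """
    prev_true = prev_obs / capture_prob   # all true events
    missed = prev_true - prev_obs         # events recorded as non-events
    # Recorded non-events are a mixture of true non-events (marker negative
    # with probability spec) and missed events (negative with 1 - sens).
    return ((1 - prev_obs) * spec_obs - missed * (1 - sens)) / (1 - prev_true)
```

In the readmission application, a capture probability near 0.71 would correspond to the 29% of readmissions that occurred elsewhere, assuming those are exactly the events that a single-system analysis would miss.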
ABSTRACT
PURPOSE: When learning bias analysis, epidemiologists are taught to quantitatively adjust for multiple biases by correcting study results in the reverse order of the error sequence. To understand the error sequence for a particular study, one must carefully examine the health study's epidemiologic data-generating process. In this article, we describe the unique data-generating process of a man-made disaster epidemiologic study. METHODS: We described the data-generating process and conducted a bias analysis for a study associating September 11, 2001 dust cloud exposure and self-reported newly physician-diagnosed asthma among rescue-recovery workers and volunteers. We adjusted an odds ratio (OR) estimate for the combined effect of missing data, outcome misclassification, and nonparticipation. RESULTS: Under our assumptions about systematic error, the ORs adjusted for all three biases ranged from 1.33 to 3.84. Most of the adjusted estimates were greater than the observed OR of 1.77 and were outside the 95% confidence limits (1.55, 2.01). CONCLUSIONS: Man-made disasters present some situations that are not observed in other areas of epidemiology. Future epidemiologic studies of disasters could benefit from a proactive approach that focuses on the technical aspect of data collection and gathers information on bias parameters to provide more meaningful interpretations of results.
Subjects
Air Pollutants/adverse effects, Asthma/diagnosis, Asthma/epidemiology, Rescue Work/statistics & numerical data, Selection Bias, Sensitivity and Specificity, Asthma/etiology, Evaluation Studies as Topic, Female, Humans, Male, Odds Ratio, Prevalence, Risk Assessment, Self Report, September 11 Terrorist Attacks, United States, Volunteers/statistics & numerical data
ABSTRACT
In the winter influenza epidemic season, patients with respiratory illnesses, including respiratory syncytial virus (RSV) infections, increase among young children. Therefore, we evaluated the effectiveness of influenza vaccine against influenza-like illness (ILI) using a technique to identify outbreaks of RSV infection and to distinguish those patients from ILI patients. The study subjects were 101 children aged 12 to 84 months attending nursery school. We classified the cases into 6 levels based on the definitions of ILI for outcomes. We established observation periods according to information obtained from regional surveillance and rapid diagnostic tests among children. Multivariate odds ratios (ORs) for each case classification were obtained using a logistic regression model for each observation period. For the entire observation period, ORs for cases with fever plus respiratory symptoms showed a marginally significant reduction. For the local influenza epidemic period, only the OR for the most serious cases was significantly decreased (0.20 [95% CI: 0.04-0.94]). During the influenza outbreak among the nursery school children, multivariate ORs for fever plus respiratory symptoms decreased significantly (≥ 38.0°C plus ≥ 1 symptom: 0.23 [0.06-0.91]; ≥ 38.0°C plus ≥ 2 symptoms: 0.21 [0.05-0.85]; ≥ 39.0°C plus ≥ 1 symptom: 0.18 [0.04-0.93]; and ≥ 39.0°C plus ≥ 2 symptoms: 0.16 [0.03-0.87]). These results suggest that confining observation to the peak influenza epidemic period and adopting a strict case classification system can minimize outcome misclassification when evaluating the effectiveness of influenza vaccine against ILI, even if influenza and RSV cocirculate in the same season.