ABSTRACT
Life-course epidemiology relies on specifying complex (causal) models that describe how variables interact over time. Traditionally, such models have been constructed by perusing existing theory and previous studies. By comparing data-driven and theory-driven models, we investigated whether data-driven causal discovery algorithms can help in this process. We focused on a longitudinal data set on a cohort of Danish men (the Metropolit Study, 1953-2017). The theory-driven models were constructed by two subject-field experts. The data-driven models were constructed using the temporal Peter-Clark (TPC) algorithm, which exploits the temporal information embedded in life-course data. We found that the data-driven models recovered some, but not all, causal relationships included in the theory-driven expert models. The data-driven method was especially good at identifying direct causal relationships that the experts had high confidence in. Moreover, in a post hoc assessment, we found that most of the direct causal relationships proposed by the data-driven model but not included in the theory-driven model were plausible. Thus, the data-driven model may propose additional meaningful causal hypotheses that are new or have been overlooked by the experts. In conclusion, data-driven methods can aid causal model construction in life-course epidemiology, and combining both data-driven and theory-driven methods can lead to even stronger models.
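To make the role of temporal information concrete, the sketch below shows one way a TPC-style search can use measurement waves as tiers: edges between variables from different waves are oriented forward in time, and only same-wave edges are left for other orientation rules. The variable names and tier encoding are our own illustration, not the Metropolit Study's.

```python
# Illustrative sketch: using temporal tiers (as in a TPC-style search) to
# orient edges in a learned skeleton. A full TPC implementation also uses
# the tiers to restrict the conditional independence tests themselves.
def orient_by_tiers(skeleton_edges, tiers):
    """Orient undirected edges so no edge points from a later life-course
    period to an earlier one. `skeleton_edges` is a set of frozensets;
    `tiers` maps each variable to its measurement wave (0 = earliest)."""
    directed, undirected = [], []
    for edge in skeleton_edges:
        a, b = tuple(edge)
        if tiers[a] < tiers[b]:
            directed.append((a, b))       # earlier wave -> later wave
        elif tiers[a] > tiers[b]:
            directed.append((b, a))
        else:
            undirected.append((a, b))     # same wave: left to other rules
    return directed, undirected

# Hypothetical life-course variables measured at different ages
tiers = {"birth_weight": 0, "childhood_SES": 0, "education": 1, "adult_BMI": 2}
skeleton = {frozenset(e) for e in [("birth_weight", "adult_BMI"),
                                   ("childhood_SES", "education"),
                                   ("education", "adult_BMI")]}
print(orient_by_tiers(skeleton, tiers))
```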
Subject(s)
Algorithms, Theoretical Models, Male, Humans, Causality
ABSTRACT
We adapt graphical causal structure learning methods to apply to nonstationary time series data, specifically to processes that exhibit stochastic trends. We modify the likelihood component of the BIC score used by score-based search algorithms, such that it remains a consistent selection criterion for integrated or cointegrated processes. We use this modified score in conjunction with the SVAR-GFCI algorithm [15], which allows us to recover qualitative structural information about the underlying data-generating process even in the presence of latent (unmeasured) factors. We demonstrate our approach on both simulated and real macroeconomic data.
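For reference, the following is a minimal sketch of the standard linear-Gaussian BIC that score-based searches such as GES optimize. The paper's contribution is to replace the likelihood term (marked below) so that the score remains a consistent selection criterion for integrated or cointegrated processes; that modification is not reproduced here.

```python
# Reference sketch only: the standard linear-Gaussian BIC used by
# score-based searches. The paper modifies the likelihood term so the
# score stays consistent for stochastically trending series.
import numpy as np

def bic_node(data, node, parents):
    """BIC contribution of one variable given a candidate parent set."""
    n = data.shape[0]
    y = data[:, node]
    if parents:
        X = np.column_stack([np.ones(n), data[:, parents]])
    else:
        X = np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # <- term the paper modifies
    k = X.shape[1] + 1                                    # coefficients + variance
    return -2 * loglik + k * np.log(n)
```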
ABSTRACT
A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be identified if suitable interventions are applied; in many cases, however, such interventions are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, a problem known as causal discovery or causal structure search. This paper aims to give an introduction to, and a brief review of, the computational methods for causal discovery developed over the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by illustrations and applications.
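As a concrete illustration of the constraint-based ingredient, the sketch below implements the Fisher-z test of partial correlation that such methods commonly use as a conditional independence test for continuous data; it is a generic textbook construction, not code from the paper.

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond=()):
    """Fisher-z test of the partial correlation between columns i and j of
    `data` given the columns in `cond`. Returns a p-value; a large p-value
    is treated as conditional independence by constraint-based searches."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r))                 # Fisher z-transform
    stat = abs(z) * np.sqrt(n - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(stat))
```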
ABSTRACT
The heart of the scientific enterprise is a rational effort to understand the causes behind the phenomena we observe. In large-scale complex dynamical systems such as the Earth system, real experiments are rarely feasible. However, a rapidly increasing amount of observational and simulated data opens up the use of novel data-driven causal methods beyond the commonly adopted correlation techniques. Here, we give an overview of causal inference frameworks and identify promising generic application cases common in Earth system sciences and beyond. We discuss challenges and initiate the benchmark platform causeme.net to close the gap between method users and developers.
ABSTRACT
MOTIVATION: Integration of data from different modalities is a necessary step for multi-scale data analysis in many fields, including biomedical research and systems biology. Directed graphical models offer an attractive tool for this problem because they can represent both the complex, multivariate probability distributions and the causal pathways influencing the system. Graphical models learned from biomedical data can be used for classification, biomarker selection and functional analysis, while revealing the underlying network structure and thus allowing for arbitrary likelihood queries over the data. RESULTS: In this paper, we present and test new methods for finding directed graphs over mixed data types (continuous and discrete variables). We used this new algorithm, CausalMGM, to identify variables directly linked to disease diagnosis and progression in various multi-modal datasets, including clinical datasets from chronic obstructive pulmonary disease (COPD). COPD is the third leading cause of death and a major cause of disability; determining the factors that drive longitudinal lung function decline is therefore of great importance. Applied to a COPD dataset, mixed graphical models were able to confirm and extend previously described causal effects and provide new insights into the factors that potentially affect the longitudinal lung function decline of COPD patients. AVAILABILITY AND IMPLEMENTATION: The CausalMGM package is available at http://www.causalmgm.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Biological Models, Chronic Obstructive Pulmonary Disease, Algorithms, Humans, Prognosis, Chronic Obstructive Pulmonary Disease/diagnosis, Systems Biology
ABSTRACT
Many real datasets contain values missing not at random (MNAR). In this scenario, investigators often perform list-wise deletion, or delete samples with any missing values, before applying causal discovery algorithms. List-wise deletion is a sound and general strategy when paired with algorithms such as FCI and RFCI, but the deletion procedure also eliminates otherwise good samples that contain only a few missing values. In this report, we show that we can more efficiently utilize the observed values with test-wise deletion while still maintaining algorithmic soundness. Here, test-wise deletion refers to the process of list-wise deleting samples only among the variables required for each conditional independence (CI) test used in constraint-based searches. Test-wise deletion therefore often saves more samples than list-wise deletion for each CI test, especially when we have a sparse underlying graph. Our theoretical results show that test-wise deletion is sound under the justifiable assumption that none of the missingness mechanisms causally affect each other in the underlying causal graph. We also find that FCI and RFCI with test-wise deletion outperform their list-wise deletion and imputation counterparts on average when MNAR holds in both synthetic and real data.
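The contrast with list-wise deletion is easy to state in code. A minimal sketch of ours: rows are dropped only when a value is missing among the variables entering the particular CI test.

```python
import numpy as np

def testwise_delete(data, i, j, cond=()):
    """Keep the rows that are fully observed on the variables entering one
    CI test (columns i, j, and the conditioning set), rather than the rows
    fully observed on *all* variables as in list-wise deletion."""
    cols = [i, j, *cond]
    mask = ~np.isnan(data[:, cols]).any(axis=1)
    return data[mask][:, cols]

# With a sparse underlying graph most tests condition on few variables, so
# each test typically retains far more rows than list-wise deletion would.
```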
ABSTRACT
Several studies have indicated that bi-factor models fit a broad range of psychometric data better than alternative multidimensional models such as second-order models, e.g., Rodriguez, Reise and Haviland (2016), Gignac (2016), and Canivez (2016). Murray and Johnson (2013) and Gignac (2016) argue that this phenomenon is partially due to un-modeled complexities (e.g., un-modeled cross-factor loadings) that induce a bias in standard statistical measures favoring bi-factor models over second-order models. We extend the Murray and Johnson simulation studies to show how the ability to distinguish second-order and bi-factor models diminishes as the amount of un-modeled complexity increases. By using theorems about rank constraints on the covariance matrix to find sub-models of measurement models that have less un-modeled complexity, we are able to reduce the statistical bias in favor of bi-factor models; this allows researchers to reliably distinguish between bi-factor and second-order models.
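One family of rank constraints referred to here is the vanishing tetrad difference: if a single common factor d-separates two pairs of indicators, the corresponding 2x2 cross-covariance block has rank 1, so the tetrad difference vanishes in the population. A minimal, purely illustrative check on a sample covariance matrix:

```python
import numpy as np

def tetrad_difference(cov, a, b, c, d):
    """One tetrad difference for indicator columns a, b, c, d. If a single
    factor d-separates {a, b} from {c, d}, the cross-covariance block
    [[cov_ac, cov_ad], [cov_bc, cov_bd]] has rank 1, and this difference
    is zero in the population (only approximately zero in a sample)."""
    return cov[a, c] * cov[b, d] - cov[a, d] * cov[b, c]
```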
ABSTRACT
We present an algorithm for estimating bounds on causal effects from observational data which combines graphical model search with simple linear regression. We assume that the underlying system can be represented by a linear structural equation model with no feedback, and we allow for the possibility of latent confounders. Under assumptions standard in the causal search literature, we use conditional independence constraints to search for an equivalence class of ancestral graphs. Then, for each model in the equivalence class, we perform the appropriate regression (using causal structure information to determine which covariates to adjust for) to estimate a set of possible causal effects. Our approach is based on the IDA procedure of Maathuis et al. (2009), which assumes that all relevant variables have been measured (i.e., no latent confounders). We generalize their work by relaxing this assumption, which is often violated in applied contexts. We validate the performance of our algorithm in simulation experiments.
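The regression step can be sketched in IDA style, assuming the graphical search has already produced the candidate adjustment sets (one per member of the equivalence class); the function name and the enumeration of parent sets below are our own illustration, not code from the paper.

```python
import numpy as np

def ida_style_effects(data, x, y, parent_sets):
    """For each candidate adjustment set of x (one per member of the
    equivalence class), regress y on x plus that set and record the
    coefficient on x. The returned multiset bounds the causal effect."""
    n = data.shape[0]
    effects = []
    for pa in parent_sets:
        X = np.column_stack([np.ones(n), data[:, [x, *pa]]])
        beta, *_ = np.linalg.lstsq(X, data[:, y], rcond=None)
        effects.append(beta[1])  # coefficient on x
    return effects
```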
ABSTRACT
Discovering causal structure from observational data in the presence of latent variables remains an active research area. Constraint-based causal discovery algorithms are relatively efficient at discovering such causal models from data using independence tests. Typically, however, they derive and output only one such model. In contrast, Bayesian methods can generate and probabilistically score multiple models, outputting the most probable one; however, they are often computationally infeasible to apply when modeling latent variables. We introduce a hybrid method that derives a Bayesian probability that the set of independence tests associated with a given causal model are jointly correct. Using this constraint-based scoring method, we are able to score multiple causal models, which possibly contain latent variables, and output the most probable one. The structure-discovery performance of the proposed method is compared to an existing constraint-based method (RFCI) using data generated from several previously published Bayesian networks. The structural Hamming distances of the output models improved when using the proposed method compared to RFCI, especially for small sample sizes.
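The abstract does not give the scoring formula, so the following is only a schematic sketch under our own simplifying assumptions: each CI test is assumed to come with a probability of being correct, and these probabilities are (unrealistically) treated as independent. The paper's derivation of the joint probability is more careful.

```python
import math

def log_model_score(per_test_correct_probs):
    """Schematic only: log-probability that all CI statements implied by a
    candidate model are jointly correct, under the assumed independent
    treatment of per-test probabilities. Candidate models, possibly with
    latent variables, can then be ranked and the most probable one output."""
    return sum(math.log(p) for p in per_test_correct_probs)
```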
ABSTRACT
This paper aims to give broad coverage of the central concepts and principles involved in automated causal inference and emerging approaches to causal discovery from i.i.d. data and from time series. After reviewing concepts including manipulations, causal models, sample predictive modeling, causal predictive modeling, and structural equation models, we present the constraint-based approach to causal discovery, which relies on the conditional independence relationships in the data, and discuss the assumptions underlying its validity. We then focus on causal discovery based on structural equation models, in which a key issue is the identifiability of the causal structure implied by appropriately defined structural equation models: in the two-variable case, under what conditions (and why) is the causal direction between the two variables identifiable? We show that the independence between the error term and causes, together with appropriate structural constraints on the structural equation, makes it possible. Next, we report some recent advances in causal discovery from time series. Assuming that the causal relations are linear with non-Gaussian noise, we mention two problems that are traditionally difficult to solve: causal discovery from subsampled data and causal discovery in the presence of confounding time series. Finally, we list a number of open questions in the field of causal discovery and inference.
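The two-variable identifiability result can be illustrated with a crude sketch: regress each way and prefer the direction whose residual looks more independent of the regressor. The independence proxy below (correlation with a nonlinear transform of the regressor) is our simplification; LiNGAM-style methods use proper independence measures.

```python
import numpy as np

def pairwise_direction(x, y):
    """Crude sketch of the linear non-Gaussian idea: in the true causal
    direction the regression residual is independent of the regressor,
    while in the reverse direction it is not."""
    def dependence(cause, effect):
        cause_c = cause - cause.mean()
        b = cause_c @ (effect - effect.mean()) / (cause_c @ cause_c)
        resid = effect - b * cause
        # Plain correlation with a nonlinear transform as a rough
        # independence proxy (our simplification).
        return abs(np.corrcoef(np.tanh(cause), resid)[0, 1])
    return "x -> y" if dependence(x, y) < dependence(y, x) else "y -> x"

# Worked example: non-Gaussian (uniform) cause, linear mechanism
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = 2 * x + rng.uniform(-1, 1, 5000)
print(pairwise_direction(x, y))  # typically "x -> y"
```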
ABSTRACT
Existing score-based causal model search algorithms such as GES (and a speeded-up version, FGS) are asymptotically correct, fast, and reliable, but make the unrealistic assumption that the true causal graph does not contain any unmeasured confounders. There are several constraint-based causal search algorithms (e.g., RFCI, FCI, or FCI+) that are asymptotically correct without assuming that there are no unmeasured confounders, but they often perform poorly on small samples. We describe a combined score- and constraint-based algorithm, GFCI, that we prove is asymptotically correct. On synthetic data, GFCI is only slightly slower than RFCI but more accurate than FCI, RFCI, and FCI+.
ABSTRACT
We present an algorithm for estimating bounds on causal effects from observational data which combines graphical model search with simple linear regression. We assume that the underlying system can be represented by a linear structural equation model with no feedback, and we allow for the possibility of latent variables. Under assumptions standard in the causal search literature, we use conditional independence constraints to search for an equivalence class of ancestral graphs. Then, for each model in the equivalence class, we perform the appropriate regression (using causal structure information to determine which covariates to include in the regression) to estimate a set of possible causal effects. Our approach is based on the "IDA" procedure of Maathuis et al. (2009), which assumes that all relevant variables have been measured (i.e., no unmeasured confounders). We generalize their work by relaxing this assumption, which is often violated in applied contexts. We validate the performance of our algorithm on simulated data and demonstrate improved precision over IDA when latent variables are present.
ABSTRACT
Community-acquired pneumonia (CAP) is an important clinical condition with regard to patient mortality, patient morbidity, and healthcare resource utilization. The assessment of the likely clinical course of a CAP patient can significantly influence decision making about whether to treat the patient as an inpatient or as an outpatient. That decision can in turn influence resource utilization, as well as patient well-being. Predicting dire outcomes, such as mortality or severe clinical complications, is a particularly important component in assessing the clinical course of patients. We used a training set of 1601 CAP patient cases to construct 11 statistical and machine-learning models that predict dire outcomes. We evaluated the resulting models on 686 additional CAP-patient cases. The primary goal was not to compare these learning algorithms as a study end point; rather, it was to develop the best model possible to predict dire outcomes. A special version of an artificial neural network (NN) model predicted dire outcomes best. Using the 686 test cases, we estimated the expected healthcare quality and cost impact of applying the NN model in practice. The particular quantitative results of this analysis are based on a number of assumptions that we make explicit; they will require further study and validation. Nonetheless, the general implication of the analysis seems robust: even small improvements in predictive performance for prevalent and costly diseases, such as CAP, are likely to result in significant improvements in the quality and efficiency of healthcare delivery. Therefore, seeking models with the highest possible level of predictive performance is important, and seeking ever better machine-learning and statistical modeling methods is of great practical significance.
Subject(s)
Computer-Assisted Diagnosis/methods, Expert Systems, Outcome Assessment in Health Care/methods, Pneumonia/diagnosis, Pneumonia/mortality, Risk Assessment/methods, Survival Analysis, Community-Acquired Infections/diagnosis, Community-Acquired Infections/mortality, Clinical Decision Support Systems, Humans, Incidence, Pneumonia/therapy, Prognosis, ROC Curve, Retrospective Studies, Risk Factors, Survival Rate, United States/epidemiology
ABSTRACT
We present evidence of a potentially serious source of error intrinsic to all spotted cDNA microarrays that use IMAGE clones of expressed sequence tags (ESTs). We found that a high proportion of these EST sequences contain 5'-end poly(dT) sequences that are remnants from the oligo(dT)-primed reverse transcription of polyadenylated mRNA templates used to generate EST cDNA for sequence clone libraries. Analysis of expression data from two single-dye cDNA microarray experiments showed that ESTs whose sequences contain repeats of consecutive 5'-end dT residues appeared to be strongly coexpressed, while expression data of all other sequences exhibited no such pattern. Our analysis suggests that expression data from sequences containing 5' poly(dT) tracts are more likely to be due to systematic cross-hybridization of these poly(dT) tracts than to true mRNA coexpression. This indicates that existing data generated by cDNA microarrays containing IMAGE clone ESTs should be filtered to remove expression data containing significant 5' poly(dT) tracts.
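The recommended filtering can be approximated by scanning each EST sequence for a long run of T at its 5' end, as in the sketch below; the 15-base cutoff is our illustrative choice, not a threshold from the paper.

```python
def has_5prime_polyT(seq, min_run=15):
    """Flag an EST sequence whose 5' end begins with a long poly(dT)
    tract. The min_run cutoff is illustrative; an appropriate threshold
    should come from the analysis itself."""
    run = 0
    for base in seq.upper():
        if base == "T":
            run += 1
            if run >= min_run:
                return True
        else:
            break
    return False

# ESTs flagged this way would be removed before expression analysis.
print(has_5prime_polyT("TTTTTTTTTTTTTTTTGCA"))  # True (16 leading T's)
print(has_5prime_polyT("ATTTTTTTTTTTTTTTT"))    # False (run not at the 5' end)
```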
Subject(s)
Artifacts, Expressed Sequence Tags, Gene Expression Profiling/methods, Oligonucleotide Array Sequence Analysis/methods, Adipocytes/drug effects, Animals, Chromans/pharmacology, Humans, Mice, Poly T/analysis, Messenger RNA/analysis, Thiazolidinediones/pharmacology, Troglitazone
ABSTRACT
MOTIVATION: One approach to inferring genetic regulatory structure from microarray measurements of mRNA transcript hybridization is to estimate the associations of gene expression levels measured in repeated samples. The associations may be estimated by correlation coefficients, by conditional frequencies (for discretized measurements), or by some other statistic. Although these procedures have been successfully applied in other areas, their validity when applied to microarray measurements has yet to be tested. RESULTS: This paper describes an elementary statistical difficulty for all such procedures, whether based on Bayesian updating, conditional independence testing, or other machine learning procedures such as simulated annealing or neural net pruning. The difficulty arises whenever a number of cells from a common population are aggregated in a measurement of expression levels. Although there are special cases where the conditional associations are preserved under aggregation, in general, inference of genetic regulatory structure based on conditional association is unwarranted.
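A small simulation of our own (not from the paper) makes one version of the difficulty concrete: per cell the chain X -> Y -> Z renders Z independent of X given Y, but when each measurement sums expression over a random number of cells, the shared cell count re-introduces a conditional association in the aggregated data. All parameter values below are illustrative.

```python
# Simulation (ours): conditional independence at the cell level is
# destroyed by aggregating over a random number of cells per sample.
import numpy as np

rng = np.random.default_rng(1)

def partial_corr_xz_given_y(x, y, z):
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxz - rxy * ryz) / np.sqrt((1 - rxy**2) * (1 - ryz**2))

def simulate(n_samples=2000, mean_cells=200):
    X, Y, Z = [], [], []
    for _ in range(n_samples):
        n = rng.poisson(mean_cells)            # cells aggregated per sample
        x = rng.normal(1.0, 1.0, n)
        y = 0.2 * x + rng.normal(1.0, 1.0, n)  # per-cell chain X -> Y -> Z
        z = 0.2 * y + rng.normal(1.0, 1.0, n)
        X.append(x.sum()); Y.append(y.sum()); Z.append(z.sum())
    return np.array(X), np.array(Y), np.array(Z)

# Aggregated measurements: the conditional association reappears.
X, Y, Z = simulate()
print("aggregated: ", round(partial_corr_xz_given_y(X, Y, Z), 3))  # clearly nonzero

# Control at the single-cell level: the same statistic is near 0.
x = rng.normal(1.0, 1.0, 100000)
y = 0.2 * x + rng.normal(1.0, 1.0, 100000)
z = 0.2 * y + rng.normal(1.0, 1.0, 100000)
print("single cell:", round(partial_corr_xz_given_y(x, y, z), 3))  # ~0
```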