RESUMEN
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the "Locally Spiky Sparse" (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called "Depth-Weighted Prevalence" (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
Asunto(s)
Algoritmos , Aprendizaje AutomáticoRESUMEN
Tree structures, showing hierarchical relationships and the latent structures between samples, are ubiquitous in genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable measured on each sample and the latent group structure represented by some given tree. Currently, this is addressed on an ad hoc basis, usually requiring the user to decide on an appropriate number of clusters to prune out of the tree to be tested against the response variable. Here, we present a statistical method with statistical guarantees that tests for association between the response variable and a fixed tree structure across all levels of the tree hierarchy with high power while accounting for the overall false positive error rate. This enhances the robustness and reproducibility of such findings.
RESUMEN
Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilisticly quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow up experiments.
Asunto(s)
Epistasis Genética , Estudio de Asociación del Genoma Completo , Humanos , Estudio de Asociación del Genoma Completo/métodos , Fenotipo , Herencia Multifactorial/genética , Modelos Logísticos , Polimorfismo de Nucleótido SimpleRESUMEN
The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci not prioritized by univariate genome-wide association analysis are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we show that these interactions are preserved at the level of the cardiac transcriptome. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.
RESUMEN
The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci not prioritized by univariate genome-wide association analysis are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we show that these interactions are preserved at the level of the cardiac transcriptome. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.
RESUMEN
Because haplotype information is of widespread interest in biomedical applications, effort has been put into their reconstruction. Here, we propose an efficient method, called haploSep, that is able to accurately infer major haplotypes and their frequencies just from multiple samples of allele frequency data. Even the accuracy of experimentally obtained allele frequencies can be improved by re-estimating them from our reconstructed haplotypes. From a methodological point of view, we model our problem as a multivariate regression problem where both the design matrix and the coefficient matrix are unknown. Compared to other methods, haploSep is very fast, with linear computational complexity in the haplotype length. We illustrate our method on simulated and real data focusing on experimental evolution and microbial data.