ABSTRACT
MOTIVATION: Valid statistical inference is crucial for decision-making but difficult to obtain in supervised learning with multimodal data, e.g., combinations of clinical features, genomic data, and medical images. Multimodal data often warrant the use of black-box algorithms, for instance, random forests or neural networks, which impede the use of traditional variable significance tests. RESULTS: We address this problem by proposing the use of COvariance MEasure Tests (COMETs), which are calibrated and powerful tests that can be combined with any sufficiently predictive supervised learning algorithm. We apply COMETs to several high-dimensional, multimodal data sets to illustrate (i) variable significance testing for finding relevant mutations modulating drug activity, (ii) modality selection for predicting survival in liver cancer patients with multiomics data, and (iii) modality selection with clinical features and medical imaging data. In all applications, COMETs yield results consistent with domain knowledge without requiring data-driven pre-processing, which may invalidate type I error control. These novel applications with high-dimensional multimodal data corroborate prior results on the power and robustness of COMETs for significance testing. AVAILABILITY AND IMPLEMENTATION: COMETs are implemented in the comets R package available on CRAN and the pycomets Python library available on GitHub. Source code for reproducing all results is available at https://github.com/LucasKook/comets. All data sets used in this work are openly available.
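As a sketch of the underlying idea, covariance measure tests such as the generalized covariance measure (GCM) combine residuals from two regressions on the conditioning variables. The following is a minimal illustration under our own assumptions, not the comets/pycomets API; the function name and defaults are ours:

```python
# Minimal sketch of a GCM-style covariance measure test: regress X and Y on Z
# with a flexible learner, then test whether the residual products have mean 0.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def gcm_test(x, y, z, n_estimators=100, cv=5, seed=0):
    """Test H0: X independent of Y given Z via cross-fitted regression residuals."""
    rf = lambda: RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    rx = x - cross_val_predict(rf(), z, x, cv=cv)   # residual of X given Z
    ry = y - cross_val_predict(rf(), z, y, cv=cv)   # residual of Y given Z
    r = rx * ry                                     # residual products
    stat = np.sqrt(len(r)) * r.mean() / r.std()     # approx. N(0, 1) under H0
    return stat, 2 * norm.sf(abs(stat))             # two-sided p-value

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 5))
x = z[:, 0] + 0.5 * rng.normal(size=500)        # X depends on Z only
y = z[:, 0] ** 2 + 0.5 * rng.normal(size=500)   # Y depends on Z only: H0 holds
print(gcm_test(x, y, z))
```

Cross-fitting (via `cross_val_predict`) is used here because in-sample residuals from a flexible learner would be overfitted and could distort the test's calibration.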
Subjects
Algorithms; Supervised Machine Learning; Humans; Liver Neoplasms/genetics; Computational Biology/methods
ABSTRACT
How cells regulate their cell cycle is a central question in cell biology. Models of cell size homeostasis have been proposed for bacteria, archaea, yeast, plant, and mammalian cells. New experiments bring forth high volumes of data suitable for testing existing models of cell size regulation and for proposing new mechanisms. In this paper, we use conditional independence tests in conjunction with data on cell size at key cell cycle events (birth, initiation of DNA replication, and constriction) in the model bacterium Escherichia coli to select between competing cell cycle models. We find that in all growth conditions studied, the division event is controlled by the onset of constriction at midcell. In slow growth, we corroborate a model in which replication-related processes control the onset of constriction at midcell. In faster growth, we find that the onset of constriction is affected by additional cues beyond DNA replication. Finally, we also find evidence for additional cues triggering initiation of DNA replication, beyond the conventional notion that the mother cell solely determines the initiation event in the daughter cells via an adder-per-origin model. Conditional independence testing is a distinctive approach in the context of cell cycle regulation, and it can be used in future studies to further explore the causal links between cell cycle events.
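As an illustration of the generic ingredient, here is a simple conditional independence test based on partial correlation with a Fisher z-test. The linear-Gaussian assumption and the simulated cell-size variables are ours, not the paper's actual procedure:

```python
# Partial-correlation CI test: regress X and Y on Z, correlate the residuals,
# and apply Fisher's z-transform to get an approximate p-value.
import numpy as np
from scipy.stats import norm

def partial_corr_test(x, y, z):
    """Fisher z-test of X independent of Y given Z (z must be 2-D)."""
    Z = np.column_stack([np.ones(len(x)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    rho = np.corrcoef(rx, ry)[0, 1]                      # partial correlation
    zstat = np.sqrt(len(x) - z.shape[1] - 3) * np.arctanh(rho)
    return rho, 2 * norm.sf(abs(zstat))

rng = np.random.default_rng(1)
s_constriction = rng.normal(size=(400, 1))               # size at constriction onset
s_birth = s_constriction[:, 0] + 0.5 * rng.normal(size=400)
s_division = s_constriction[:, 0] + 0.5 * rng.normal(size=400)
# Birth and division sizes correlate only through the constriction onset:
print(partial_corr_test(s_birth, s_division, s_constriction))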
Subjects
Escherichia coli Proteins; Escherichia coli; Escherichia coli/genetics; Cell Cycle; Cell Division; DNA Replication; Escherichia coli Proteins/metabolism
ABSTRACT
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to generate knockoff copies of summary statistics directly and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm that greatly reduces the cost of generating knockoff copies without sacrificing power or FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Subjects
Algorithms; Alzheimer Disease; Computer Simulation; Humans; Alzheimer Disease/genetics; Models, Statistical; Data Interpretation, Statistical; Biometry/methods
ABSTRACT
Network psychometrics uses graphical models to assess the network structure of psychological variables. An important task in their analysis is determining which variables are unrelated in the network, i.e., are independent given the rest of the network variables. This conditional independence structure is a gateway to understanding the causal structure underlying psychological processes. Thus, it is crucial to have an appropriate method for evaluating conditional independence and dependence hypotheses. Bayesian approaches to testing such hypotheses allow researchers to differentiate between absence of evidence and evidence of absence of connections (edges) between pairs of variables in a network. Three Bayesian approaches to assessing conditional independence have been proposed in the network psychometrics literature. We believe that their theoretical foundations are not widely known, and therefore we provide a conceptual review of the proposed methods and highlight their strengths and limitations through a simulation study. We also illustrate the methods using an empirical example with data on Dark Triad Personality. Finally, we provide recommendations on how to choose the optimal method and discuss the current gaps in the literature on this important topic.
Subjects
Bayes Theorem; Psychometrics; Psychometrics/methods; Humans; Computer Simulation; Models, Statistical; Data Interpretation, Statistical
ABSTRACT
There is growing interest in evaluating whether the expression levels of two genes in a gene coexpression network remain dependent given samples' clinical information, a question in which the conditional independence test plays an essential role. For enhanced robustness to model assumptions, we propose a class of double-robust tests for evaluating the dependence of bivariate outcomes after controlling for known clinical information. Although the proposed test relies on the marginal density functions of the bivariate outcomes given clinical information, it remains valid as long as one of the density functions is correctly specified. Owing to a closed-form variance formula, the proposed test procedure is computationally efficient, requiring neither a resampling procedure nor tuning parameters. Acknowledging the need to infer conditional independence networks from high-dimensional gene expression data, we further develop a multiple testing procedure that controls the false discovery rate. Numerical results show that our method accurately controls both the type I error and the false discovery rate, and that it provides a degree of robustness to model misspecification. We apply the method to gene expression data from a gastric cancer study to understand the associations between genes in the transforming growth factor β signaling pathway given cancer-stage information.
Subjects
Gene Regulatory Networks; Neoplasms; Humans; Neoplasms/genetics
ABSTRACT
We introduce a method to draw causal inferences (inferences immune to all possible confounding) from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed digital twin test compares an observed offspring with carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between genotypes and phenotypes. We compare our method to the widely used transmission disequilibrium test and demonstrate enhanced power and localization.
Assuntos
Estudos de Associação Genética , Técnicas Genéticas , Variação Genética , Hereditariedade , Fenótipo , HumanosRESUMO
The conditional independence (CI) test is a fundamental problem in statistics. Many nonparametric CI tests have been developed, but a common challenge remains: current methods perform poorly with a high-dimensional conditioning set. In this paper, we consider a nonparametric CI test using a kernel-based test statistic, which can be viewed as an extension of the Hilbert-Schmidt Independence Criterion (HSIC). We propose a local bootstrap method to generate samples from the null distribution H0: X ⫫ Y | Z. Experimental results show that the proposed method leads to a significant performance improvement over previous methods. In particular, our method performs well as the dimension of the conditioning set grows, and it can be computed efficiently as the sample size and the dimension of the conditioning set grow.
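For orientation, the (unconditional) HSIC statistic that such tests build on can be sketched in a few lines; the conditional extension and the local bootstrap calibration used in the paper are more involved and are not reproduced here:

```python
# Biased HSIC estimator: trace(K H L H) / n^2 with centered RBF Gram matrices.
import numpy as np

def rbf_gram(x, sigma=1.0):
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return np.trace(K @ H @ L @ H) / n ** 2      # near 0 under independence

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 1))
print(hsic(x, x + 0.1 * rng.normal(size=(200, 1))))  # dependent: larger value
print(hsic(x, rng.normal(size=(200, 1))))            # independent: near 0
```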
ABSTRACT
The assumption that truncation and failure times are conditionally independent given covariates is fundamental and common in the regression analysis of left-truncated and right-censored data. Testing this assumption is essential to ensure correct inference on the failure time, but it has often been overlooked in the literature. Accounting for the challenges caused by left truncation and right censoring, we develop tests for this conditional independence assumption that combine the generalized odds ratio derived from a Cox proportional hazards model on the failure time with the concept of Kendall's tau. Beyond the Cox proportional hazards model, no additional model assumptions are imposed, and the distributions of the truncation time and conditioning variables are left unspecified. We establish the asymptotic properties of the test statistic and develop an easy implementation for obtaining its distribution. The performance of the proposed test is evaluated through simulation studies and two real studies.
ABSTRACT
We review the principal information-theoretic tools and their use for feature selection, with the main emphasis on classification problems with discrete features. Since empirical versions of conditional mutual information are known to perform poorly for high-dimensional problems, we focus on various ways of constructing its counterparts, and on the properties and limitations of such methods. We present a unified way of constructing such measures based on truncation, or truncation and weighting, of the Möbius expansion of conditional mutual information. We also discuss the main approaches to feature selection that apply the introduced measures of conditional dependence, together with ways of assessing the quality of the obtained vector of predictors. This involves a discussion of recent results on the asymptotic distributions of the empirical counterparts of these criteria, as well as advances in resampling.
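For concreteness, the plug-in estimator of conditional mutual information for discrete features, whose poor high-dimensional behaviour motivates the constructions above, can be sketched as follows (the toy data are ours):

```python
# Plug-in estimate of I(X;Y|Z) for discrete variables from empirical counts:
# I(X;Y|Z) = sum p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ].
import numpy as np
from collections import Counter

def cmi_discrete(x, y, z):
    n = len(x)
    nxyz = Counter(zip(x, y, z))
    nxz, nyz, nz = Counter(zip(x, z)), Counter(zip(y, z)), Counter(z)
    # Sample sizes cancel inside the log: p(z)p(x,y,z)/(p(x,z)p(y,z)) reduces
    # to a ratio of raw counts.
    return sum(c / n * np.log(nz[zi] * c / (nxz[xi, zi] * nyz[yi, zi]))
               for (xi, yi, zi), c in nxyz.items())

rng = np.random.default_rng(3)
z = rng.integers(0, 2, 2000)
x = z ^ rng.integers(0, 2, 2000)   # depends on z
y = z ^ rng.integers(0, 2, 2000)   # depends on z; independent of x given z
print(cmi_discrete(x, y, z))       # close to 0 (upward finite-sample bias)
print(cmi_discrete(x, x, z))       # close to H(X|Z) = ln 2 nats
```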
ABSTRACT
In this study, we focus on mixed data, which are either observations of univariate random variables that can be quantitative or qualitative, or observations of multivariate random variables in which each variable can include both quantitative and qualitative components. We first propose a novel method, called CMIh, to estimate conditional mutual information, taking advantage of previously proposed approaches for qualitative and quantitative data. We then introduce a new local permutation test, called LocAT (local adaptive test), which is well adapted to mixed data. Our experiments illustrate the good behaviour of CMIh and LocAT, and show their respective abilities to accurately estimate conditional mutual information and to detect conditional (in)dependence for mixed data.
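The idea of a local permutation test can be sketched generically: permute y only within strata that share (a discretization of) z, so that the null of conditional independence is mimicked while the y-z relation is preserved. This is a simplified stand-in of our own, not LocAT's actual neighborhood scheme:

```python
# Local permutation test: shuffle y within each z-stratum to build a null
# distribution for any x-y dependence statistic.
import numpy as np

def local_permutation_pvalue(stat_fn, x, y, z_groups, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = stat_fn(x, y)
    null = []
    for _ in range(n_perm):
        y_perm = y.copy()
        for g in np.unique(z_groups):
            idx = np.where(z_groups == g)[0]
            y_perm[idx] = y[rng.permutation(idx)]   # shuffle within stratum
        null.append(stat_fn(x, y_perm))
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

rng = np.random.default_rng(4)
z = rng.integers(0, 4, 600)                  # discrete conditioning variable
x = z + rng.normal(size=600)
y = z + rng.normal(size=600)                 # x independent of y given z
stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(local_permutation_pvalue(stat, x, y, z))   # typically non-significant
```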
ABSTRACT
In this work, we focus on a general family of measures of divergence for estimation and testing, with an emphasis on conditional independence in cross tabulations. For this purpose, a restricted minimum divergence estimator is used for the estimation of parameters under constraints, and a new double-index (dual) divergence test statistic is introduced and thoroughly examined. The associated asymptotic theory is provided, and the advantages and practical implications are explored via simulation studies.
ABSTRACT
Examined in this paper is the Gray and Wyner source coding problem for a simple network of correlated multivariate Gaussian random variables, Y1: Ω → R^{p1} and Y2: Ω → R^{p2}. The network consists of an encoder that produces two private rates R1 and R2 and a common rate R0, and two decoders: decoder 1 receives rates (R1, R0) and reproduces Y1 by Ŷ1, and decoder 2 receives rates (R2, R0) and reproduces Y2 by Ŷ2, with mean-square error distortions E‖Yi − Ŷi‖² ≤ Δi ∈ [0, ∞], i = 1, 2. Use is made of the weak stochastic realization and the geometric approach of such random variables to derive test channel distributions, which characterize the rates that lie on the Gray and Wyner rate region. Specific new results include: (1) a proof that, among all continuous or finite-valued random variables W: Ω → W, Wyner's common information, C(Y1, Y2) = inf I(Y1, Y2; W), where the infimum is over all joint distributions P_{Y1,Y2,W} such that P_{Y1,Y2|W} = P_{Y1|W} P_{Y2|W}, is achieved by a Gaussian random variable W: Ω → R^n of minimum dimension n, which makes the two components of the tuple (Y1, Y2) conditionally independent according to the weak stochastic realization of (Y1, Y2); the closed-form formula C(Y1, Y2) = (1/2) Σ_{j=1}^{n} ln[(1 + d_j)/(1 − d_j)], where d_j ∈ (0, 1), j = 1, …, n, are the canonical correlation coefficients of the correlated parts of Y1 and Y2; and a realization of (Y1, Y2, W) which achieves this; and (2) the parameterization of rates that lie on the Gray and Wyner rate region, and of several of its subsets. The discussion is largely self-contained and proceeds from first principles, while connections to prior literature are discussed.
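Result (1) is straightforward to evaluate numerically: compute the canonical correlations of the correlated parts and apply the closed-form formula. A small sketch with toy covariance blocks of our own choosing:

```python
# Wyner common information C(Y1,Y2) = (1/2) * sum_j ln((1 + d_j)/(1 - d_j))
# over the canonical correlations d_j of the two Gaussian blocks.
import numpy as np

def canonical_correlations(S11, S22, S12):
    """Singular values of S11^{-1/2} S12 S22^{-1/2}."""
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T
    return np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22), compute_uv=False)

S11, S22 = np.eye(2), np.eye(2)
S12 = np.array([[0.8, 0.0], [0.0, 0.3]])    # correlated parts of Y1, Y2 (toy)
d = canonical_correlations(S11, S22, S12)   # here d = (0.8, 0.3)
C = 0.5 * np.sum(np.log((1 + d) / (1 - d)))
print(d, C)                                 # common information in nats
```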
ABSTRACT
The Granger causality test is essential for detecting lead-lag relationships between time series. Traditionally, one uses a linear version of the test, essentially based on a linear time series regression, itself based on autocorrelations and cross-correlations of the series. In the present paper, we employ a local Gaussian approach in an empirical investigation of lead-lag and causality relations. The study is carried out for monthly recorded financial indices for ten countries in Europe, North America, Asia and Australia. The local Gaussian approach makes it possible to examine lead-lag relations locally and separately in the tails and in the center of the return distributions of the series. We show that this yields a new and much more detailed picture of these relationships: typically, the dependence is much stronger in the tails than in the center of the return distributions. The ensuing nonlinear Granger causality tests may detect causality where traditional linear tests fail.
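The traditional linear test that serves as the paper's baseline is available in statsmodels; a minimal example on simulated series (the financial-index data are not reproduced here, and the local Gaussian variant is not sketched):

```python
# Linear Granger causality test: does the second column help predict the first?
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(5)
lead = rng.normal(size=500)
follow = np.roll(lead, 2) + 0.5 * rng.normal(size=500)   # follows `lead` by 2 steps
data = np.column_stack([follow, lead])[2:]               # drop roll wrap-around
# Null hypothesis: the second column does NOT Granger-cause the first.
res = grangercausalitytests(data, maxlag=3)
print(res[2][0]["ssr_ftest"][1])                         # p-value at lag 2: tiny
```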
ABSTRACT
Gaussian graphical models are usually estimated from unreplicated data. Such data, however, are likely to comprise both signal and noise, which cannot be deconvoluted from unreplicated data; pragmatically, the noise is then ignored in practice. We point out the consequences of this practice for the reconstruction of the conditional independence graph of the signal. Replicated data allow for the deconvolution of signal and noise and the reconstruction of the former's conditional independence graph. To this end, we present a penalized Expectation-Maximization algorithm, with the penalty parameter chosen to maximize the F-fold cross-validated log-likelihood. Sampling schemes of the folds from replicated data are discussed. By simulation, we investigate the effect of replicates on the reconstruction of the signal's conditional independence graph, and we compare the proposed method to several obvious competitors. In an application, we use data from oncogenomic studies with replicates to reconstruct gene-gene interaction networks, operationalized as conditional independence graphs. This yields a realistic portrait of the effect of ignoring sources of variation other than sampling variation, and it has implications for the reproducibility of gene-gene interaction networks reported in the literature.
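The single-replicate practice critiqued here corresponds to the standard cross-validated graphical lasso; below is a minimal sketch of that baseline only (the paper's penalized EM for replicated data is not part of scikit-learn, and the toy chain graph is ours):

```python
# Baseline: estimate a sparse precision matrix (conditional independence graph)
# directly from noisy, unreplicated data with a cross-validated graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(6)
prec = np.eye(5) + np.diag([0.4] * 4, k=1) + np.diag([0.4] * 4, k=-1)  # chain graph
signal = rng.multivariate_normal(np.zeros(5), np.linalg.inv(prec), size=300)
noisy = signal + 0.5 * rng.normal(size=signal.shape)     # unmodelled noise

model = GraphicalLassoCV().fit(noisy)
# Zero off-diagonal precision entries correspond to absent edges; the noise
# biases this estimate toward the graph of signal+noise, not of the signal.
print(np.round(model.precision_, 2))
```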
Subjects
Algorithms; Gene Regulatory Networks; Computer Simulation; Humans; Normal Distribution; Reproducibility of Results
ABSTRACT
Pairwise Markov random field (PMRF) networks, including Gaussian graphical models (GGMs) and Ising models, have become the state-of-the-art method for psychopathology network analyses. Recent research has focused on the reliability and replicability of these networks. In the present study, we compared the existing suite of methods for maximizing and quantifying the stability and consistency of PMRF networks (i.e., lasso regularization, plus the bootnet and NetworkComparisonTest packages in R) with a set of metrics for directly comparing the detailed network characteristics interpreted in the literature (e.g., the presence, absence, sign, and strength of each individual edge). We compared GGMs of depression and anxiety symptoms in two waves of data from an observational study (n = 403) and reanalyzed four posttraumatic stress disorder GGMs from a recent study of network replicability. Taken at face value, the existing suite of methods indicated that the network edges were overall stable, interpretable, and consistent between networks, but the direct metrics of replication indicated that this was not the case (e.g., 39-49% of the edges in each network were unreplicated across the pairwise comparisons). We discuss reasons for these apparently contradictory results (e.g., relying on global summary statistics versus examining the detailed characteristics interpreted in the literature) and conclude that the limited reliability of the detailed characteristics of networks observed here is likely to be common in practice, but overlooked by current methods. Poor replicability underpins our concern about the use of these methods, given that generalizable conclusions are fundamental to the utility of their results.
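The kind of direct, edge-level replication metric advocated here can be computed in a few lines. A minimal sketch of our own (the exact-zero threshold and the toy edge-weight matrices are our simplifications):

```python
# Fraction of edges that replicate across two networks: same presence/absence
# AND same sign, over the upper triangle of the weighted adjacency matrices.
import numpy as np

def edge_replication(W1, W2):
    iu = np.triu_indices_from(W1, k=1)            # each undirected edge once
    e1, e2 = W1[iu], W2[iu]
    same_presence = (e1 != 0) == (e2 != 0)
    same_sign = np.sign(e1) == np.sign(e2)
    return np.mean(same_presence & same_sign)

W1 = np.array([[0, .3, 0], [.3, 0, -.2], [0, -.2, 0]])
W2 = np.array([[0, .25, .1], [.25, 0, 0], [.1, 0, 0]])
print(edge_replication(W1, W2))   # 1/3 of edges replicate in this toy pair
```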
Subjects
Anxiety; Stress Disorders, Post-Traumatic; Humans; Normal Distribution; Reproducibility of Results; Research Design
ABSTRACT
Protein homeostasis, proteostasis, is essential for healthy cell functioning and is dysregulated in many diseases. Metabolic labeling with heavy water followed by liquid chromatography coupled online to mass spectrometry (LC-MS) is a powerful high-throughput technique for studying proteome dynamics in vivo. Longer labeling durations and dense timepoint sampling (TPS) of tissues provide accurate estimates of proteome dynamics. However, the experiments are expensive: they require animal housing and care as well as labeling with stable isotopes, and the animals are often sacrificed at selected timepoints to collect tissues. Therefore, it is necessary to optimize TPS for a given number of sampling points and labeling duration, targeted to a specific tissue of study. Such techniques are currently missing in proteomics. Here, we report a formula-based stochastic simulation strategy for TPS in in vivo studies with heavy-water metabolic labeling and LC-MS. We model the rate constant (lognormal), measurement error (Laplace), peptide length (gamma), relative abundance of the monoisotopic peak (beta regression), and the number of exchangeable hydrogens (gamma regression). The parameters of the distributions are determined from the corresponding empirical probability density functions of a large-scale dataset of the murine heart proteome. The models are used in simulations of the rate constant to minimize the root-mean-square error (rmse). The rmse values for different TPSs show structured patterns, which are analyzed to elucidate their common features.
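A minimal sketch of the simulation idea under stated assumptions of our own: a one-pool exponential labeling curve, made-up parameter values, and generic curve fitting stand in for the paper's formula-based machinery:

```python
# Score a candidate timepoint sampling (TPS) scheme: draw rate constants from
# a lognormal, add Laplace measurement error to the labeling time courses, and
# compute the rmse of re-estimated rates.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)

def label_curve(t, k):
    return 1 - np.exp(-k * t)          # fraction labeled at time t (one pool)

def rmse_for_tps(timepoints, n_peptides=300, noise_scale=0.03):
    k_true = rng.lognormal(mean=-1.0, sigma=0.8, size=n_peptides)
    errs = []
    for k in k_true:
        y = label_curve(timepoints, k) + rng.laplace(0, noise_scale, len(timepoints))
        k_hat, _ = curve_fit(label_curve, timepoints, y, p0=[0.5])
        errs.append(k_hat[0] - k)
    return np.sqrt(np.mean(np.square(errs)))

print(rmse_for_tps(np.array([1.0, 3.0, 7.0, 14.0])))    # dense early sampling
print(rmse_for_tps(np.array([7.0, 10.0, 12.0, 14.0])))  # late-only sampling
```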
Subjects
Proteome; Tandem Mass Spectrometry; Animals; Chromatography, Liquid; Deuterium Oxide; Isotope Labeling; Mice
ABSTRACT
Complexity measures in the context of the Integrated Information Theory of consciousness try to quantify the strength of the causal connections between different neurons. This is done by minimizing the KL-divergence between a full system and one without causal cross-connections. Various measures have been proposed and compared in this setting. We will discuss a class of information geometric measures that aim at assessing the intrinsic causal cross-influences in a system. One promising candidate of these measures, denoted by ΦCIS, is based on conditional independence statements and does satisfy all of the properties that have been postulated as desirable. Unfortunately it does not have a graphical representation, which makes it less intuitive and difficult to analyze. We propose an alternative approach using a latent variable, which models a common exterior influence. This leads to a measure ΦCII, Causal Information Integration, that satisfies all of the required conditions. Our measure can be calculated using an iterative information geometric algorithm, the em-algorithm. Therefore we are able to compare its behavior to existing integrated information measures.
ABSTRACT
Combining the information bottleneck model with deep learning by replacing mutual information terms with deep neural nets has proven successful in areas ranging from generative modelling to interpreting deep neural networks. In this paper, we revisit the deep variational information bottleneck and the assumptions needed for its derivation. The two assumed properties of the data, X and Y, and their latent representation T take the form of two Markov chains, T - X - Y and X - T - Y. Requiring both to hold during the optimisation process can be limiting for the set of potential joint distributions P(X, Y, T). We therefore show how to circumvent this limitation by optimising a lower bound for the mutual information between T and Y, I(T; Y), for which only the latter Markov chain has to be satisfied. The mutual information I(T; Y) can be split into two non-negative parts. The first part is the lower bound for I(T; Y), which is optimised in the deep variational information bottleneck (DVIB) and cognate models in practice. The second part consists of two terms that measure how much the former requirement, T - X - Y, is violated. Finally, we propose interpreting the family of information bottleneck models as directed graphical models, and we show that in this framework, the original and deep information bottlenecks are special cases of a fundamental IB model.
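For orientation, here is the standard variational lower bound on I(T; Y) optimised in DVIB-type models, written in generic notation of our own; the paper's exact two-term violation decomposition is not reproduced here:

```latex
% For any variational decoder q(y|t), Gibbs' inequality gives a lower bound;
% the joint p(t,y) is the one induced by the encoder: p(t,y) = sum_x p(x,y) p(t|x).
\begin{aligned}
I(T;Y) &= H(Y) - H(Y \mid T)
        = H(Y) + \mathbb{E}_{p(t,y)}\bigl[\log p(y \mid t)\bigr] \\
       &\ge H(Y) + \mathbb{E}_{p(t,y)}\bigl[\log q(y \mid t)\bigr].
\end{aligned}
```

The gap is E_t[KL(p(y|t) ‖ q(y|t))] ≥ 0; the paper's contribution is the finer split of I(T; Y) that additionally isolates how much the chain T - X - Y is violated.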
ABSTRACT
A distributed binary hypothesis testing (HT) problem involving two parties, a remote observer and a detector, is studied. The remote observer has access to a discrete memoryless source, and communicates its observations to the detector via a rate-limited noiseless channel. The detector observes another discrete memoryless source, and performs a binary hypothesis test on the joint distribution of its own observations with those of the observer. While the goal of the observer is to maximize the type II error exponent of the test for a given type I error probability constraint, it also wants to keep a private part of its observations as oblivious to the detector as possible. Considering both equivocation and average distortion under a causal disclosure assumption as possible measures of privacy, the trade-off between the communication rate from the observer to the detector, the type II error exponent, and privacy is studied. For the general HT problem, we establish single-letter inner bounds on both the rate-error exponent-equivocation and rate-error exponent-distortion trade-offs. Subsequently, single-letter characterizations for both trade-offs are obtained (i) for testing against conditional independence of the observer's observations from those of the detector, given some additional side information at the detector; and (ii) when the communication rate constraint over the channel is zero. Finally, we provide a counter-example showing that the strong converse, which holds for distributed HT without a privacy constraint, fails when a privacy constraint is imposed. This implies that, in general, the rate-error exponent-equivocation and rate-error exponent-distortion trade-offs are not independent of the type I error probability constraint.
ABSTRACT
Quantitatively identifying direct dependencies between variables is an important task in data analysis, in particular for reconstructing various types of networks and causal relations in science and engineering. One of the most widely used criteria is partial correlation, but it can measure only linear direct associations and misses nonlinear ones. By contrast, conditional mutual information (CMI), which is based on conditional independence, can quantify nonlinear direct relationships among variables from observed data, making it superior to linear measures; however, it suffers from a serious underestimation problem, in particular for variables with tight associations in a network, which severely limits its applications. In this work, we propose a new concept, "partial independence," with a new measure, "part mutual information" (PMI), which not only overcomes the underestimation problem of CMI but also retains the quantification properties of both mutual information (MI) and CMI. Specifically, we first define PMI to measure nonlinear direct dependencies between variables and then derive its relations with MI and CMI. Finally, we use a number of simulated datasets as benchmark examples to numerically demonstrate the features of PMI, and we further use real gene expression data from Escherichia coli and yeast to reconstruct gene regulatory networks, all of which validates the advantages of PMI for accurately quantifying nonlinear direct associations in networks.
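For reference, the two baseline measures discussed above have these standard definitions (partial correlation written for a scalar conditioning variable Z; PMI itself is not restated here):

```latex
% Partial correlation (linear) and conditional mutual information (nonlinear),
% the two baseline measures of direct dependence compared in the abstract.
\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}
                         {\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}},
\qquad
\mathrm{CMI}(X;Y \mid Z) = \mathbb{E}_{p(x,y,z)}\!\left[
  \log \frac{p(x,y \mid z)}{p(x \mid z)\, p(y \mid z)} \right].
```

CMI vanishes exactly under conditional independence, which is the property PMI retains while correcting the underestimation described above.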