ABSTRACT
Set-based association analysis is a valuable tool in studying the etiology of complex diseases in genome-wide association studies, as it allows for the joint testing of variants in a region or group. Two common types of single nucleotide polymorphism (SNP)-disease functional models are recognized when evaluating the joint function of a set of SNPs: the cumulative weak signal model, in which multiple functional variants with small effects contribute to disease risk, and the dominating strong signal model, in which a few functional variants with large effects contribute to disease risk. However, existing methods have two main limitations that reduce their power. First, they typically consider only one disease-SNP association model, which can result in substantial power loss if the model is misspecified. Second, they do not account for the high-dimensional nature of SNPs, leading to low power or high false-positive rates. In this study, we address these challenges with a high-dimensional inference procedure that simultaneously fits many SNPs in a regression model. We also propose an omnibus testing procedure that employs a robust and powerful P-value combination method to enhance the power of SNP-set association. Results from extensive simulation studies demonstrate that our set-based high-dimensional inference strategy is both flexible and computationally efficient and can substantially improve the power of SNP-set association analysis. Application to a real dataset further demonstrates the utility of the testing strategy.
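To make the omnibus idea concrete, the sketch below combines P-values with the Cauchy combination rule, one widely used robust and powerful combination method; whether the paper's omnibus test uses this particular rule is an assumption made only for illustration, and the example P-values are hypothetical.

import numpy as np
from scipy.stats import cauchy

def cauchy_combine(pvalues, weights=None):
    # Cauchy combination rule: robust to unknown correlation among the tests,
    # and dominated by the smallest P-values.
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # map each P-value to a Cauchy quantile and average
    return cauchy.sf(t)                          # convert the combined statistic back to a P-value

# Hypothetical P-values from tests targeting different disease-SNP models.
print(cauchy_combine([0.002, 0.40, 0.73]))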
Subjects
Genome-Wide Association Study, Single Nucleotide Polymorphism, Genome-Wide Association Study/methods, Humans, Genetic Predisposition to Disease, Genetic Models, Algorithms, Computer Simulation
ABSTRACT
How do statistical dependencies in measurement noise influence high-dimensional inference? To answer this, we study the paradigmatic spiked matrix model of principal components analysis (PCA), where a rank-one matrix is corrupted by additive noise. We go beyond the usual independence assumption on the noise entries by drawing the noise from a low-order polynomial orthogonal matrix ensemble. The resulting noise correlations make the setting relevant for applications but analytically challenging. We provide a characterization of the Bayes-optimal limits of inference in this model. If the spike is rotation invariant, we show that standard spectral PCA is optimal. However, for more general priors, both PCA and the existing approximate message-passing algorithm (AMP) fall short of achieving the information-theoretic limits, which we compute using the replica method from statistical physics. We thus propose an AMP, inspired by the theory of adaptive Thouless-Anderson-Palmer equations, which is empirically observed to saturate the conjectured theoretical limit. This AMP comes with a rigorous state evolution analysis tracking its performance. Although we focus on specific noise distributions, our methodology can be generalized to a wide class of trace matrix ensembles at the cost of more involved expressions. Finally, despite the seemingly strong assumption of rotation-invariant noise, our theory empirically predicts algorithmic performance on real data, pointing at strong universality properties.
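As a point of reference for the spectral baseline discussed above, the following sketch simulates a rank-one spike corrupted by Wigner-type noise and recovers it with standard PCA; the i.i.d. Gaussian noise and Rademacher spike are simplifying assumptions for illustration, whereas the paper's focus is on correlated noise drawn from an orthogonal polynomial matrix ensemble.

import numpy as np

rng = np.random.default_rng(0)
n, snr = 2000, 3.0
x = rng.choice([-1.0, 1.0], size=n)          # rank-one spike (Rademacher prior, an illustrative choice)
Z = rng.normal(size=(n, n))
W = (Z + Z.T) / np.sqrt(2 * n)               # i.i.d. (Wigner-type) noise -- the simplifying assumption
Y = (snr / n) * np.outer(x, x) + W           # spiked observation

eigvals, eigvecs = np.linalg.eigh(Y)
x_hat = eigvecs[:, -1] * np.sqrt(n)          # spectral PCA estimate: rescaled top eigenvector
overlap = abs(x_hat @ x) / n                 # alignment with the planted spike
print(f"overlap with the spike: {overlap:.2f}")   # roughly sqrt(1 - 1/snr**2) above the spectral threshold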
ABSTRACT
Delineating associations between images and covariates is a central aim of imaging studies. To tackle this problem, we propose a novel non-parametric approach in the framework of spatially varying coefficient models, where the spatially varying functions are estimated through deep neural networks. Our method incorporates spatial smoothness, handles subject heterogeneity, and provides straightforward interpretations. It is also highly flexible and accurate, making it ideal for capturing complex association patterns. We establish estimation and selection consistency and derive asymptotic error bounds. We demonstrate the method's advantages through intensive simulations and analyses of two functional magnetic resonance imaging data sets.
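A toy version of a spatially varying coefficient model can make the setup concrete: the response at location s is x'beta(s), with beta(.) parameterized by a small neural network over spatial coordinates. The one-dimensional "space", single covariate, and plain mean-squared-error training below are illustrative assumptions, not the paper's estimator.

import math
import torch

torch.manual_seed(0)
n_sub, n_vox = 50, 200
s = torch.linspace(0, 1, n_vox).unsqueeze(1)              # 1-D "voxel" coordinates (toy assumption)
beta_true = torch.sin(2 * math.pi * s).squeeze()          # smooth spatially varying coefficient
x = torch.randn(n_sub, 1)                                 # one covariate per subject (toy assumption)
y = x @ beta_true.unsqueeze(0) + 0.1 * torch.randn(n_sub, n_vox)

# beta(s) is parameterized by a small neural network over the spatial coordinate.
beta_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(beta_net.parameters(), lr=1e-2)
for _ in range(500):
    beta_hat = beta_net(s).squeeze()                      # evaluate beta(.) at every voxel
    loss = ((x @ beta_hat.unsqueeze(0) - y) ** 2).mean()  # plain least-squares fit
    opt.zero_grad(); loss.backward(); opt.step()

err = (beta_net(s).squeeze() - beta_true).abs().max().item()
print(f"max absolute error of the estimated coefficient surface: {err:.3f}")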
ABSTRACT
A framework based on a mixture model of beta distributions is introduced to identify significant correlations among P features when P is large. The method relies on theorems in convex geometry, which are used to show how to control the error rate of edge detection in graphical models. The proposed 'betaMix' method does not require any assumptions about the network structure, nor does it assume that the network is sparse. The results hold for a wide class of data-generating distributions that include light-tailed and heavy-tailed spherically symmetric distributions. The results are robust for sufficiently large sample sizes and hold for non-elliptically-symmetric distributions.
ABSTRACT
We show that a statistical mechanics model where both the Sherrington-Kirkpatrick and Hopfield Hamiltonians appear, which is equivalent to a high-dimensional mismatched inference problem, is described by a replica symmetry-breaking Parisi solution.
ABSTRACT
Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks. We evaluate the mutual information (or "free entropy") from which we deduce the Bayes-optimal estimation and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed. Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions. We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
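The following toy simulation illustrates the random-design GLM setting and the fixed sample-to-dimension ratio regime, using a sign ("perceptron") teacher and a plain regularized logistic fit as a simple, non-optimal learner; the paper's results concern the Bayes-optimal errors and the generalized approximate message-passing algorithm, not this particular estimator.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d = 500
for alpha in (0.5, 2.0, 5.0):                     # sample-to-dimension ratios n/d
    n = int(alpha * d)
    w_star = rng.normal(size=d)                   # "teacher" weights
    X = rng.normal(size=(n, d)) / np.sqrt(d)      # random data matrix
    y = np.sign(X @ w_star)                       # noiseless perceptron labels
    w_hat = LogisticRegression(C=1.0, max_iter=2000).fit(X, y).coef_.ravel()
    overlap = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
    print(f"alpha = n/d = {alpha}: overlap with the teacher = {overlap:.2f}")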
ABSTRACT
Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent, and that there are well-known formulas that quantify the variability of these estimates for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure.
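A short simulation in the spirit of the abstract shows the inflation of the logistic MLE when p grows in proportion to n; the particular dimensions and signal strength below are illustrative choices, not the paper's setup.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 2000, 400                          # kappa = p/n = 0.2, a proportional regime
k = p // 4
beta = np.zeros(p)
beta[:k] = np.sqrt(5.0 / k)               # signal strength: Var(x'beta) = 5
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

mle = sm.Logit(y, X).fit(disp=0, maxiter=200).params
inflation = mle[:k].mean() / beta[0]
print(f"average inflation of the non-null MLE coefficients: {inflation:.2f} (classical theory predicts about 1)")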
ABSTRACT
Pathway analysis, i.e., grouping analysis, has important applications in genomic studies. Existing pathway analysis approaches are mostly focused on a single response and are not suitable for analyzing complex diseases that are often related to multiple response variables. Although a handful of approaches have been developed for multiple responses, these methods are mainly designed for pathways with a moderate number of features. A multi-response pathway analysis approach that is able to conduct statistical inference when the dimension is potentially higher than the sample size is introduced. Asymptotic properties of the test statistic are established and a theoretical investigation of the statistical power is conducted. Simulation studies and real data analysis show that the proposed approach performs well in identifying important pathways that influence multiple expression quantitative trait loci (eQTL).
ABSTRACT
BACKGROUND: Identifying gene interactions is a topic of great importance in genomics, and approaches based on network models provide a powerful tool for studying these. Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix. Inferring such biological networks is challenging because of the high dimensionality of the problem, making traditional estimators unsuitable. The graphical lasso is constructed for the estimation of sparse inverse covariance matrices in such situations, using ℓ1-penalization on the matrix entries. The weighted graphical lasso is an extension in which prior biological information from other sources is integrated into the model. There are, however, issues with this approach, as it naïvely forces the prior information into the network estimation, even if it is misleading or does not agree with the data at hand. Further, if an associated network based on other data is used as the prior, the method often fails to utilize the information effectively. RESULTS: We propose a novel graphical lasso approach, the tailored graphical lasso, that aims to handle prior information of unknown accuracy more effectively. We provide an R package implementing the method, tailoredGlasso. Applying the method to both simulated and real multiomic data sets, we find that it outperforms the unweighted and weighted graphical lasso in terms of all performance measures we consider. In fact, the graphical lasso and weighted graphical lasso can be considered special cases of the tailored graphical lasso, and a parameter determined by the data measures the usefulness of the prior information. We also find that, among a larger set of methods, the tailored graphical lasso is the most suitable for network inference from high-dimensional data with prior information of unknown accuracy. With our method, mRNA data are demonstrated to provide highly useful prior information for protein-protein interaction networks. CONCLUSIONS: The method we introduce utilizes useful prior information more effectively without involving any risk of loss of accuracy should the prior information be misleading.
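For orientation, the sketch below fits the plain (unweighted) graphical lasso that the tailored method extends, using scikit-learn in Python on simulated Gaussian data; the full tailored graphical lasso, with prior-information weighting, is provided by the authors' R package tailoredGlasso.

import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(3)
p, n = 30, 200
precision = make_sparse_spd_matrix(p, alpha=0.9, random_state=3)   # sparse true inverse covariance
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(precision), size=n)

model = GraphicalLassoCV().fit(X)                                   # l1-penalized precision estimate
off_diag = ~np.eye(p, dtype=bool)
est_edges = ((np.abs(model.precision_) > 1e-8) & off_diag).sum() // 2
true_edges = ((precision != 0) & off_diag).sum() // 2
print(f"estimated edges: {est_edges}, true edges: {true_edges}")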
Subjects
Algorithms, Gene Regulatory Networks, Genomics, Normal Distribution, Protein Interaction Maps
ABSTRACT
We introduce an estimation method for covariance matrices in a high-dimensional setting, i.e., when the dimension of the matrix, p, is larger than the sample size n. Specifically, we propose an orthogonally equivariant estimator. The eigenvectors of such an estimator are the same as those of the sample covariance matrix. The eigenvalue estimates are obtained from an adjusted profile likelihood function derived by approximating the integral of the density function of the sample covariance matrix over its eigenvectors, which is a challenging problem in its own right. Exact solutions to the approximate likelihood equations are obtained and employed to construct estimates that involve a tuning parameter. Bootstrap- and cross-validation-based algorithms are proposed to choose this tuning parameter under various loss functions. Finally, comparisons with two well-known orthogonally equivariant estimators are given, based on Monte-Carlo risk estimates for simulated data and misclassification errors in real data analyses.
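The orthogonally equivariant structure is easy to sketch: keep the sample-covariance eigenvectors and replace only the eigenvalues. The linear shrinkage rule below is a placeholder controlled by a tuning parameter; the paper instead derives the eigenvalue estimates from an adjusted profile likelihood and selects the tuning parameter by bootstrap or cross-validation.

import numpy as np

rng = np.random.default_rng(4)
p, n = 100, 50                                   # p > n: the sample covariance is singular
X = rng.normal(size=(n, p))
S = X.T @ X / n                                  # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)
tau = 0.5                                        # tuning parameter (chosen by bootstrap/CV in the paper)
shrunk = tau * eigvals + (1 - tau) * eigvals.mean()      # placeholder eigenvalue adjustment
Sigma_hat = eigvecs @ np.diag(shrunk) @ eigvecs.T        # same eigenvectors, new eigenvalues
print(f"smallest eigenvalue: sample {eigvals.min():.3f}, adjusted {shrunk.min():.3f}")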
ABSTRACT
After variable selection, standard inferential procedures for regression parameters may not be uniformly valid; there is no finite sample size at which a standard test is guaranteed to approximately attain its nominal size. This problem is exacerbated in high-dimensional settings, where variable selection becomes unavoidable. This has prompted a flurry of activity in developing uniformly valid hypothesis tests for a low-dimensional regression parameter (e.g., the causal effect of an exposure A on an outcome Y) in high-dimensional models. So far there has been limited focus on model misspecification, although this is inevitable in high-dimensional settings. We propose tests of the null hypothesis that are uniformly valid under sparsity conditions weaker than those typically invoked in the literature, assuming the working models for the exposure and outcome are both correctly specified. When one of the models is misspecified, our tests continue to be valid after amending the procedure for estimating the nuisance parameters; hence, they are doubly robust. Our proposals are straightforward to implement using existing software for penalized maximum likelihood estimation and do not require sample splitting. We illustrate them in simulations and an analysis of data obtained from the Ghent University intensive care unit.
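A closely related baseline, post-double-selection, conveys the flavour of combining penalized nuisance estimation with a low-dimensional test: select covariates predictive of the outcome and of the exposure with the lasso, then refit an unpenalized regression of the outcome on the exposure and the union of selected covariates. The sketch below, on synthetic data, implements only that baseline; the paper's doubly robust tests differ in how the nuisance parameters are estimated.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 300, 500
X = rng.normal(size=(n, p))
A = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)             # exposure model
Y = 1.0 * A + X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)   # outcome model; true effect of A is 1

sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, Y).coef_)        # covariates predictive of the outcome
sel_a = np.flatnonzero(LassoCV(cv=5).fit(X, A).coef_)        # covariates predictive of the exposure
controls = X[:, np.union1d(sel_y, sel_a)]

design = sm.add_constant(np.column_stack([A, controls]))
fit = sm.OLS(Y, design).fit()
print(f"estimated effect of A: {fit.params[1]:.2f}, 95% CI {np.round(fit.conf_int()[1], 2)}")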
Subjects
Computer Simulation, Causality, Humans, Sample Size
ABSTRACT
The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X1, ..., Xp) from n independent copies of X. There exist a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
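A small worked example of the G-block setting: groups of observed variables are noise-corrupted copies of shared latent factors, and two variables are deemed similar when their covariances with all other variables nearly coincide. The COD-style dissimilarity below follows that definition but is a simplified illustration, not the paper's COD or PECOK algorithms.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(6)
n, n_groups, per_group = 500, 3, 4
Z = rng.normal(size=(n, n_groups))                              # latent factors
X = np.repeat(Z, per_group, axis=1) + 0.3 * rng.normal(size=(n, n_groups * per_group))

S = np.cov(X, rowvar=False)
p = S.shape[0]
D = np.zeros((p, p))
for j in range(p):
    for k in range(p):
        others = ~np.isin(np.arange(p), [j, k])                 # compare covariances with all other variables
        D[j, k] = np.max(np.abs(S[j, others] - S[k, others]))

labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                  t=n_groups, criterion="maxclust")
print(labels)                                                    # variables in the same block share a label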
ABSTRACT
Large economic and financial networks may experience stage-wise changes as a result of external shocks. To detect and infer a structural change, we consider an inference problem within a framework of multiple Gaussian Graphical Models when the number of graphs and the dimension of graphs increase with the sample size. In this setting, two major challenges emerge as a result of the bias and uncertainty inherent in the regularization required to treat such overparameterized models. To deal with these challenges, the bootstrap method is utilized to approximate the sampling distribution of a likelihood ratio test statistic. We show theoretically that the proposed method leads to a correct asymptotic inference in a high-dimensional setting, regardless of the distribution of the test statistic. Simulations show that the proposed method compares favorably to its competitors such as the Likelihood Ratio Test. Finally, our statistical analysis of a network of 200 stocks reveals that the interacting units in the financial network become more connected as a result of the financial crisis between 2007 and 2009. More importantly, certain units respond more strongly than others. Furthermore, after the crisis, some changes weaken, while others strengthen.
ABSTRACT
Drawing inferences for high-dimensional models is challenging, as regular asymptotic theories are not applicable. This article proposes a new framework for simultaneous estimation and inference for high-dimensional linear models. By smoothing over partial regression estimates based on a given variable selection scheme, we reduce the problem to low-dimensional least squares estimation. The procedure, termed Selection-assisted Partial Regression and Smoothing (SPARES), utilizes data splitting along with variable selection and partial regression. We show that the SPARES estimator is asymptotically unbiased and normal, and we derive its variance via a nonparametric delta method. The utility of the procedure is evaluated under various simulation scenarios and via comparisons with the de-biased LASSO estimator, a major competitor. We apply the method to analyze two genomic datasets and obtain biologically meaningful results.
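The split/select/partial-regress loop can be sketched as follows: on one half of the data, select variables with the lasso; on the other half, estimate the coefficient of interest by low-dimensional least squares given the selected set; then smooth over repeated random splits. This is only an illustration of the idea on synthetic data; the actual SPARES estimator and its nonparametric delta-method variance are more involved.

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(7)
n, p = 400, 1000
beta = np.zeros(p)
beta[:5] = 1.0
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

target = 0                                        # coefficient for which inference is wanted
estimates = []
for _ in range(20):                               # smooth over repeated random splits
    idx = rng.permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    selected = set(np.flatnonzero(LassoCV(cv=5).fit(X[half1], y[half1]).coef_)) | {target}
    cols = sorted(selected)
    fit = LinearRegression().fit(X[np.ix_(half2, cols)], y[half2])
    estimates.append(fit.coef_[cols.index(target)])
print(f"smoothed estimate of beta_0: {np.mean(estimates):.2f} (truth: 1.0)")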
Subjects
Linear Models, Computer Simulation, Genomics/statistics & numerical data, Humans, Least-Squares Analysis, Regression Analysis
ABSTRACT
Data with high-dimensional covariates are now commonly encountered. Compared to other types of responses, research on high-dimensional data with censored survival responses is still relatively limited, and most of the existing studies have been focused on estimation and variable selection. In this study, we consider data with a censored survival response, a set of low-dimensional covariates of main interest, and a set of high-dimensional covariates that may also affect survival. The accelerated failure time model is adopted to describe survival. The goal is to conduct inference for the effects of low-dimensional covariates, while properly accounting for the high-dimensional covariates. A penalization-based procedure is developed, and its validity is established under mild and widely adopted conditions. Simulation suggests satisfactory performance of the proposed procedure, and the analysis of two cancer genetic datasets demonstrates its practical applicability.
ABSTRACT
BACKGROUND: Procedures for controlling the false discovery rate (FDR) are widely applied as a solution to the multiple comparisons problem of high-dimensional statistics. Current FDR-controlling procedures require accurately calculated p-values and rely on extrapolation into the unknown and unobserved tails of the null distribution. Both of these intermediate steps are challenging and can compromise the reliability of the results. RESULTS: We present a general method for controlling the FDR that capitalizes on the large amount of control data often found in big data studies to avoid these frequently problematic intermediate steps. The method utilizes control data to empirically construct the distribution of the test statistic under the null hypothesis and directly compares this distribution to the empirical distribution of the test data. By not relying on p-values, our control data-based empirical FDR procedure more closely follows the foundational principles of the scientific method: that inference is drawn by comparing test data to control data. The method is demonstrated through application to a problem in structural genomics. CONCLUSIONS: The method described here provides a general statistical framework for controlling the FDR that is specifically tailored for the big data setting. By relying on empirically constructed distributions and control data, it forgoes potentially problematic modeling steps and extrapolation into the unknown tails of the null distribution. This procedure is broadly applicable insofar as controlled experiments or internal negative controls are available, as is increasingly common in the big data setting.
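The core comparison can be sketched directly: treat statistics computed on negative-control data as an empirical null, and for each threshold estimate the FDR as the expected number of false discoveries under that empirical null divided by the number of observed discoveries in the test data. The simulated statistics below stand in for real control and test data, and the procedure shown is a stripped-down version of the general framework.

import numpy as np

rng = np.random.default_rng(8)
control = rng.normal(size=5000)                         # statistics from negative controls (empirical null)
test = np.concatenate([rng.normal(size=4500),           # mostly null test statistics ...
                       rng.normal(loc=3.0, size=500)])  # ... plus some genuine signals

def empirical_fdr(threshold):
    expected_false = np.mean(control >= threshold) * test.size   # false discoveries implied by the control data
    discoveries = max(int((test >= threshold).sum()), 1)
    return min(expected_false / discoveries, 1.0)

grid = np.linspace(test.min(), test.max(), 200)
passing = [t for t in grid if empirical_fdr(t) <= 0.10]
print(f"smallest threshold with empirical FDR <= 10%: {passing[0]:.2f}" if passing else "no threshold attains 10% FDR")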
Subjects
Statistical Models, Bayes Theorem, DNA Repair, Factual Databases, Human Genome, Humans
ABSTRACT
This paper proposes a decorrelation-based approach to test hypotheses and construct confidence intervals for the low-dimensional component of high-dimensional proportional hazards models. Motivated by the geometric projection principle, we propose new decorrelated score, Wald, and partial likelihood ratio statistics. Without assuming model selection consistency, we prove the asymptotic normality of these test statistics and establish their semiparametric optimality. We also develop new procedures for constructing pointwise confidence intervals for the baseline hazard function and the baseline survival function. Thorough numerical results are provided to support our theory.
ABSTRACT
Statistical analysis of multimodal imaging data is a challenging task, since the data involve high dimensionality, strong spatial correlations, and complex data structures. In this paper, we propose rigorous statistical testing procedures for making inferences on the complex dependence of multimodal imaging data. Motivated by the analysis of multitask fMRI data in the Human Connectome Project (HCP) study, we particularly address three hypothesis testing problems: (a) testing independence among imaging modalities over brain regions, (b) testing independence between brain regions within imaging modalities, and (c) testing independence between brain regions across different modalities. Considering a general form for all three tests, we develop a global testing procedure and a multiple testing procedure controlling the false discovery rate. We study theoretical properties of the proposed tests and develop a computationally efficient distributed algorithm. The proposed methods and theory are general and relevant for many statistical problems of testing independence structure among the components of high-dimensional random vectors with arbitrary dependence structures. We also illustrate our proposed methods via extensive simulations and analysis of five task fMRI contrast maps in the HCP study.
ABSTRACT
For high-dimensional generalized linear models (GLMs) with massive data, this paper investigates a unified optimal Poisson subsampling scheme to conduct estimation and inference for a prespecified low-dimensional partition of the whole parameter. A Poisson subsampling decorrelated score function is proposed such that the adverse effect of the less accurate nuisance parameter estimation, with its slow convergence rate, can be mitigated. The resulting Poisson subsample estimator is proved to enjoy consistency and asymptotic normality, and a more general optimal subsampling criterion, including the A- and L-optimality criteria, is formulated to improve estimation efficiency. We also propose a two-step algorithm for implementation and discuss some practical issues. The satisfactory performance of our method is validated through simulation studies and a real dataset.
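A simplified sketch of Poisson subsampling for a massive-data logistic regression: each observation is retained independently with its own probability and reweighted by the inverse of that probability in the subsample fit. The pilot-based sampling probabilities below are a heuristic stand-in; the paper's decorrelated score construction and A-/L-optimality results go well beyond this.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
N, p = 100_000, 10
X = rng.normal(size=(N, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

pilot = sm.Logit(y[:2000], X[:2000]).fit(disp=0).params         # cheap pilot estimate
resid = np.abs(y - 1.0 / (1.0 + np.exp(-X @ pilot)))            # misfit under the pilot
score = resid * np.linalg.norm(X, axis=1)                       # heuristic "informativeness" score
pi = np.minimum(1.0, 5000 * score / score.sum())                # expected subsample size around 5000

keep = rng.random(N) < pi                                       # Poisson subsampling: independent retention
w = 1.0 / pi[keep]                                              # inverse-probability weights
fit = sm.GLM(y[keep], X[keep], family=sm.families.Binomial(), var_weights=w).fit()
print(np.round(fit.params - beta, 3))                           # weighted subsample estimate vs. the truth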
ABSTRACT
We propose the Factor Augmented (sparse linear) Regression Model (FARM), which not only admits both the latent factor regression and sparse linear regression as special cases but also bridges dimension reduction and sparse regression. We provide theoretical guarantees for the estimation of our model under the existence of sub-Gaussian and heavy-tailed noises (with bounded (1 + ϑ)-th moment, for all ϑ > 0), respectively. In addition, the existing works on supervised learning often assume the latent factor regression or the sparse linear regression is the true underlying model without justifying its adequacy. To fill in such an important gap in high-dimensional inference, we also leverage our model as the alternative model to test the sufficiency of the latent factor regression and the sparse linear regression models. To accomplish these goals, we propose the Factor-Adjusted deBiased Test (FabTest) and a two-stage ANOVA-type test, respectively. We also conduct large-scale numerical experiments, including both synthetic and FRED macroeconomic data, to corroborate the theoretical properties of our methods. Numerical results illustrate the robustness and effectiveness of our model against latent factor regression and sparse linear regression models.
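The model structure can be sketched by extracting latent factors from the covariates with PCA and fitting a sparse regression on the estimated factors together with the factor-adjusted residual covariates; this is a rough illustration of the factor-augmented regression idea only, not the paper's estimators or tests (FabTest and the ANOVA-type test).

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(10)
n, p, K = 300, 200, 3
F = rng.normal(size=(n, K))                                   # latent factors
B = rng.normal(size=(p, K))                                   # loadings
U = rng.normal(size=(n, p))                                   # idiosyncratic components
X = F @ B.T + U
y = F @ np.array([1.0, -1.0, 0.5]) + U[:, :5] @ np.full(5, 0.8) + rng.normal(size=n)

# Estimate the factors by PCA (top-K eigenvectors of X X' / (n p)).
eigvals, eigvecs = np.linalg.eigh(X @ X.T / (n * p))
F_hat = eigvecs[:, -K:] * np.sqrt(n)
U_hat = X - F_hat @ np.linalg.lstsq(F_hat, X, rcond=None)[0]  # factor-adjusted covariates

farm_fit = LassoCV(cv=5).fit(np.column_stack([F_hat, U_hat]), y)
print("estimated factor coefficients (up to rotation):", np.round(farm_fit.coef_[:K], 2))
print("selected sparse components:", np.flatnonzero(np.abs(farm_fit.coef_[K:]) > 1e-6)[:10])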