Search | VHL Search Portal

1.

Testing high-dimensional multinomials with applications to text analysis.

Cai, T Tony; Ke, Zheng T; Turner, Paxton.

J R Stat Soc Series B Stat Methodol ; 86(4): 922-942, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39279913

ABSTRACT

Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts.

2.

Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms.

Cai, T Tony; Ma, Rong.

IEEE Trans Inf Theory ; 70(1): 509-531, 2024 Jan.

Article in English | MEDLINE | ID: mdl-39036782

ABSTRACT

Motivated by applications in single-cell biology and metagenomics, we investigate the problem of matrix reordering based on a noisy disordered monotone Toeplitz matrix model. We establish the fundamental statistical limit for this problem in a decision-theoretic framework and demonstrate that a constrained least squares estimator achieves the optimal rate. However, due to its computational complexity, we analyze a popular polynomial-time algorithm, spectral seriation, and show that it is suboptimal. To address this, we propose a novel polynomial-time adaptive sorting algorithm with guaranteed performance improvement. Simulations and analyses of two real single-cell RNA sequencing datasets demonstrate the superiority of our algorithm over existing methods.

3.

Two robust tools for inference about causal effects with invalid instruments.

Kang, Hyunseung; Lee, Youjin; Cai, T Tony; Small, Dylan S.

Biometrics ; 78(1): 24-34, 2022 Mar.

Article in English | MEDLINE | ID: mdl-33616910

ABSTRACT

Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. However, in practice, some of the putative instrumental variables are likely to be invalid. This paper presents two tools to conduct valid inference and tests in the presence of invalid instruments. First, we propose a simple and general approach to construct confidence intervals based on taking unions of well-known confidence intervals. Second, we propose a novel test for the null causal effect based on a collider bias. Our two proposals outperform traditional instrumental variable confidence intervals when invalid instruments are present and can also be used as a sensitivity analysis when there is concern that instrumental variables assumptions are violated. The new approach is applied to a Mendelian randomization study on the causal effect of low-density lipoprotein on globulin levels.

Subject(s)

Mendelian Randomization Analysis , Bias , Causality

4.

Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference.

Cai, T Tony; Zhang, Anru R; Zhou, Yuchen.

IEEE Trans Inf Theory ; 68(9): 5975-6002, 2022 Sep.

Article in English | MEDLINE | ID: mdl-36865503

ABSTRACT

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model - an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

5.

Large-Scale Simultaneous Testing of Cross-Covariance Matrices with Applications to PheWAS.

Cai, Tianxi; Cai, T Tony; Liao, Katherine; Liu, Weidong.

Stat Sin ; 29(2): 983-1005, 2019 Apr.

Article in English | MEDLINE | ID: mdl-31889766

ABSTRACT

Motivated by applications in phenome-wide association studies (PheWAS), we consider in this paper simultaneous testing of columns of high-dimensional cross-covariance matrices and develop a multiple testing procedure with theoretical guarantees. It is shown that the proposed testing procedure maintains a desired false discovery rate (FDR) and false discovery proportion (FDP) under mild regularity conditions. We also provide results on the magnitudes of the signals that can be detected with high power. Simulation studies demonstrate that the proposed procedure can be substantially more powerful than existing FDR controlling procedures in the presence of correlation of unknown structure. The proposed multiple testing procedure is applied to a PheWAS of two auto-immune genetic markers using a rheumatoid arthritis patient cohort constructed from the electronic medical records of Partners Healthcare System.

6.

Two-Sample Tests for High-Dimensional Linear Regression with an Application to Detecting Interactions.

Xia, Yin; Cai, Tianxi; Cai, T Tony.

Stat Sin ; 28: 63-92, 2018 Jan.

Article in English | MEDLINE | ID: mdl-29386856

ABSTRACT

Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.

7.

Joint Estimation of Multiple High-dimensional Precision Matrices.

Cai, T Tony; Li, Hongzhe; Liu, Weidong; Xie, Jichun.

Stat Sin ; 26(2): 445-464, 2016 Apr.

Article in English | MEDLINE | ID: mdl-28316451

ABSTRACT

Motivated by analysis of gene expression data measured in different tissues or disease states, we consider joint estimation of multiple precision matrices to effectively utilize the partially shared graphical structures of the corresponding graphs. The procedure is based on a weighted constrained ℓ∞/ℓ1 minimization, which can be effectively implemented by a second-order cone programming. Compared to separate estimation methods, the proposed joint estimation method leads to estimators converging to the true precision matrices faster. Under certain regularity conditions, the proposed procedure leads to an exact graph structure recovery with a probability tending to 1. Simulation studies show that the proposed joint estimation methods outperform other methods in graph structure recovery. The method is illustrated through an analysis of an ovarian cancer gene expression data. The results indicate that the patients with poor prognostic subtype lack some important links among the genes in the apoptosis pathway.

8.

More powerful genetic association testing via a new statistical framework for integrative genomics.

Zhao, Sihai D; Cai, T Tony; Li, Hongzhe.

Biometrics ; 70(4): 881-90, 2014 Dec.

Article in English | MEDLINE | ID: mdl-24975802

ABSTRACT

Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.

Subject(s)

Algorithms , DNA Mutational Analysis/methods , Genetic Association Studies , Models, Statistical , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA/methods , Base Sequence , Computer Simulation , Models, Genetic , Molecular Sequence Data , Systems Integration

9.

Sparse Topic Modeling: Computational Efficiency, Near-Optimal Algorithms, and Statistical Inference.

Wu, Ruijia; Zhang, Linjun; Cai, T Tony.

J Am Stat Assoc ; 118(543): 1849-1861, 2023.

Article in English | MEDLINE | ID: mdl-37771513

ABSTRACT

Sparse topic modeling under the probabilistic latent semantic indexing (pLSI) model is studied. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed and their theoretical properties are investigated. Both minimax upper and lower bounds are established and the results show that the proposed algorithms are rate-optimal, up to a logarithmic factor. Moreover, a refitting algorithm is proposed to establish asymptotic normality and construct valid confidence intervals for the individual entries of the word-topic and topic-document matrices. Simulation studies are carried out to investigate the numerical performance of the proposed algorithms. The results show that the proposed algorithms perform well numerically and are more accurate in a range of simulation settings compared to the existing literature. In addition, the methods are illustrated through an analysis of the COVID-19 Open Research Dataset (CORD-19).

10.

Statistical Inference for High-Dimensional Generalized Linear Models with Binary Outcomes.

Cai, T Tony; Guo, Zijian; Ma, Rong.

J Am Stat Assoc ; 118(542): 1319-1332, 2023.

Article in English | MEDLINE | ID: mdl-37366472

ABSTRACT

This paper develops a unified statistical inference framework for high-dimensional binary generalized linear models (GLMs) with general link functions. Both unknown and known design distribution settings are considered. A two-step weighted bias-correction method is proposed for constructing confidence intervals and simultaneous hypothesis tests for individual components of the regression vector. Minimax lower bound for the expected length is established and the proposed confidence intervals are shown to be rate-optimal up to a logarithmic factor. The numerical performance of the proposed procedure is demonstrated through simulation studies and an analysis of a single cell RNA-seq data set, which yields interesting biological insights that integrate well into the current literature on the cellular immune response mechanisms as characterized by single-cell transcriptomics. The theoretical analysis provides important insights on the adaptivity of optimal confidence intervals with respect to the sparsity of the regression vector. New lower bound techniques are introduced and they can be of independent interest to solve other inference problems in high-dimensional binary GLMs.

11.

Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 118(543): 2171-2183, 2023.

Article in English | MEDLINE | ID: mdl-38143788

ABSTRACT

Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.

12.

Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 117(540): 1835-1846, 2022.

Article in English | MEDLINE | ID: mdl-36793369

ABSTRACT

Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population.

13.

Transfer Learning for High-Dimensional Linear Regression: Prediction, Estimation and Minimax Optimality.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J R Stat Soc Series B Stat Methodol ; 84(1): 149-173, 2022 Feb.

Article in English | MEDLINE | ID: mdl-35210933

ABSTRACT

This paper considers estimation and prediction of a high-dimensional linear regression in the setting of transfer learning where, in addition to observations from the target model, auxiliary samples from different but possibly related regression models are available. When the set of informative auxiliary studies is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. When the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and show its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating data from multiple different tissues as auxiliary samples.

14.

Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models.

Ma, Rong; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(534): 984-998, 2021.

Article in English | MEDLINE | ID: mdl-34421157

ABSTRACT

High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this paper, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal over some sparsity range. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate (FDR) and falsely discovered variables (FDV) asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a data set of a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn's disease and the effects of treatment on such associations.

15.

Optimal Permutation Recovery in Permuted Monotone Matrix Model.

Ma, Rong; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(535): 1358-1372, 2021.

Article in English | MEDLINE | ID: mdl-34840367

ABSTRACT

Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y = ΘΠ + Z, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This paper studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall's tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the non-responders of the IBD patients after 8 weeks of treatment.
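
The permuted monotone matrix model in this abstract can be illustrated by simulating one draw from it. This is a minimal sketch of the data-generating model only, not of the authors' estimator: the matrix sizes, the Gaussian noise, the noise scale, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8  # hypothetical numbers of samples (rows) and contigs (columns)

# Theta: mean matrix whose entries are monotone (here nondecreasing) within each row.
theta = np.sort(rng.normal(size=(n, p)), axis=1)

# Pi: permutation matrix; right-multiplying by it permutes the columns of Theta.
perm = rng.permutation(p)
Pi = np.eye(p)[:, perm]

# Z: noise matrix (Gaussian with standard deviation 0.1 is an assumption).
Z = 0.1 * rng.normal(size=(n, p))

# Observed matrix under the model Y = Theta Pi + Z.
Y = theta @ Pi + Z
```

Here column j of `theta @ Pi` equals column `perm[j]` of `theta`, so recovering Π amounts to recovering `perm` from `Y` alone.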

16.

Optimal Estimation of Wasserstein Distance on A Tree with An Application to Microbiome Studies.

Wang, Shulei; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(535): 1237-1253, 2021.

Article in English | MEDLINE | ID: mdl-36860698

ABSTRACT

The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation however shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in microbiome between the inactive Crohn's disease patients and the normal controls.

17.

Hypothesis testing for phylogenetic composition: a minimum-cost flow perspective.

Wang, Shulei; Cai, T Tony; Li, Hongzhe.

Biometrika ; 108(1): 17-36, 2021 Mar.

Article in English | MEDLINE | ID: mdl-33716568

ABSTRACT

Quantitative comparison of microbial composition from different populations is a fundamental task in various microbiome studies. We consider two-sample testing for microbial compositional data by leveraging phylogenetic information. Motivated by existing phylogenetic distances, we take a minimum-cost flow perspective to study such testing problems. We first show that multivariate analysis of variance with permutation using phylogenetic distances, one of the most commonly used methods in practice, is essentially a sum-of-squares type of test and has better power for dense alternatives. However, empirical evidence from real datasets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, we propose a new maximum type test, detector of active flow on a tree, and investigate its properties. We show that the proposed method is particularly powerful against sparse phylogenetic composition difference and enjoys certain optimality. The practical merit of the proposed method is demonstrated by simulation studies and an application to a human intestinal biopsy microbiome dataset on patients with ulcerative colitis.

18.

Optimal Estimation of Genetic Relatedness in High-dimensional Linear Models.

Guo, Zijian; Wang, Wanjie; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 114(525): 358-369, 2019.

Article in English | MEDLINE | ID: mdl-38434789

ABSTRACT

Estimating the genetic relatedness between two traits based on the genome-wide association data is an important problem in genetics research. In the framework of high-dimensional linear models, we introduce two measures of genetic relatedness and develop optimal estimators for them. One is genetic covariance, which is defined to be the inner product of the two regression vectors, and another is genetic correlation, which is a normalized inner product by their lengths. We propose functional de-biased estimators (FDEs), which consist of an initial estimation step with the plug-in scaled Lasso estimator, and a further bias correction step. We also develop estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait. The estimators are shown to be minimax rate-optimal and can be efficiently implemented. Simulation results show that FDEs provide better estimates of the genetic relatedness than simple plug-in estimates. FDE is also applied to an analysis of a yeast segregant data set with multiple traits to estimate the genetic relatedness among these traits.
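
The two relatedness measures defined in this abstract can be written out directly. The following is a minimal sketch, assuming the two regression vectors are already known; the paper's contribution is estimating these quantities from data, which this sketch does not attempt. The function names and example vectors are hypothetical.

```python
import numpy as np

def genetic_covariance(beta, gamma):
    """Inner product of the two regression vectors."""
    return float(np.dot(beta, gamma))

def genetic_correlation(beta, gamma):
    """Inner product normalized by the lengths (Euclidean norms) of the two vectors."""
    denom = np.linalg.norm(beta) * np.linalg.norm(gamma)
    if denom == 0.0:
        raise ValueError("correlation is undefined when a regression vector is zero")
    return float(np.dot(beta, gamma) / denom)

# Hypothetical regression vectors for two traits; gamma is a positive multiple of beta.
beta = np.array([1.0, 0.0, 2.0, 0.0])
gamma = np.array([2.0, 0.0, 4.0, 0.0])

cov = genetic_covariance(beta, gamma)
cor = genetic_correlation(beta, gamma)
```

Because `gamma` is a positive multiple of `beta` in this example, the correlation is 1 up to floating-point rounding, while the covariance still depends on the scale of both vectors.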

19.

Multiple Testing of Submatrices of a Precision Matrix with Applications to Identification of Between Pathway Interactions.

Xia, Yin; Cai, Tianxi; Cai, T Tony.

J Am Stat Assoc ; 113(521): 328-339, 2018.

Article in English | MEDLINE | ID: mdl-29881130

ABSTRACT

Making accurate inference for gene regulatory networks, including inferring about pathway by pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix with each submatrix summarizing the dependence structure between two subgroups of variables. A novel multiple testing procedure is proposed and both theoretical and numerical properties of the procedure are investigated. Asymptotic null distribution of the test statistic for an individual hypothesis is established and the proposed multiple testing procedure is shown to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP) at the pre-specified level under regularity conditions. Simulations show that the procedure works well in controlling the FDR and has good power in detecting the true interactions. The procedure is applied to a breast cancer gene expression study to identify between pathway interactions.

20.

Joint testing and false discovery rate control in high-dimensional multivariate regression.

Xia, Yin; Cai, T Tony; Li, Hongzhe.

Biometrika ; 105(2): 249-269, 2018 Jun.

Article in English | MEDLINE | ID: mdl-30799872

ABSTRACT

Multivariate regression with high-dimensional covariates has many applications in genomic and genetic research, in which some covariates are expected to be associated with multiple responses. This paper considers joint testing for regression coefficients over multiple responses and develops simultaneous testing methods with false discovery rate control. The test statistic is based on inverse regression and bias-corrected group lasso estimates of the regression coefficients and is shown to have an asymptotic chi-squared null distribution. A row-wise multiple testing procedure is developed to identify the covariates associated with the responses. The procedure is shown to control the false discovery proportion and false discovery rate at a prespecified level asymptotically. Simulations demonstrate the gain in power, relative to entrywise testing, in detecting the covariates associated with the responses. The test is applied to an ovarian cancer dataset to identify the microRNA regulators that regulate protein expression.
