Search | Biblioteca Virtual em Saúde

1.

Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 118(543): 2171-2183, 2023.

Article in English | MEDLINE | ID: mdl-38143788

ABSTRACT

Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.

2.

Sparse Topic Modeling: Computational Efficiency, Near-Optimal Algorithms, and Statistical Inference.

Wu, Ruijia; Zhang, Linjun; Cai, T Tony.

J Am Stat Assoc ; 118(543): 1849-1861, 2023.

Article in English | MEDLINE | ID: mdl-37771513

ABSTRACT

Sparse topic modeling under the probabilistic latent semantic indexing (pLSI) model is studied. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed and their theoretical properties are investigated. Both minimax upper and lower bounds are established and the results show that the proposed algorithms are rate-optimal, up to a logarithmic factor. Moreover, a refitting algorithm is proposed to establish asymptotic normality and construct valid confidence intervals for the individual entries of the word-topic and topic-document matrices. Simulation studies are carried out to investigate the numerical performance of the proposed algorithms. The results show that the proposed algorithms perform well numerically and are more accurate in a range of simulation settings compared with the existing literature. In addition, the methods are illustrated through an analysis of the COVID-19 Open Research Dataset (CORD-19).

3.

Statistical Inference for High-Dimensional Generalized Linear Models with Binary Outcomes.

Cai, T Tony; Guo, Zijian; Ma, Rong.

J Am Stat Assoc ; 118(542): 1319-1332, 2023.

Article in English | MEDLINE | ID: mdl-37366472

ABSTRACT

This paper develops a unified statistical inference framework for high-dimensional binary generalized linear models (GLMs) with general link functions. Both unknown and known design distribution settings are considered. A two-step weighted bias-correction method is proposed for constructing confidence intervals and simultaneous hypothesis tests for individual components of the regression vector. A minimax lower bound for the expected length is established and the proposed confidence intervals are shown to be rate-optimal up to a logarithmic factor. The numerical performance of the proposed procedure is demonstrated through simulation studies and an analysis of a single-cell RNA-seq data set, which yields interesting biological insights that integrate well into the current literature on the cellular immune response mechanisms as characterized by single-cell transcriptomics. The theoretical analysis provides important insights into the adaptivity of optimal confidence intervals with respect to the sparsity of the regression vector. New lower bound techniques are introduced and they can be of independent interest to solve other inference problems in high-dimensional binary GLMs.

4.

Transfer Learning for High-Dimensional Linear Regression: Prediction, Estimation and Minimax Optimality.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J R Stat Soc Series B Stat Methodol ; 84(1): 149-173, 2022 Feb.

Article in English | MEDLINE | ID: mdl-35210933

ABSTRACT

This paper considers estimation and prediction of a high-dimensional linear regression in the setting of transfer learning where, in addition to observations from the target model, auxiliary samples from different but possibly related regression models are available. When the set of informative auxiliary studies is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. When the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and show its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating data from multiple different tissues as auxiliary samples.

5.

Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach.

Li, Sai; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 117(540): 1835-1846, 2022.

Article in English | MEDLINE | ID: mdl-36793369

ABSTRACT

Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population.

6.

Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference.

Cai, T Tony; Zhang, Anru R; Zhou, Yuchen.

IEEE Trans Inf Theory ; 68(9): 5975-6002, 2022 Sep.

Article in English | MEDLINE | ID: mdl-36865503

ABSTRACT

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model - an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

7.

Two robust tools for inference about causal effects with invalid instruments.

Kang, Hyunseung; Lee, Youjin; Cai, T Tony; Small, Dylan S.

Biometrics ; 78(1): 24-34, 2022 Mar.

Article in English | MEDLINE | ID: mdl-33616910

ABSTRACT

Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. However, in practice, some of the putative instrumental variables are likely to be invalid. This paper presents two tools to conduct valid inference and tests in the presence of invalid instruments. First, we propose a simple and general approach to construct confidence intervals based on taking unions of well-known confidence intervals. Second, we propose a novel test for the null causal effect based on a collider bias. Our two proposals outperform traditional instrumental variable confidence intervals when invalid instruments are present and can also be used as a sensitivity analysis when there is concern that instrumental variables assumptions are violated. The new approach is applied to a Mendelian randomization study on the causal effect of low-density lipoprotein on globulin levels.

Subjects

Mendelian Randomization Analysis , Bias , Causality

8.

Optimal Permutation Recovery in Permuted Monotone Matrix Model.

Ma, Rong; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(535): 1358-1372, 2021.

Article in English | MEDLINE | ID: mdl-34840367

ABSTRACT

Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y = ΘΠ + Z, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This paper studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall's tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between responders and non-responders among IBD patients after 8 weeks of treatment.
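The partial-recovery criterion named in this abstract can be illustrated with a minimal sketch. It assumes the common definition of the normalized Kendall's tau distance (the fraction of index pairs that two orderings rank in opposite order); the paper's exact normalization may differ, and the function name `normalized_kendall_tau` is hypothetical. The sketch does not implement the paper's estimator of Π.

```python
from itertools import combinations


def normalized_kendall_tau(perm_a, perm_b):
    """Fraction of index pairs ranked in opposite order by the two orderings.

    Assumed definition: discordant pairs divided by n * (n - 1) / 2.
    Returns 0.0 for identical orderings and 1.0 for fully reversed ones.
    """
    if len(perm_a) != len(perm_b):
        raise ValueError("permutations must have the same length")
    n = len(perm_a)
    if n < 2:
        return 0.0
    # A pair (i, j) is discordant when the two orderings disagree on its order.
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (perm_a[i] - perm_a[j]) * (perm_b[i] - perm_b[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)
```

Under this definition, an estimated column ordering that swaps one adjacent pair among three columns is at distance 1/3 from the true ordering, whereas the 0-1 loss would count it simply as a failed exact recovery.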

9.

Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models.

Ma, Rong; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(534): 984-998, 2021.

Article in English | MEDLINE | ID: mdl-34421157

ABSTRACT

High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this paper, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal over some sparsity range. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate (FDR) and falsely discovered variables (FDV) asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a data set of a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn's disease and the effects of treatment on such associations.

10.

Hypothesis testing for phylogenetic composition: a minimum-cost flow perspective.

Wang, Shulei; Cai, T Tony; Li, Hongzhe.

Biometrika ; 108(1): 17-36, 2021 Mar.

Article in English | MEDLINE | ID: mdl-33716568

ABSTRACT

Quantitative comparison of microbial composition from different populations is a fundamental task in various microbiome studies. We consider two-sample testing for microbial compositional data by leveraging phylogenetic information. Motivated by existing phylogenetic distances, we take a minimum-cost flow perspective to study such testing problems. We first show that multivariate analysis of variance with permutation using phylogenetic distances, one of the most commonly used methods in practice, is essentially a sum-of-squares type of test and has better power for dense alternatives. However, empirical evidence from real datasets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, we propose a new maximum type test, detector of active flow on a tree, and investigate its properties. We show that the proposed method is particularly powerful against sparse phylogenetic composition difference and enjoys certain optimality. The practical merit of the proposed method is demonstrated by simulation studies and an application to a human intestinal biopsy microbiome dataset on patients with ulcerative colitis.

11.

Optimal Estimation of Wasserstein Distance on A Tree with An Application to Microbiome Studies.

Wang, Shulei; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 116(535): 1237-1253, 2021.

Article in English | MEDLINE | ID: mdl-36860698

ABSTRACT

The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation, however, shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in microbiome between the inactive Crohn's disease patients and the normal controls.
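For orientation, the plug-in estimator that this abstract criticizes can be sketched as follows. This is a minimal illustration of the standard (unnormalized) weighted UniFrac formula, the sum over edges of branch length times the absolute difference in the proportion of reads below that edge; it is not the MET estimator. The tree encoding (a `parent` list with -1 for the root and a `length` list holding each node's branch length to its parent) and the function name are assumptions made for this example.

```python
def weighted_unifrac(parent, length, counts_p, counts_q):
    """Plug-in Wasserstein-1 distance between two count vectors on a rooted tree.

    parent[v]  : index of the parent of node v, or -1 for the root (assumed encoding)
    length[v]  : branch length of the edge from v to parent[v] (ignored for the root)
    counts_p/q : read counts attached to each node (typically nonzero only at leaves)
    """
    n = len(parent)

    def subtree_mass(counts):
        total = float(sum(counts))
        if total <= 0:
            raise ValueError("each sample needs a positive total count")
        mass = [0.0] * n
        # Push each node's proportion up to every ancestor, so mass[v] is the
        # proportion of reads in the subtree rooted at v.
        for node, count in enumerate(counts):
            v = node
            while v != -1:
                mass[v] += count / total
                v = parent[v]
        return mass

    mass_p = subtree_mass(counts_p)
    mass_q = subtree_mass(counts_q)
    # Each non-root node v stands for the edge (v, parent[v]).
    return sum(
        length[v] * abs(mass_p[v] - mass_q[v])
        for v in range(n)
        if parent[v] != -1
    )
```

Because the proportions are computed directly from the observed counts, sampling noise in sparse, high-dimensional count data feeds straight into the absolute differences, which is the source of the bias the abstract describes.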

12.

Large-Scale Simultaneous Testing of Cross-Covariance Matrices with Applications to PheWAS.

Cai, Tianxi; Cai, T Tony; Liao, Katherine; Liu, Weidong.

Stat Sin ; 29(2): 983-1005, 2019 Apr.

Article in English | MEDLINE | ID: mdl-31889766

ABSTRACT

Motivated by applications in phenome-wide association studies (PheWAS), we consider in this paper simultaneous testing of columns of high-dimensional cross-covariance matrices and develop a multiple testing procedure with theoretical guarantees. It is shown that the proposed testing procedure maintains a desired false discovery rate (FDR) and false discovery proportion (FDP) under mild regularity conditions. We also provide results on the magnitudes of the signals that can be detected with high power. Simulation studies demonstrate that the proposed procedure can be substantially more powerful than existing FDR controlling procedures in the presence of correlation of unknown structure. The proposed multiple testing procedure is applied to a PheWAS of two auto-immune genetic markers using a rheumatoid arthritis patient cohort constructed from the electronic medical records of Partners Healthcare System.

13.

Optimal Estimation of Genetic Relatedness in High-dimensional Linear Models.

Guo, Zijian; Wang, Wanjie; Cai, T Tony; Li, Hongzhe.

J Am Stat Assoc ; 114(525): 358-369, 2019.

Article in English | MEDLINE | ID: mdl-38434789

ABSTRACT

Estimating the genetic relatedness between two traits based on the genome-wide association data is an important problem in genetics research. In the framework of high-dimensional linear models, we introduce two measures of genetic relatedness and develop optimal estimators for them. One is genetic covariance, which is defined to be the inner product of the two regression vectors, and the other is genetic correlation, which is a normalized inner product by their lengths. We propose functional de-biased estimators (FDEs), which consist of an initial estimation step with the plug-in scaled Lasso estimator, and a further bias correction step. We also develop estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait. The estimators are shown to be minimax rate-optimal and can be efficiently implemented. Simulation results show that FDEs provide better estimates of the genetic relatedness than simple plug-in estimates. FDE is also applied to an analysis of a yeast segregant data set with multiple traits to estimate the genetic relatedness among these traits.
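The two target quantities defined in this abstract can be written out as a minimal sketch. It computes the definitions only (inner product, and inner product normalized by the Euclidean lengths) for regression vectors that are taken as given; it does not implement the functional de-biased estimator. The function names, and the convention of returning 0.0 when either vector is zero, are assumptions made for this example.

```python
import math


def genetic_covariance(beta, gamma):
    """Inner product of the two regression vectors."""
    if len(beta) != len(gamma):
        raise ValueError("regression vectors must have the same length")
    return sum(b * g for b, g in zip(beta, gamma))


def genetic_correlation(beta, gamma):
    """Inner product normalized by the Euclidean lengths of the two vectors."""
    norm_beta = math.sqrt(sum(b * b for b in beta))
    norm_gamma = math.sqrt(sum(g * g for g in gamma))
    if norm_beta == 0.0 or norm_gamma == 0.0:
        # Assumed convention for this sketch: undefined case reported as 0.0.
        return 0.0
    return genetic_covariance(beta, gamma) / (norm_beta * norm_gamma)
```

Applying these functions to estimated coefficients gives the simple plug-in estimates that the abstract reports as less accurate than FDEs.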

14.

Multiple Testing of Submatrices of a Precision Matrix with Applications to Identification of Between Pathway Interactions.

Xia, Yin; Cai, Tianxi; Cai, T Tony.

J Am Stat Assoc ; 113(521): 328-339, 2018.

Article in English | MEDLINE | ID: mdl-29881130

ABSTRACT

Making accurate inference for gene regulatory networks, including inference about pathway-by-pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix with each submatrix summarizing the dependence structure between two subgroups of variables. A novel multiple testing procedure is proposed and both theoretical and numerical properties of the procedure are investigated. Asymptotic null distribution of the test statistic for an individual hypothesis is established and the proposed multiple testing procedure is shown to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP) at the pre-specified level under regularity conditions. Simulations show that the procedure works well in controlling the FDR and has good power in detecting the true interactions. The procedure is applied to a breast cancer gene expression study to identify between-pathway interactions.

15.

Two-Sample Tests for High-Dimensional Linear Regression with an Application to Detecting Interactions.

Xia, Yin; Cai, Tianxi; Cai, T Tony.

Stat Sin ; 28: 63-92, 2018 Jan.

Article in English | MEDLINE | ID: mdl-29386856

ABSTRACT

Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.

16.

Joint testing and false discovery rate control in high-dimensional multivariate regression.

Xia, Yin; Cai, T Tony; Li, Hongzhe.

Biometrika ; 105(2): 249-269, 2018 Jun.

Article in English | MEDLINE | ID: mdl-30799872

ABSTRACT

Multivariate regression with high-dimensional covariates has many applications in genomic and genetic research, in which some covariates are expected to be associated with multiple responses. This paper considers joint testing for regression coefficients over multiple responses and develops simultaneous testing methods with false discovery rate control. The test statistic is based on inverse regression and bias-corrected group lasso estimates of the regression coefficients and is shown to have an asymptotic chi-squared null distribution. A row-wise multiple testing procedure is developed to identify the covariates associated with the responses. The procedure is shown to control the false discovery proportion and false discovery rate at a prespecified level asymptotically. Simulations demonstrate the gain in power, relative to entrywise testing, in detecting the covariates associated with the responses. The test is applied to an ovarian cancer dataset to identify the microRNA regulators that regulate protein expression.

17.

Weighted False Discovery Rate Control in Large-Scale Multiple Testing.

Basu, Pallavi; Cai, T Tony; Das, Kiranmoy; Sun, Wenguang.

J Am Stat Assoc ; 113(523): 1172-1183, 2018.

Article in English | MEDLINE | ID: mdl-31011234

ABSTRACT

The use of weights provides an effective strategy to incorporate prior domain knowledge in large-scale inference. This paper studies weighted multiple testing in a decision-theoretic framework. We develop oracle and data-driven procedures that aim to maximize the expected number of true positives subject to a constraint on the weighted false discovery rate. The asymptotic validity and optimality of the proposed methods are established. The results demonstrate that incorporating informative domain knowledge enhances the interpretability of results and precision of inference. Simulation studies show that the proposed method controls the error rate at the nominal level, and the gain in power over existing methods is substantial in many settings. An application to a genome-wide association study is discussed.

18.

Optimal detection of weak positive latent dependence between two sequences of multiple tests.

Zhao, Sihai Dave; Cai, T Tony; Li, Hongzhe.

J Multivar Anal ; 160: 169-184, 2017 Aug.

Article in English | MEDLINE | ID: mdl-29203948

ABSTRACT

It is frequently of interest to jointly analyze two paired sequences of multiple tests. This paper studies the problem of detecting whether there are more pairs of tests that are significant in both sequences than would be expected by chance. The asymptotic detection boundary is derived in terms of parameters such as the sparsity of non-null cases in each sequence, the effect sizes of the signals, and the magnitude of the dependence between the two sequences. A new test for detecting weak dependence is also proposed, shown to be asymptotically adaptively optimal, studied in simulations, and applied to study genetic pleiotropy in 10 pediatric autoimmune diseases.

19.

Sparse simultaneous signal detection for identifying genetically controlled disease genes.

Zhao, Sihai Dave; Cai, T Tony; Cappola, Thomas P; Margulies, Kenneth B; Li, Hongzhe.

J Am Stat Assoc ; 112(519): 1032-1046, 2017.

Article in English | MEDLINE | ID: mdl-29375169

ABSTRACT

Genome-wide association studies (GWAS) and differential expression analyses have had limited success in finding genes that cause complex diseases such as heart failure (HF), a leading cause of death in the United States. This paper proposes a new statistical approach that integrates GWAS and expression quantitative trait loci (eQTL) data to identify important HF genes. For such genes, genetic variations that perturb their expression are also likely to influence disease risk. The proposed method thus tests for the presence of simultaneous signals: SNPs that are associated with the gene's expression as well as with disease. An analytic expression for the p-value is obtained, and the method is shown to be asymptotically adaptively optimal under certain conditions. It also allows the GWAS and eQTL data to be collected from different groups of subjects, enabling investigators to integrate public resources with their own data. Simulation experiments show that it can be more powerful than standard approaches and also robust to linkage disequilibrium between variants. The method is applied to an extensive analysis of HF genomics and identifies several genes with biological evidence for being functionally relevant in the etiology of HF. It is implemented in the R package ssa.

20.

Phenome-Wide Association Study of Autoantibodies to Citrullinated and Noncitrullinated Epitopes in Rheumatoid Arthritis.

Liao, Katherine P; Sparks, Jeffrey A; Hejblum, Boris P; Kuo, I-Hsin; Cui, Jing; Lahey, Lauren J; Cagan, Andrew; Gainer, Vivian S; Liu, Weidong; Cai, T Tony; Sokolove, Jeremy; Cai, Tianxi.

Arthritis Rheumatol ; 69(4): 742-749, 2017 Apr.

Article in English | MEDLINE | ID: mdl-27792870

ABSTRACT

OBJECTIVE: Patients with rheumatoid arthritis (RA) develop autoantibodies against a spectrum of antigens, but the clinical significance of these autoantibodies is unclear. Using a phenome-wide association study (PheWAS) approach, we examined the association between autoantibodies and clinical subphenotypes of RA. METHODS: This study was conducted in a cohort of RA patients identified from the electronic medical records (EMRs) of 2 tertiary care centers. Using a published multiplex bead assay, we measured 36 autoantibodies targeting epitopes implicated in RA. We extracted all International Classification of Diseases, Ninth Revision (ICD-9) codes for each subject and grouped them into disease categories (PheWAS codes), using a published method. We tested for the association of each autoantibody (grouped by the targeted protein) with PheWAS codes. To determine significant associations (at a false discovery rate [FDR] of ≤0.1), we reviewed the medical records of 50 patients with each PheWAS code to determine positive predictive values (PPVs). RESULTS: We studied 1,006 RA patients; the mean ± SD age of the patients was 61.0 ± 12.9 years, and 79.0% were female. A total of 3,568 unique ICD-9 codes were grouped into 625 PheWAS codes; the 206 PheWAS codes with a prevalence of ≥3% were studied. Using the PheWAS method, we identified 24 significant associations of autoantibodies to epitopes at an FDR of ≤0.1. The associations that were strongest and had the highest PPV for the PheWAS code were autoantibodies against fibronectin and obesity (P = 6.1 × 10⁻⁴, PPV 100%), and that between fibrinogen and pneumonopathy (P = 2.7 × 10⁻⁴, PPV 96%). Pneumonopathy codes included diagnoses for cryptogenic organizing pneumonia and obliterative bronchiolitis. CONCLUSION: We demonstrated application of a bioinformatics method, the PheWAS, to screen for the clinical significance of RA-related autoantibodies. Using the PheWAS approach, we identified potentially significant links between variations in the levels of autoantibodies and comorbidities of interest in RA.

Subjects

Rheumatoid Arthritis/genetics , Rheumatoid Arthritis/immunology , Autoantibodies/genetics , Epitopes , Cyclic Peptides/immunology , Female , Genome-Wide Association Study , Humans , Male , Middle Aged , Phenotype
