Search | BVS Regional Portal

Subject clustering by IF-PCA and several recent methods.

Chen, Dieyi; Jin, Jiashun; Ke, Zheng Tracy.

Front Genet ; 14: 1166404, 2023.

Article in English | MEDLINE | ID: mdl-37287536

ABSTRACT

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE, but still underperforms compared to IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).
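The abstract does not spell out the steps of IF-PCA, so the following is only a minimal sketch of the general idea, assuming the usual description of the method: screen features with a Kolmogorov-Smirnov score, then run PCA on the retained features and cluster subjects from the leading component. Two simplifications are assumptions of this sketch, not the authors' procedure: a fixed `n_keep` replaces the data-driven threshold, and a sign split on the first principal direction replaces k-means (so it handles two groups only). The names `ks_score`, `if_pca_two_groups`, and `simulate` are hypothetical, and only the Python standard library is used.

```python
import math
import random


def standardize(values):
    """Center and scale one feature (assumes the feature is not constant)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def ks_score(z_values):
    """Scaled Kolmogorov-Smirnov distance between an already standardized
    sample and the standard normal distribution."""
    n = len(z_values)
    dist = 0.0
    for i, x in enumerate(sorted(z_values)):
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        dist = max(dist, (i + 1) / n - cdf, cdf - i / n)
    return math.sqrt(n) * dist


def if_pca_two_groups(data, n_keep, n_iter=100, seed=0):
    """data: list of n subjects, each a list of p feature values.
    Returns (labels, kept_feature_indices). Simplified sketch only."""
    n, p = len(data), len(data[0])
    columns = [standardize([row[j] for row in data]) for j in range(p)]

    # Step 1 (feature screening): keep the n_keep features whose empirical
    # distribution is farthest from normal. A fixed count is an assumption
    # standing in for a data-driven threshold.
    scores = [ks_score(col) for col in columns]
    kept = sorted(range(p), key=lambda j: scores[j], reverse=True)[:n_keep]

    # Step 2 (PCA): leading eigenvector of the subject-by-subject Gram
    # matrix on the kept features, found by power iteration.
    gram = [[sum(columns[j][a] * columns[j][b] for j in kept)
             for b in range(n)] for a in range(n)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Step 3 (clustering): split subjects by the sign of their coordinate
    # on the leading component (a two-group stand-in for k-means).
    labels = [1 if x > 0 else 0 for x in v]
    return labels, kept


def simulate(n=100, p=200, n_signal=20, shift=3.0, seed=1):
    """Toy data: two groups that differ only on the first n_signal features."""
    rng = random.Random(seed)
    truth = [i % 2 for i in range(n)]
    data = []
    for g in truth:
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in range(n_signal):
            row[j] += shift if g == 1 else -shift
        data.append(row)
    return data, truth
```

Calling `if_pca_two_groups(*simulate()[:1], n_keep=20)` on the toy data returns one label per subject and the indices of the retained features. Cluster labels are arbitrary, so agreement with the true groups should be checked up to a label swap.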

A data harmonization pipeline to leverage external controls and boost power in GWAS.

Chen, Danfeng; Tashman, Katherine; Palmer, Duncan S; Neale, Benjamin; Roeder, Kathryn; Bloemendal, Alex; Churchhouse, Claire; Ke, Zheng Tracy.

Hum Mol Genet ; 31(3): 481-489, 2022 02 03.

Article in English | MEDLINE | ID: mdl-34508597

ABSTRACT

The use of external controls in genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn's disease. We demonstrate a boost in power over using the cohort samples alone, and that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual level genotypes, rather than summary statistics, are required.

Subject(s)

Genome-Wide Association Study; Polymorphism, Single Nucleotide; Cohort Studies; Genotype; Humans; Polymorphism, Single Nucleotide/genetics; Quality Control

QUADRO: A SUPERVISED DIMENSION REDUCTION METHOD VIA RAYLEIGH QUOTIENT OPTIMIZATION.

Fan, Jianqing; Ke, Zheng Tracy; Liu, Han; Xia, Lucy.

Ann Stat ; 43(4): 1498-1534, 2015.

Article in English | MEDLINE | ID: mdl-26778864

ABSTRACT

We propose a novel Rayleigh quotient based sparse quadratic dimension reduction method, named QUADRO (Quadratic Dimension Reduction via Rayleigh Optimization), for analyzing high-dimensional data. Unlike in the linear setting where Rayleigh quotient optimization coincides with classification, these two problems are very different under nonlinear settings. In this paper, we clarify this difference and show that Rayleigh quotient optimization may be of independent scientific interest. One major challenge of Rayleigh quotient optimization is that the variance of quadratic statistics involves all fourth cross-moments of predictors, which are infeasible to compute for high-dimensional applications and may accumulate too many stochastic errors. This issue is resolved by considering a family of elliptical models. Moreover, for heavy-tailed distributions, robust estimates of mean vectors and covariance matrices are employed to guarantee uniform convergence in estimating non-polynomially many parameters, even though only the fourth moments are assumed. Methodologically, QUADRO is based on elliptical models, which allow us to formulate the Rayleigh quotient maximization as a convex optimization problem. Computationally, we propose an efficient linearized augmented Lagrangian method to solve the constrained optimization problem. Theoretically, we provide explicit rates of convergence in terms of Rayleigh quotient under both Gaussian and general elliptical models. Thorough numerical results on both synthetic and real datasets are also provided to back up our theoretical results.
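For orientation, a two-class Rayleigh quotient for a quadratic statistic can be written as below. The abstract does not give a formula, so the notation here is assumed: X and Y denote predictors from the two classes, \pi is a class weight, and \Omega and \delta parametrize the quadratic statistic f.

```latex
% Assumed notation (not taken from the abstract):
%   X ~ class 1, Y ~ class 2, \pi in (0,1) a class weight,
%   f(x) = x^\top \Omega x - 2\,\delta^\top x a quadratic statistic.
\[
\mathrm{Rq}(f) \;=\;
\frac{\bigl\{\mathbb{E}[f(X)] - \mathbb{E}[f(Y)]\bigr\}^{2}}
     {\pi\,\operatorname{var}[f(X)] + (1-\pi)\,\operatorname{var}[f(Y)]}
\]
```

Under this form, the denominator shows where the difficulty named in the abstract comes from: expanding var[f(X)] for a quadratic f produces terms E[X_i X_j X_k X_l] over all index quadruples, which is on the order of d^4 moments for d predictors.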
