A sparse negative binomial mixture model for clustering RNA-seq count data.

Li, Yujia; Rahman, Tanbin; Ma, Tianzhou; Tang, Lu; Tseng, George C

Affiliation

Li Y; Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Rahman T; Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Ma T; Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA.
Tseng GC; Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.

Biostatistics; 24(1): 68-84, 2022 Dec 12.

Article in English | MEDLINE | ID: mdl-34363675

ABSTRACT

Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse $K$-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data models for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models under a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small $n$) with high-dimensional gene features (large $p$). A modified EM algorithm is used for inference, and the Bayesian information criterion is used to determine tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
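
To make the mixture-model idea concrete, the following is a minimal sketch of one E-step for a negative binomial mixture: given current parameter values, it computes each sample's posterior probability of belonging to each cluster. It is not the authors' implementation. The parameterization (mean `mu` and size `r`), the function names `nb_logpmf` and `e_step`, and the toy counts are all assumptions made for illustration; the lasso penalty, library-size normalization, and the M-step are omitted.

```python
import math


def nb_logpmf(y, mu, r):
    """Log-probability of count y under a negative binomial with mean mu > 0 and size r > 0."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def e_step(counts, means, sizes, weights):
    """Posterior cluster probabilities (responsibilities) for each sample.

    counts:  n samples x p genes (raw counts)
    means:   K clusters x p genes (hypothetical cluster-specific NB means)
    sizes:   p gene-wise NB size parameters (assumed shared across clusters)
    weights: K mixing proportions
    """
    resp = []
    for y in counts:
        # Log of (mixing weight x product of gene-wise NB probabilities) per cluster.
        log_joint = [
            math.log(w) + sum(nb_logpmf(yg, mg, rg) for yg, mg, rg in zip(y, mu_k, sizes))
            for w, mu_k in zip(weights, means)
        ]
        # Normalize with the log-sum-exp trick to avoid underflow.
        m = max(log_joint)
        unnorm = [math.exp(v - m) for v in log_joint]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp


# Toy data (made up): 4 samples, 3 genes, 2 clusters.
counts = [[2, 50, 10], [3, 45, 12], [40, 4, 11], [38, 6, 9]]
means = [[2.5, 48.0, 10.0], [39.0, 5.0, 10.0]]
sizes = [5.0, 5.0, 5.0]
weights = [0.5, 0.5]

resp = e_step(counts, means, sizes, weights)
labels = [max(range(len(probs)), key=probs.__getitem__) for probs in resp]
print(labels)
```

In this toy setup the third gene has the same mean in both clusters, so it contributes equally to every cluster's likelihood and does not affect the assignment. Genes of this kind are what a sparsity penalty on cluster-specific effects is meant to screen out.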

Subjects

Models, Statistical; Humans; RNA-Seq; Bayes Theorem; Cluster Analysis; Normal Distribution

Keywords

Cluster analysis; Feature selection; Gaussian mixture model; Sparse K-means

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Models, Statistical | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Humans | Language: En | Journal: Biostatistics | Publication year: 2022 | Document type: Article | Affiliation country: United States
