Probabilistic count matrix factorization for single cell expression data analysis.

Durif, Ghislain; Modolo, Laurent; Mold, Jeff E; Lambert-Lacroix, Sophie; Picard, Franck

Affiliation

Durif G; Univ Lyon, Université Lyon 1, CNRS, LBBE UMR 5558, F Villeurbanne, France.
Modolo L; Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK UMR 5224, F Grenoble, France.
Mold JE; Université de Montpellier, CNRS, IMAG UMR 5149, F Montpellier, France.
Lambert-Lacroix S; Univ Lyon, Université Lyon 1, CNRS, LBBE UMR 5558, F Villeurbanne, France.
Picard F; Univ Lyon, ENS Lyon, Université Lyon 1, CNRS, LBMC UMR 5239, F Lyon, France.

Bioinformatics; 35(20): 4011-4019, 2019 Oct 15.

Article in English | MEDLINE | ID: mdl-30865271

ABSTRACT

MOTIVATION:

The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time or level. These emerging patterns are very challenging from a statistical point of view, especially when the goal is a summarized representation of single-cell expression data. Principal component analysis (PCA) is a powerful tool for high-dimensional data representation: it searches for latent directions that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections, which work poorly on over-dispersed count data with dropout events, such as single-cell expression data.

RESULTS:

We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm and jointly builds a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization and produces a compression of the data that is very powerful for clustering purposes. We compare our method against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression data.

AVAILABILITY AND IMPLEMENTATION:

Our work is implemented in the pCMF R-package (https://github.com/gdurif/pCMF).

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
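To make the Gamma-Poisson factor model named under RESULTS more concrete, the following is a minimal simulation sketch in Python with NumPy. It is not the pCMF R-package and does not reproduce its API or its variational EM inference. The function name `simulate_gamma_poisson`, the hyperparameter values, and the Bernoulli dropout mask are all assumptions chosen for illustration. The sketch draws non-negative cell and gene factors from Gamma distributions, draws counts from a Poisson distribution whose rate is the product of the factors, and then zeroes out entries at random to mimic dropout events.

```python
import numpy as np


def simulate_gamma_poisson(n_cells=100, n_genes=50, n_factors=2,
                           shape=2.0, rate=1.0, dropout_prob=0.3, seed=0):
    """Simulate counts from a Gamma-Poisson factor model (illustrative only).

    All parameter names and default values are hypothetical; they are not
    taken from the pCMF package.
    """
    rng = np.random.default_rng(seed)

    # Non-negative latent factors for cells (U) and genes (V),
    # drawn from Gamma(shape, rate). NumPy expects scale = 1 / rate.
    U = rng.gamma(shape, 1.0 / rate, size=(n_cells, n_factors))
    V = rng.gamma(shape, 1.0 / rate, size=(n_genes, n_factors))

    # Poisson rate for each (cell, gene) entry is the factor product.
    rates = U @ V.T
    counts = rng.poisson(rates)

    # Assumed dropout mechanism: each entry is kept with
    # probability 1 - dropout_prob, otherwise set to zero.
    keep = rng.random((n_cells, n_genes)) >= dropout_prob
    X = counts * keep
    return X, U, V


X, U, V = simulate_gamma_poisson()
```

In data simulated this way, the variance of the counts exceeds their mean, which is the over-dispersion that MOTIVATION describes as problematic for classical PCA.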

Subjects

Data Analysis; Software; Algorithms; High-Throughput Nucleotide Sequencing; Single-Cell Analysis

Full text: 1 Database: MEDLINE Main subject: Software / Data Analysis Language: En Publication year: 2019 Document type: Article
