Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance.

Fang, Cindy; Selega, Alina; Campbell, Kieran R

Affiliation

Fang C; Lunenfeld-Tanenbaum Research Institute, Toronto, Canada.
Selega A; Program in Bioinformatics and Computational Biology, University of Toronto, Toronto, Canada.
Campbell KR; Present address: Department of Biostatistics, Johns Hopkins University, Baltimore, USA.

Genome Biol; 25(1): 159, 2024 06 17.

Article in En | MEDLINE | ID: mdl-38886757

ABSTRACT

BACKGROUND:

The advent of single-cell RNA-sequencing (scRNA-seq) has driven significant computational methods development for all steps in the scRNA-seq data analysis pipeline, including filtering, normalization, and clustering. The large number of methods and their resulting parameter combinations has created a combinatorial set of possible pipelines to analyze scRNA-seq data, which leads to the obvious question: which is best? Several benchmarking studies compare methods but frequently find variable performance depending on dataset and pipeline characteristics. Alternatively, the large number of scRNA-seq datasets, along with advances in supervised machine learning, raises a tantalizing possibility: could the optimal pipeline be predicted for a given dataset?

RESULTS:

Here, we begin to answer this question by applying 288 scRNA-seq analysis pipelines to 86 datasets and quantifying pipeline success via a range of measures evaluating cluster purity and biological plausibility. We build supervised machine learning models to predict pipeline success given a range of dataset and pipeline characteristics. We find that prediction performance is significantly better than random and that in many cases pipelines predicted to perform well provide clustering outputs similar to expert-annotated cell type labels. We identify characteristics of datasets that correlate with strong prediction performance that could guide when such prediction models may be useful.
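The prediction setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pipeline options, the two dataset features, the synthetic scoring function, and the use of k-nearest-neighbour regression as a stand-in learner are all hypothetical, since the abstract does not specify the models, features, or pipeline components actually used.

```python
import itertools
import math
import random

# Hypothetical pipeline options; the abstract does not list the real ones.
FILTERS = ["lenient", "strict"]
NORMALIZATIONS = ["lognorm", "scran"]
RESOLUTIONS = [0.3, 0.8, 1.2]


def pipeline_features(filt, norm, res):
    """Encode one pipeline as a numeric feature vector."""
    return [float(filt == "strict"), float(norm == "scran"), res]


def synthetic_score(dataset, pipe, rng):
    """Toy 'pipeline success' score: a made-up relationship plus noise."""
    n_cells_log, sparsity = dataset
    strict, scran, res = pipe
    signal = 0.5 * strict * sparsity + 0.3 * scran - abs(res - 0.2 * n_cells_log)
    return signal + rng.gauss(0, 0.05)


def knn_predict(train_x, train_y, x, k=5):
    """Predict a score as the mean target of the k nearest training rows."""
    rows = sorted(zip(train_x, train_y), key=lambda row: math.dist(row[0], x))
    nearest = rows[:k]
    return sum(y for _, y in nearest) / len(nearest)


def recommend(train_x, train_y, dataset, pipelines, k=5):
    """Rank candidate pipelines for a new dataset, best predicted score first."""
    scored = []
    for p in pipelines:
        x = list(dataset) + pipeline_features(*p)
        scored.append((knn_predict(train_x, train_y, x, k), p))
    return sorted(scored, key=lambda s: s[0], reverse=True)


rng = random.Random(0)
pipelines = list(itertools.product(FILTERS, NORMALIZATIONS, RESOLUTIONS))

# Hypothetical dataset descriptors: (log10 cell count, fraction of zero counts).
datasets = [(rng.uniform(3, 5), rng.uniform(0.5, 0.95)) for _ in range(20)]

# Training table: one row per (dataset, pipeline) pair with its observed score.
train_x, train_y = [], []
for d in datasets:
    for p in pipelines:
        feats = pipeline_features(*p)
        train_x.append(list(d) + feats)
        train_y.append(synthetic_score(d, feats, rng))

# Recommend a pipeline for an unseen dataset.
new_dataset = (4.0, 0.9)
ranking = recommend(train_x, train_y, new_dataset, pipelines)
best_score, best_pipeline = ranking[0]
print(best_pipeline, round(best_score, 3))
```

The key design point is that each training row pairs dataset characteristics with pipeline characteristics, so a single model can score any candidate pipeline for a dataset it has not seen. In a real setting the synthetic score would be replaced by measured cluster-quality metrics.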

CONCLUSIONS:

Supervised machine learning models have utility for recommending analysis pipelines and therefore have the potential to alleviate the burden of choosing from the near-infinite number of possibilities. Different aspects of datasets influence the predictive performance of such models, which will further guide users.

Subjects

RNA-Seq; Single-Cell Gene Expression Analysis; Animals; Humans; Cluster Analysis; Computational Biology/methods; Machine Learning; RNA-Seq/methods; Sequence Analysis, RNA/methods; Supervised Machine Learning

Keywords

Automated machine learning; Benchmarking; Clustering; Single-cell RNA sequencing (scRNA-seq)

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: RNA-Seq / Single-Cell Gene Expression Analysis | Limits: Animals / Humans | Language: En | Journal: Genome Biol | Journal subject: MOLECULAR BIOLOGY / GENETICS | Year of publication: 2024 | Document type: Article | Country of affiliation: Canada