scPretrain: multi-task self-supervised learning for cell-type classification.
Zhang, Ruiyi; Luo, Yunan; Ma, Jianzhu; Zhang, Ming; Wang, Sheng.
Affiliation
  • Zhang R; School of EECS, Peking University, Beijing, China.
  • Luo Y; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
  • Ma J; Department of Computer Science, Purdue University, West Lafayette, IN, USA.
  • Zhang M; Department of Biochemistry, Purdue University, West Lafayette, IN, USA.
  • Wang S; School of EECS, Peking University, Beijing, China.
Bioinformatics; 38(6): 1607-1614, 2022 Mar 04.
Article in English | MEDLINE | ID: mdl-34999749
ABSTRACT
MOTIVATION:

Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims to characterize and label groups of cells according to their gene expression, is one of the most important steps in single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to classify cells automatically. However, most existing supervised learning approaches use only annotated cells during training and ignore the far more abundant unannotated cells. In this article, we propose scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature-extraction encoder on each dataset's pseudo-labels, using only unannotated cells. In the fine-tuning step, scPretrain fine-tunes this encoder on the limited annotated cells of a new dataset.

RESULTS:

We evaluated scPretrain on 60 diverse datasets spanning different technologies, species and organs, and obtained a significant improvement in both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers such as random forest, logistic regression and support-vector machines. scPretrain effectively utilizes the massive amount of unlabeled data and can be applied to annotate the ever-growing number of scRNA-seq datasets.

AVAILABILITY AND IMPLEMENTATION:

The data and code underlying this article are available in "scPretrain: Multi-task self-supervised learning for cell-type classification" at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
Subject(s)

Full text: 1 Database: MEDLINE Main subject: Single-Cell Analysis / Random Forests Language: English Journal: Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2022 Document type: Article Country of affiliation: China