A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data.

Li, Ziyi; Wang, Yizhuo; Ganan-Gomez, Irene; Colla, Simona; Do, Kim-Anh

Affiliation

Li Z; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Wang Y; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Ganan-Gomez I; Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Colla S; Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Do KA; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.

Bioinformatics; 38(21): 4885-4892, 2022 Oct 31.

Article in English | MEDLINE | ID: mdl-36083008

ABSTRACT

MOTIVATION:

Single-cell RNA sequencing (scRNA-seq) has been widely used to decompose complex tissues into functionally distinct cell types. The first, and usually the most important, step of scRNA-seq data analysis is to annotate the cell types accurately. In recent years, many supervised annotation methods have been developed and shown to be more convenient and accurate than unsupervised cell clustering. One challenge faced by all supervised annotation methods is the identification of novel cell types, defined as cell types that are absent from the training data and exist only in the testing data. Existing methods usually label cells based simply on correlation coefficients or confidence scores, which sometimes leaves an excessive number of cells unlabeled.

RESULTS:

We developed a straightforward yet effective method that combines an autoencoder with iterative feature selection to automatically identify novel cells in scRNA-seq data. Our method trains an autoencoder on the labeled training data and applies it to the testing data to obtain reconstruction errors. By iteratively selecting features that show a bimodal pattern and reclustering the cells using the selected features, our method can accurately identify novel cells that are not present in the training data. We further combined this approach with a support vector machine to provide a complete solution for annotating the full range of cell types. Extensive numerical experiments using five real scRNA-seq datasets demonstrated favorable performance of the proposed method over existing methods serving similar purposes.

AVAILABILITY AND IMPLEMENTATION:

Our R software package CAMLU is publicly available through the Zenodo repository (https://doi.org/10.5281/zenodo.7054422) or the GitHub repository (https://github.com/ziyili20/CAMLU).

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
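
The novel-cell procedure summarized under RESULTS (reconstruction errors from a model trained on labeled cells, followed by alternating feature selection and reclustering) can be illustrated with a minimal Python sketch. This is not the CAMLU implementation, which is an R package. Several pieces are assumptions made only for illustration: a rank-k linear autoencoder (PCA) stands in for the neural autoencoder, one-dimensional 2-means stands in for the clustering step, the `separation` score is a hypothetical proxy for a "bimodal pattern", and all function names and the simulated data are invented here. The support vector machine step is omitted.

```python
import numpy as np


def fit_linear_autoencoder(train, n_latent):
    """Rank-k linear encoder/decoder (PCA), a stand-in for a neural autoencoder."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_latent]


def reconstruction_errors(data, mean, components):
    """Squared reconstruction error for every (cell, gene) entry."""
    centered = data - mean
    reconstructed = centered @ components.T @ components
    return (centered - reconstructed) ** 2


def two_means_1d(values, n_iter=50):
    """Split 1-D scores into a low and a high group; True marks the high group."""
    low, high = values.min(), values.max()
    labels = np.zeros(values.size, dtype=bool)
    for _ in range(n_iter):
        new_labels = np.abs(values - high) < np.abs(values - low)
        if not new_labels.any() or new_labels.all():
            return new_labels  # degenerate split: only one group
        low, high = values[~new_labels].mean(), values[new_labels].mean()
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


def separation(errors, labels):
    """Per-gene gap between the two cell groups, scaled by pooled spread
    (hypothetical proxy for how bimodal a gene's error distribution is)."""
    high, low = errors[labels], errors[~labels]
    pooled_var = (high.var(axis=0) * high.shape[0]
                  + low.var(axis=0) * low.shape[0]) / errors.shape[0]
    return np.abs(high.mean(axis=0) - low.mean(axis=0)) / np.sqrt(pooled_var + 1e-12)


def identify_novel_cells(train, test, n_latent=3, n_features=8, max_rounds=10):
    """Return a boolean mask over test cells; True marks a putative novel cell."""
    mean, components = fit_linear_autoencoder(train, n_latent)
    errors = reconstruction_errors(test, mean, components)
    labels = two_means_1d(errors.mean(axis=1))  # initial split on all genes
    for _ in range(max_rounds):
        if not labels.any() or labels.all():
            break  # only one group found, nothing to refine
        # Keep the genes whose errors best separate the current groups,
        # then recluster the cells using only those genes.
        top = np.argsort(separation(errors, labels))[-n_features:]
        new_labels = two_means_1d(errors[:, top].mean(axis=1))
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


def simulate(seed=0):
    """Toy data: known cells lie near a 3-D subspace; novel cells are shifted."""
    rng = np.random.default_rng(seed)
    loadings = rng.normal(size=(3, 40))

    def draw(n):
        return rng.normal(size=(n, 3)) @ loadings + rng.normal(scale=0.1, size=(n, 40))

    train, known, novel = draw(300), draw(100), draw(30)
    novel[:, :8] += 5.0  # genes 0-7 carry the simulated novel-type signal
    test = np.vstack([known, novel])
    truth = np.r_[np.zeros(100, dtype=bool), np.ones(30, dtype=bool)]
    return train, test, truth


train, test, truth = simulate()
flagged = identify_novel_cells(train, test)
print("agreement with simulated truth:", (flagged == truth).mean())
```

In this sketch, cells that the model reconstructs poorly form the high-error group, and restricting the score to the most discriminating genes is what lets each round sharpen the split. The parameters `n_latent`, `n_features` and `max_rounds` are illustrative defaults and are not taken from the paper.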

Subject(s)

Gene Expression Profiling; Single-Cell Analysis; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; RNA-Seq; Gene Expression Profiling/methods; Software; Machine Learning

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Gene Expression Profiling / Single-Cell Analysis | Study type: Prognostic_studies | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2022 | Document type: Article | Country of affiliation: United States