A natural language processing system for the efficient extraction of cell markers.

Cheng, Peng; Peng, Yan; Zhang, Xiao-Ling; Chen, Sheng; Fang, Bin-Bin; Li, Yan-Ze; Sun, Yi-Min

Affiliation

Cheng P; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
Peng Y; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
Zhang XL; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
Chen S; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
Fang BB; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
Li YZ; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yanzeli@capitalbiotech.com.
Sun YM; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yiminsun_pub@capitalbiotech.com.

Sci Rep; 14(1): 21183, 2024-09-11.

Article in English | MEDLINE | ID: mdl-39261578

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes, relying heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the compiled brain tissue marker gene list from MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with original studies.

Conclusions:

Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27323 manually reviewed sentences used for training MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.

Subjects

Biomarkers; Natural Language Processing; Single-Cell Analysis; Humans; Animals; Single-Cell Analysis/methods; Mice; Sequence Analysis, RNA/methods; Databases, Genetic; Computational Biology/methods

Keywords

Cell marker; Natural language processing; ScRNA-seq

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Natural Language Processing / Biomarkers / Single-Cell Analysis | Limits: Animals / Humans | Language: English | Year of publication: 2024 | Document type: Article
