A natural language processing system for the efficient extraction of cell markers.
Cheng, Peng; Peng, Yan; Zhang, Xiao-Ling; Chen, Sheng; Fang, Bin-Bin; Li, Yan-Ze; Sun, Yi-Min.
Affiliation
  • Cheng P; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Peng Y; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Zhang XL; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Chen S; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Fang BB; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Li YZ; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yanzeli@capitalbiotech.com.
  • Sun YM; Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yiminsun_pub@capitalbiotech.com.
Sci Rep; 14(1): 21183, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39261578
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes, relying heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the compiled brain tissue marker gene list from MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with original studies.

Conclusions:

Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27323 manually reviewed sentences used to train MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Natural Language Processing / Biomarkers / Single-Cell Analysis Limits: Animals / Humans Language: En Year of publication: 2024 Document type: Article