BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework.

Zheng, Xiangwen; Du, Haijian; Luo, Xiaowei; Tong, Fan; Song, Wei; Zhao, Dongsheng

Affiliation

Zheng X; Academy of Military Medical Sciences, Beijing, 100039, China.
Du H; Academy of Military Medical Sciences, Beijing, 100039, China.
Luo X; Academy of Military Medical Sciences, Beijing, 100039, China.
Tong F; Academy of Military Medical Sciences, Beijing, 100039, China.
Song W; Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China.
Zhao D; Academy of Military Medical Sciences, Beijing, 100039, China. dszhao@bmi.ac.cn.

BMC Bioinformatics ; 23(1): 501, 2022 Nov 22.

Article in English | MEDLINE | ID: mdl-36418937

ABSTRACT

BACKGROUND:

Automatic and accurate recognition of biomedical named entities in the literature is an important task in biomedical text mining and the foundation for extracting biomedical knowledge from unstructured text into structured formats. Biomedical named entity recognition (BioNER) is currently most often implemented with a sequence labeling framework and deep neural networks. However, this approach often underutilizes syntactic features such as the dependencies and topology of sentences. Integrating semantic and syntactic features into BioNER models is therefore a pressing problem.

RESULTS:

In this paper, we propose a novel biomedical named entity recognition model, BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation introduces more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we segment sentences at periods and segment words at spaces and symbols. Second, contextual features are encoded by BioBERT, while syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network then generates a fused representation that considers both the contextual and the syntactic features. Last, a softmax function calculates the probabilities and produces the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing state-of-the-art BioNER methods on all of them, achieving F1-scores of 85.15% on BC2GM, 78.16% on JNLPBA, 92.97% on BC4CHEMD, 94.74% on BC5CDR-chem, 87.74% on BC5CDR-disease, 91.57% on NCBI-disease, 75.01% on Species-800, and 90.99% on LINNAEUS.
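The node classification view described above can be illustrated with a small toy sketch. It is not the authors' implementation: the token features are random stand-ins for BioBERT encodings, the dependency edges are written by hand rather than parsed by SpaCy, the graph is treated as undirected with self-loops (an assumption), the weights are untrained, and part-of-speech features are left out. All names (`gat_layer`, `build_neighbors`, `dep_edges`, the B/I/O label set) are hypothetical. It only shows how one single-head graph attention layer lets each word aggregate information from its syntactic neighbours before a softmax assigns label probabilities to each node.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def build_neighbors(n, edges):
    """Undirected adjacency lists with self-loops (an assumption of this sketch)."""
    nbrs = [{i} for i in range(n)]
    for head, dep in edges:
        nbrs[head].add(dep)
        nbrs[dep].add(head)
    return [sorted(s) for s in nbrs]


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention: each node takes an attention-weighted
    average of its neighbours' projected features. The output
    nonlinearity used in full GAT layers is omitted for brevity."""
    projected = [matvec(W, h) for h in features]
    out = []
    for i, nbrs in enumerate(neighbors):
        # Attention score for each neighbour j: LeakyReLU(a . [Wh_i || Wh_j])
        scores = [
            leaky_relu(sum(w * v for w, v in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        alphas = softmax(scores)
        dim = len(projected[i])
        out.append([
            sum(alpha * projected[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(dim)
        ])
    return out


# Hypothetical toy sentence; edges are (head index, dependent index).
tokens = ["BRCA1", "mutations", "cause", "cancer"]
dep_edges = [(1, 0), (2, 1), (2, 3)]
labels = ["B", "I", "O"]  # hypothetical tag set

rng = random.Random(0)
in_dim, hid_dim = 6, 4
# Random vectors stand in for contextual (BioBERT-style) token features.
features = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in tokens]
W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hid_dim)]
a = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(hid_dim)] for _ in labels]

neighbors = build_neighbors(len(tokens), dep_edges)
hidden = gat_layer(features, neighbors, W, a)
# Node classification: one probability distribution over labels per token.
label_probs = [softmax(matvec(W_out, h)) for h in hidden]

for tok, probs in zip(tokens, label_probs):
    print(tok, dict(zip(labels, (round(p, 3) for p in probs))))
```

Because the weights are random, the printed probabilities carry no meaning; the point is only that "cancer" attends to "cause" through the dependency edge regardless of how far apart the two words sit in the sequence.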

CONCLUSION:

The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model and indicate that formulating the BioNER task as a node classification problem and incorporating syntactic features into the graph attention network can significantly improve model performance.

Subjects

Language; Semantics; Speech; Knowledge; Benchmarking

Keywords

BioBERT; Biomedical named entity recognition; Contextual features; Graph attention network; SpaCy; Syntactic features; Text mining

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Semantics / Language Study type: Prognostic_studies Language: En Year of publication: 2022 Document type: Article