A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature.

Huang, Dao-Ling; Zeng, Quanlei; Xiong, Yun; Liu, Shuixia; Pang, Chaoqun; Xia, Menglei; Fang, Ting; Ma, Yanli; Qiang, Cuicui; Zhang, Yi; Zhang, Yu; Li, Hong; Yuan, Yuying

Affiliation

Huang DL; BGI Research, Shenzhen, 518083, China. dlhuang1217@gmail.com.
Zeng Q; Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China. dlhuang1217@gmail.com.
Xiong Y; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Liu S; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Pang C; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Xia M; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Fang T; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Ma Y; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Qiang C; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Zhang Y; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Zhang Y; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Li H; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
Yuan Y; BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.

Interdiscip Sci ; 16(2): 333-344, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38340264

ABSTRACT

We report a study that combines manual annotation with deep-learning natural language processing to extract entities accurately from biomedical literature on hereditary diseases. A total of 400 full-text articles were manually annotated, following published guidelines, by experienced genetic interpreters at Beijing Genomics Institute (BGI). The quality of our manual annotations was assessed by comparing our re-annotated results with publicly available annotations. The overall Jaccard index was 0.866 across the four entity types: gene, variant, disease and species. A BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were each trained, validated and tested. Because the manually annotated corpus is limited, both NER models were fine-tuned in two phases. The F1-scores of the BERT-based NER model for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of the DistilBERT-based NER model are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the variant entity type has been extracted by a large language model for the first time, with an F1-score comparable to that of the state-of-the-art variant extraction model tmVar.
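
The Jaccard index mentioned in the abstract measures the overlap between two annotation sets as the size of their intersection divided by the size of their union. The abstract does not say how annotations were matched, so the following is only a minimal sketch under an assumed convention: each annotation is an exact-match tuple of document ID, start offset, end offset and entity type. The `Entity` alias, the `jaccard_index` function and the sample data are all hypothetical and are not taken from the study.

```python
from typing import Set, Tuple

# Hypothetical representation: (document ID, start offset, end offset, entity type).
Entity = Tuple[str, int, int, str]


def jaccard_index(a: Set[Entity], b: Set[Entity]) -> float:
    """Return |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative data only, not from the annotated corpus.
ours = {
    ("doc1", 0, 5, "gene"),
    ("doc1", 10, 18, "variant"),
    ("doc1", 30, 42, "disease"),
}
public = {
    ("doc1", 0, 5, "gene"),
    ("doc1", 10, 18, "variant"),
    ("doc1", 50, 55, "species"),
}

# Two shared annotations out of four distinct ones.
score = jaccard_index(ours, public)
print(score)
```

Under this exact-match assumption, an annotation whose boundaries differ by a single character counts as a disagreement; a relaxed (overlap-based) matching rule would yield a higher index on the same data.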

Subjects

Deep Learning; Natural Language Processing; Humans; Genetic Diseases, Inborn/genetics; Publications; Data Mining/methods

Keywords

Data mining; Genomics; Name entity recognition; Natural language processing

Full text: 1 Collections: 01-internacional Database: MEDLINE Main subject: Natural Language Processing / Deep Learning Type of study: Prognostic_studies Limit: Humans Language: En Journal: Interdiscip Sci Journal subject: BIOLOGY Year of publication: 2024 Document type: Article Country of affiliation: China
