Exploiting and assessing multi-source data for supervised biomedical named entity recognition.

Galea, Dieter; Laponogov, Ivan; Veselkov, Kirill

Galea, Dieter; Laponogov, Ivan; Veselkov, Kirill.

Afiliação

Galea D; Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK.
Laponogov I; Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK.
Veselkov K; Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK.

Bioinformatics ; 34(14): 2474-2482, 2018 07 15.

Article em En | MEDLINE | ID: mdl-29538614

RESUMO

Motivation: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. Results: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model 'overtraining') which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data. Availability and implementation: Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner. Supplementary information: Supplementary data are available at Bioinformatics online.

Assuntos

Mineração de Dados/métodos; Bases de Dados Factuais; Processamento de Linguagem Natural; Aprendizado de Máquina Supervisionado; PubMed

Texto completo

Imprimir

XML

PubMed Links

Buscar no Google

Texto completo: 1 Coleções: 01-internacional Base de dados: MEDLINE Assunto principal: Processamento de Linguagem Natural / Bases de Dados Factuais / Mineração de Dados / Aprendizado de Máquina Supervisionado Idioma: En Revista: Bioinformatics Assunto da revista: INFORMATICA MEDICA Ano de publicação: 2018 Tipo de documento: Article

Texto completo

Imprimir

XML

PubMed Links

Buscar no Google