Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition.

Wang, Meiqi; Vijayaraghavan, Avish; Beck, Tim; Posma, Joram M

Affiliation

Wang M; Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
Vijayaraghavan A; Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
Beck T; UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
Posma JM; School of Medicine, University of Nottingham, Biodiscovery Institute, Nottingham NG7 2RD, U.K.

J Proteome Res ; 23(6): 1915-1925, 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38733346

ABSTRACT

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT 0.89, SciBERT 0.88) and recall (0.86), with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
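As an illustration only, the combination of dictionary matching and rule-based keyword searching described in the abstract could be sketched roughly as follows. This is not the authors' pipeline: the three-entry dictionary, the "-ase" suffix rule, the stop list, and the function names `annotate` and `f1_score` are all assumptions made for the example. The `f1_score` helper simply applies the standard harmonic-mean formula, which reproduces the reported pipeline F1-score of about 0.86 from precision 1.00 and recall 0.76.

```python
import re

# Hypothetical mini-dictionary; the real dictionary is not given in the abstract.
ENZYME_DICTIONARY = {"trypsin", "pepsin", "lysozyme"}

# Hypothetical rule: words ending in "ase"/"ases" are candidate enzyme mentions.
SUFFIX_RULE = re.compile(r"\b[A-Za-z][A-Za-z-]*ases?\b")

# Hypothetical stop list of common non-enzyme words that match the suffix rule.
STOP_WORDS = {
    "base", "bases", "case", "cases", "phase", "phases",
    "disease", "diseases", "increase", "increases",
    "decrease", "decreases", "database", "databases",
    "release", "releases", "purchase",
}


def annotate(text):
    """Return sorted (start, end, mention) spans found by either strategy."""
    spans = set()
    # Dictionary matching: case-insensitive whole-word lookup.
    for term in ENZYME_DICTIONARY:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.add((m.start(), m.end(), m.group()))
    # Rule-based keyword search: suffix pattern filtered by the stop list.
    for m in SUFFIX_RULE.finditer(text):
        if m.group().lower() not in STOP_WORDS:
            spans.add((m.start(), m.end(), m.group()))
    return sorted(spans)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "Trypsin and alcohol dehydrogenase levels increase in this disease."
    for start, end, mention in annotate(sentence):
        print(start, end, mention)
    print(round(f1_score(1.00, 0.76), 2))
```

A rule of this kind favours precision over recall, since it only fires on patterns it has been told about; that is consistent with the high-precision, lower-recall profile the abstract reports for the annotation pipeline, although the actual rules used may differ substantially.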

Subject(s)

Algorithms; Deep Learning; Enzymes; Natural Language Processing; Molecular Sequence Annotation/methods; Humans; Data Mining/methods

Keywords

biomedical natural language processing; deep learning; named entity recognition

Full text: 1 Database: MEDLINE Main subject: Algorithms / Natural Language Processing / Enzymes / Deep Learning Limits: Humans Language: English Journal: J Proteome Res Journal subject: BIOCHEMISTRY Year: 2024 Document type: Article Country of affiliation: United Kingdom
