Pesquisa | Portal Regional da BVS

The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.

Islamaj, Rezarta; Wei, Chih-Hsuan; Lai, Po-Ting; Luo, Ling; Coss, Cathleen; Gokal Kochar, Preeti; Miliaras, Nicholas; Rodionov, Oleg; Sekiya, Keiko; Trinh, Dorothy; Whitman, Deborah; Lu, Zhiyong.

Database (Oxford) ; 20242024 Aug 09.

Artigo em Inglês | MEDLINE | ID: mdl-39126204

RESUMO

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Assuntos

Curadoria de Dados , Humanos , Curadoria de Dados/métodos , Mineração de Dados/métodos , Semântica , PubMed

NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles.

Islamaj, Rezarta; Leaman, Robert; Cissel, David; Coss, Cathleen; Denicola, Joseph; Fisher, Carol; Guzman, Rob; Kochar, Preeti Gokal; Miliaras, Nicholas; Punske, Zoe; Sekiya, Keiko; Trinh, Dorothy; Whitman, Deborah; Schmidt, Susan; Lu, Zhiyong.

Database (Oxford) ; 20222022 12 01.

Artigo em Inglês | MEDLINE | ID: mdl-36458799

RESUMO

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, equipped with chemical substance indexing by manual experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document. This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.

Assuntos

Algoritmos , Mineração de Dados , Estados Unidos , Humanos , National Library of Medicine (U.S.) , PubMed , Bases de Dados Factuais

NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature.

Islamaj, Rezarta; Leaman, Robert; Kim, Sun; Kwon, Dongseop; Wei, Chih-Hsuan; Comeau, Donald C; Peng, Yifan; Cissel, David; Coss, Cathleen; Fisher, Carol; Guzman, Rob; Kochar, Preeti Gokal; Koppel, Stella; Trinh, Dorothy; Sekiya, Keiko; Ward, Janice; Whitman, Deborah; Schmidt, Susan; Lu, Zhiyong.

Sci Data ; 8(1): 91, 2021 03 25.

Artigo em Inglês | MEDLINE | ID: mdl-33767203

RESUMO

Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.

Assuntos

Mineração de Dados/métodos , Compostos Orgânicos/classificação , Preparações Farmacêuticas/classificação , Software , Terminologia como Assunto , PubMed

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA