Search | BVS Bolivia

1.

The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.

Islamaj, Rezarta; Wei, Chih-Hsuan; Lai, Po-Ting; Luo, Ling; Coss, Cathleen; Gokal Kochar, Preeti; Miliaras, Nicholas; Rodionov, Oleg; Sekiya, Keiko; Trinh, Dorothy; Whitman, Deborah; Lu, Zhiyong.

Database (Oxford) ; 2024, 2024 Aug 09.

Article in English | MEDLINE | ID: mdl-39126204

ABSTRACT

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. 
All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
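The annotation scheme described in this abstract (normalized entities, document-level entity pairs, a semantic category, and a novelty flag) can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names are hypothetical, the identifiers are placeholders, and it is not the corpus's actual file format.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # The eight semantic categories named in the abstract.
    POSITIVE_CORRELATION = "positive correlation"
    NEGATIVE_CORRELATION = "negative correlation"
    BINDING = "binding"
    CONVERSION = "conversion"
    DRUG_INTERACTION = "drug interaction"
    COMPARISON = "comparison"
    COTREATMENT = "cotreatment"
    ASSOCIATION = "association"


@dataclass(frozen=True)
class Entity:
    entity_type: str  # e.g. "disease", "gene", "chemical", "variant"
    identifier: str   # normalized concept ID (placeholder values in this sketch)


@dataclass(frozen=True)
class Relation:
    document_id: str  # relations attach to the whole document, not one sentence
    first: Entity
    second: Entity
    relation_type: RelationType
    novel: bool       # True for a novel finding, False for background knowledge


# Entity-type pairs listed in the abstract; frozenset makes them order-free,
# and frozenset(["gene"]) stands for the gene-gene pair.
ANNOTATED_PAIRS = {
    frozenset(["disease", "gene"]),
    frozenset(["chemical", "gene"]),
    frozenset(["disease", "variant"]),
    frozenset(["gene"]),
    frozenset(["chemical", "disease"]),
    frozenset(["chemical"]),
    frozenset(["chemical", "variant"]),
    frozenset(["variant"]),
}


def is_annotated_pair(relation: Relation) -> bool:
    """Check that the two entity types form one of the listed pair types."""
    types = frozenset([relation.first.entity_type, relation.second.entity_type])
    return types in ANNOTATED_PAIRS


# Hypothetical record with placeholder identifiers.
example = Relation(
    document_id="doc-1",
    first=Entity("chemical", "MESH:PLACEHOLDER_1"),
    second=Entity("disease", "MESH:PLACEHOLDER_2"),
    relation_type=RelationType.NEGATIVE_CORRELATION,
    novel=True,
)
```

Storing the pair types as unordered sets reflects the assumption that a chemical-disease relation and a disease-chemical relation are the same annotation.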

Subject(s)

Data Curation , Humans , Data Curation/methods , Data Mining/methods , Semantics , PubMed

2.

The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII.

Islamaj, Rezarta; Lai, Po-Ting; Wei, Chih-Hsuan; Luo, Ling; Almeida, Tiago; Jonker, Richard A A; Conceição, Sofia I R; Sousa, Diana F; Phan, Cong-Phuoc; Chiang, Jung-Hsien; Li, Jiru; Pan, Dinghao; Meesawad, Wilailack; Tsai, Richard Tzong-Han; Sarol, M Janina; Hong, Gibong; Valiev, Airat; Tutubalina, Elena; Lee, Shao-Man; Hsu, Yi-Yu; Li, Mingjie; Verspoor, Karin; Lu, Zhiyong.

Database (Oxford) ; 2024, 2024 Aug 08.

Article in English | MEDLINE | ID: mdl-39114977

ABSTRACT

The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ ; https://codalab.lisn.upsaclay.fr/competitions/13377 ; https://codalab.lisn.upsaclay.fr/competitions/13378.
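The four F-scores reported per subtask correspond to progressively stricter matching criteria (pair only, pair plus type, pair plus novelty, and everything together). A minimal sketch of how such layered scoring could be computed is shown below. It assumes exact matching on unordered entity pairs, and the names `Rel`, `project` and `prf` are hypothetical; the official evaluation script may differ in its details.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Rel(NamedTuple):
    doc_id: str
    entity_a: str   # normalized identifier of the first entity
    entity_b: str   # normalized identifier of the second entity
    rel_type: str   # semantic category, e.g. "association"
    novelty: str    # e.g. "novel" or "no"


def project(rels: Iterable[Rel], level: str) -> Set[tuple]:
    """Reduce each relation to the fields compared at the given level."""
    keys = set()
    for r in rels:
        # Sort the two identifiers so that (A, B) and (B, A) compare equal.
        pair = (r.doc_id, *sorted((r.entity_a, r.entity_b)))
        if level == "pair":
            keys.add(pair)
        elif level == "type":
            keys.add(pair + (r.rel_type,))
        elif level == "novelty":
            keys.add(pair + (r.novelty,))
        elif level == "all":
            keys.add(pair + (r.rel_type, r.novelty))
        else:
            raise ValueError(f"unknown level: {level}")
    return keys


def prf(gold: Set[tuple], pred: Set[tuple]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exact-match sets."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: the second prediction finds the right pair but the wrong type.
gold = [
    Rel("doc1", "A", "B", "association", "novel"),
    Rel("doc1", "A", "C", "binding", "no"),
]
pred = [
    Rel("doc1", "B", "A", "association", "novel"),
    Rel("doc1", "A", "C", "association", "no"),
]

scores = {
    level: prf(project(gold, level), project(pred, level))[2]
    for level in ("pair", "type", "novelty", "all")
}
```

In this toy case the pair-level and novelty-level F1 are 1.0 while the type-level and comprehensive F1 drop to 0.5, which mirrors why the comprehensive score in the abstract is the lowest of the four.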

Subject(s)

Data Mining , Natural Language Processing , Humans , Data Mining/methods , Databases, Factual , Semantics

3.

Correspondence on "Comparison of literature mining tools for variant classification: Through the lens of 50 RYR1 variants" by Wermers et al.

Wei, Chih-Hsuan; Phan, Lon; Hefferon, Timothy; Landrum, Melissa; Rehm, Heidi L; Lu, Zhiyong.

Genet Med ; : 101208, 2024 Jul 05.

Article in English | MEDLINE | ID: mdl-38973600

4.

EnzChemRED, a rich enzyme chemistry relation extraction dataset.

Lai, Po-Ting; Coudert, Elisabeth; Aimo, Lucila; Axelsen, Kristian; Breuza, Lionel; de Castro, Edouard; Feuermann, Marc; Morgat, Anne; Pourcel, Lucille; Pedruzzi, Ivo; Poux, Sylvain; Redaschi, Nicole; Rivoire, Catherine; Sveshnikova, Anastasia; Wei, Chih-Hsuan; Leaman, Robert; Luo, Ling; Lu, Zhiyong; Bridge, Alan.

ArXiv ; 2024 Apr 22.

Article in English | MEDLINE | ID: mdl-38903736

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

5.

GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases.

Wang, Zhizheng; Jin, Qiao; Wei, Chih-Hsuan; Tian, Shubo; Lai, Po-Ting; Zhu, Qingqing; Day, Chi-Ping; Ross, Christina; Lu, Zhiyong.

ArXiv ; 2024 May 25.

Article in English | MEDLINE | ID: mdl-38903746

ABSTRACT

Gene set knowledge discovery is essential for advancing human functional genomics. Recent studies have shown promising performance by harnessing the power of Large Language Models (LLMs) on this task. Nonetheless, their results are subject to several limitations common in LLMs such as hallucinations. In response, we present GeneAgent, a first-of-its-kind language agent featuring self-verification capability. It autonomously interacts with various biological databases and leverages relevant domain knowledge to improve accuracy and reduce hallucination occurrences. Benchmarking on 1,106 gene sets from different sources, GeneAgent consistently outperforms standard GPT-4 by a significant margin. Moreover, a detailed manual review confirms the effectiveness of the self-verification module in minimizing hallucinations and generating more reliable analytical narratives. To demonstrate its practical utility, we apply GeneAgent to seven novel gene sets derived from mouse B2905 melanoma cell lines, with expert evaluations showing that GeneAgent offers novel insights into gene functions and subsequently expedites knowledge discovery.

6.

PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge.

Wei, Chih-Hsuan; Allot, Alexis; Lai, Po-Ting; Leaman, Robert; Tian, Shubo; Luo, Ling; Jin, Qiao; Wang, Zhizheng; Chen, Qingyu; Lu, Zhiyong.

Nucleic Acids Res ; 52(W1): W540-W546, 2024 Jul 05.

Article in English | MEDLINE | ID: mdl-38572754

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

Subject(s)

PubMed , Artificial Intelligence , Humans , Software , Data Mining/methods , Semantics , Internet

7.

Advancing entity recognition in biomedicine via instruction tuning of large language models.

Keloth, Vipina K; Hu, Yan; Xie, Qianqian; Peng, Xueqing; Wang, Yan; Zheng, Andrew; Selek, Melih; Raja, Kalpana; Wei, Chih Hsuan; Jin, Qiao; Lu, Zhiyong; Chen, Qingyu; Xu, Hua.

Bioinformatics ; 40(4), 2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38514400

ABSTRACT

MOTIVATION: Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in their ability for zero/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning tasks. RESULTS: We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines the training and evaluation process by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing on BioNER-LLaMA across three widely recognized biomedical NER datasets, consisting of entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved higher F1-scores ranging from 5% to 30% compared to the few-shot learning capabilities of GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, biomedical-specific language model. Our findings underscore the potential of our proposed paradigm in developing general-domain LLMs that can rival SOTA performances in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION: Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
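The core idea of recasting NER from sequence labeling into text generation can be illustrated by converting a BIO-tagged sentence into an instruction/input/output record. This is a sketch under assumptions: the prompt wording, the record keys, and the function names are hypothetical and are not taken from the BioNER-LLaMA repository.

```python
from typing import Dict, List


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect entity mentions from BIO tags (single entity type assumed)."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close any mention in progress.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            # An "O" tag (or a stray "I-") ends the current mention.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def to_instruction_example(
    tokens: List[str], tags: List[str], entity_type: str
) -> Dict[str, str]:
    """Turn one labeled sentence into a generation-style training record."""
    spans = bio_to_spans(tokens, tags)
    return {
        # Hypothetical prompt template, for illustration only.
        "instruction": (
            f"List every {entity_type} mention in the text, "
            "separated by '; '. Answer 'none' if there are none."
        ),
        "input": " ".join(tokens),
        "output": "; ".join(spans) if spans else "none",
    }


tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
tags = ["O", "O", "O", "O", "B-Disease", "I-Disease", "O"]
record = to_instruction_example(tokens, tags, "disease")
```

Because the conversion is purely mechanical, existing token-labeled datasets can be repurposed automatically, which is the property the abstract highlights.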

Subject(s)

Camelids, New World , Deep Learning , Animals , Language , Natural Language Processing

8.

PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge.

Wei, Chih-Hsuan; Allot, Alexis; Lai, Po-Ting; Leaman, Robert; Tian, Shubo; Luo, Ling; Jin, Qiao; Wang, Zhizheng; Chen, Qingyu; Lu, Zhiyong.

ArXiv ; 2024 Jan 19.

Article in English | MEDLINE | ID: mdl-38410657

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

9.

GNorm2: an improved gene name recognition and normalization system.

Wei, Chih-Hsuan; Luo, Ling; Islamaj, Rezarta; Lai, Po-Ting; Lu, Zhiyong.

Bioinformatics ; 39(10), 2023 Oct 03.

Article in English | MEDLINE | ID: mdl-37878810

ABSTRACT

MOTIVATION: Gene name normalization is an important yet highly complex task in biomedical text mining research, as gene names can be highly ambiguous and may refer to different genes in different species or share similar names with other bioconcepts. This poses a challenge for accurately identifying and linking gene mentions to their corresponding entries in databases such as NCBI Gene or UniProt. While there has been a body of literature on the gene normalization task, few have addressed all of these challenges or make their solutions publicly available to the scientific community. RESULTS: Building on the success of GNormPlus, we have created GNorm2: a more advanced tool with optimized functions and improved performance. GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date. Our tool is freely available for download. AVAILABILITY AND IMPLEMENTATION: https://github.com/ncbi/GNorm2.

Subject(s)

Data Mining , Data Mining/methods , Databases, Factual

10.

BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets.

Lai, Po-Ting; Wei, Chih-Hsuan; Luo, Ling; Chen, Qingyu; Lu, Zhiyong.

J Biomed Inform ; 146: 104487, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37673376

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

11.

BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets.

Lai, Po-Ting; Wei, Chih-Hsuan; Luo, Ling; Chen, Qingyu; Lu, Zhiyong.

ArXiv ; 2023 Jun 19.

Article in English | MEDLINE | ID: mdl-37502629

ABSTRACT

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

12.

Tracking genetic variants in the biomedical literature using LitVar 2.0.

Allot, Alexis; Wei, Chih-Hsuan; Phan, Lon; Hefferon, Timothy; Landrum, Melissa; Rehm, Heidi L; Lu, Zhiyong.

Nat Genet ; 55(6): 901-903, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37268776

13.

AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning.

Luo, Ling; Wei, Chih-Hsuan; Lai, Po-Ting; Leaman, Robert; Chen, Qingyu; Lu, Zhiyong.

Bioinformatics ; 39(5), 2023 May 04.

Article in English | MEDLINE | ID: mdl-37171899

ABSTRACT

MOTIVATION: Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS: We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION: The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.

Subject(s)

Deep Learning , Data Mining/methods , Software , Language , PubMed

14.

Bioformer: an efficient transformer language model for biomedical text mining.

Fang, Li; Chen, Qingyu; Wei, Chih-Hsuan; Lu, Zhiyong; Wang, Kai.

ArXiv ; 2023 Feb 03.

Article in English | MEDLINE | ID: mdl-36945685

ABSTRACT

Pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance in natural language processing (NLP) tasks. Recently, BERT has been adapted to the biomedical domain. Despite the effectiveness, these models have hundreds of millions of parameters and are computationally expensive when applied to large-scale NLP applications. We hypothesized that the number of parameters of the original BERT can be dramatically reduced with minor impact on performance. In this study, we present Bioformer, a compact BERT model for biomedical text mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L) which reduced the model size by 60% compared to BERTBase. Bioformer uses a biomedical vocabulary and was pre-trained from scratch on PubMed abstracts and PubMed Central full-text articles. We thoroughly evaluated the performance of Bioformer as well as existing biomedical BERT models including BioBERT and PubMedBERT on 15 benchmark datasets of four different biomedical NLP tasks: named entity recognition, relation extraction, question answering and document classification. The results show that with 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT while Bioformer8L is 0.9% less accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed BioBERTBase-v1.1. In addition, Bioformer16L and Bioformer8L are 2-3 fold as fast as PubMedBERT/BioBERTBase-v1.1. Bioformer has been successfully deployed to PubTator Central providing gene annotations over 35 million PubMed abstracts and 5 million PubMed Central full-text articles. We make Bioformer publicly available via https://github.com/WGLab/bioformer, including pre-trained models, datasets, and instructions for downstream use.

15.

LitCovid in 2022: an information resource for the COVID-19 literature.

Chen, Qingyu; Allot, Alexis; Leaman, Robert; Wei, Chih-Hsuan; Aghaarabi, Elaheh; Guerrerio, John J; Xu, Lilly; Lu, Zhiyong.

Nucleic Acids Res ; 51(D1): D1512-D1518, 2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36350613

ABSTRACT

LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), first launched in February 2020, is a first-of-its-kind literature hub for tracking up-to-date published research on COVID-19. The number of articles in LitCovid has increased from 55 000 to ~300 000 over the past 2.5 years, with a consistent growth rate of ~10 000 articles per month. In addition to the rapid literature growth, the COVID-19 pandemic has evolved dramatically. For instance, the Omicron variant has now accounted for over 98% of new infections in the United States. In response to the continuing evolution of the COVID-19 pandemic, this article describes significant updates to LitCovid over the last 2 years. First, we introduced the long Covid collection consisting of the articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations on the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19. LitCovid has been widely used with millions of accesses by users worldwide on various information needs and continues to play a critical role in collecting, curating and standardizing the latest knowledge on the COVID-19 literature.

Subject(s)

COVID-19 , Databases, Bibliographic , Humans , COVID-19/epidemiology , Pandemics , Post-Acute COVID-19 Syndrome , SARS-CoV-2 , United States

16.

Assigning species information to corresponding genes by a sequence labeling framework.

Luo, Ling; Wei, Chih-Hsuan; Lai, Po-Ting; Chen, Qingyu; Islamaj, Rezarta; Lu, Zhiyong.

Database (Oxford) ; 2022, 2022 Oct 13.

Article in English | MEDLINE | ID: mdl-36227127

ABSTRACT

The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an identifier by a text-mining algorithm. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to identify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence labeling task such that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method for the species assignment task (from 65.8% to 81.3% in accuracy). The source code and data for species assignment are freely available. Database URL: https://github.com/ncbi/SpeciesAssignment.

Subject(s)

Data Mining , Software , Algorithms , Benchmarking , Data Mining/methods , Databases, Factual

17.

tmVar 3.0: an improved variant concept recognition and normalization tool.

Wei, Chih-Hsuan; Allot, Alexis; Riehle, Kevin; Milosavljevic, Aleksandar; Lu, Zhiyong.

Bioinformatics ; 38(18): 4449-4451, 2022 Sep 15.

Article in English | MEDLINE | ID: mdl-35904569

ABSTRACT

MOTIVATION: Previous studies have shown that automated text-mining tools are becoming increasingly important for successfully unlocking variant information in scientific literature at large scale. Despite multiple attempts in the past, existing tools are still of limited recognition scope and precision. RESULT: We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g. allele and copy number variants), and groups together different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download. AVAILABILITY AND IMPLEMENTATION: https://github.com/ncbi/tmVar3.

Subject(s)

Data Mining , Publications , PubMed , Genomics

18.

A sequence labeling framework for extracting drug-protein relations from biomedical literature.

Luo, Ling; Lai, Po-Ting; Wei, Chih-Hsuan; Lu, Zhiyong.

Database (Oxford) ; 2022, 2022 Jul 19.

Article in English | MEDLINE | ID: mdl-35856889

ABSTRACT

Automatically extracting interactions between chemical compounds/drugs and genes/proteins is significantly beneficial to drug discovery, drug repurposing, drug design and biomedical knowledge graph construction. To promote the development of relation extraction between drugs and proteins, the BioCreative VII challenge organized the DrugProt track. This paper describes the approach we developed for this task. In addition to the conventional text classification framework that has been widely used in relation extraction tasks, we propose a sequence labeling framework for drug-protein relation extraction. We first comprehensively compared the cutting-edge biomedical pre-trained language models for both frameworks. Then, we explored several ensemble methods to further improve the final performance. In the evaluation of the challenge, our best submission (i.e. the ensemble of models in two frameworks via majority voting) achieved an F1-score of 0.795 on the official test set. Further, we found that the sequence labeling framework is more efficient and achieves better performance than the text classification framework. Finally, our ensemble of the sequence labeling models with majority voting achieves the best F1-score of 0.800 on the test set. DATABASE URL: https://github.com/lingluodlut/BioCreativeVII_DrugProt.
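The ensemble step mentioned in this abstract combines the outputs of several models by voting. A minimal sketch of label-level majority voting follows. It assumes each model emits one label per candidate drug-protein pair in the same order; the label names and the tie-breaking rule (prefer the earliest model's label) are illustrative choices, not details from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label lists into one list by majority vote.

    predictions[m][i] is the label model m assigns to instance i.
    Ties are broken in favor of the earliest model in the list.
    """
    if not predictions:
        return []
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all models must label the same instances")

    voted: List[str] = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # First label (in model order) that reaches the top vote count.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical labels from three models for three candidate pairs.
model_outputs = [
    ["INHIBITOR", "NONE", "ACTIVATOR"],
    ["INHIBITOR", "SUBSTRATE", "NONE"],
    ["NONE", "SUBSTRATE", "INHIBITOR"],
]
ensemble = majority_vote(model_outputs)
```

With an odd number of models and few label classes, ties are uncommon, but an explicit tie-break keeps the result deterministic.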

Subject(s)

Data Mining , Proteins , Data Mining/methods , Databases, Factual , Language , Publications

19.

BioRED: a rich biomedical relation extraction dataset.

Luo, Ling; Lai, Po-Ting; Wei, Chih-Hsuan; Arighi, Cecilia N; Lu, Zhiyong.

Brief Bioinform ; 23(5), 2022 Sep 20.

Article in English | MEDLINE | ID: mdl-35849818

ABSTRACT

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.

Subject(s)

Algorithms , Data Mining , Proteins , PubMed

20.

Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing.

Chen, Qingyu; Leaman, Robert; Allot, Alexis; Luo, Ling; Wei, Chih-Hsuan; Yan, Shankai; Lu, Zhiyong.

Annu Rev Biomed Data Sci ; 4: 313-339, 2021 Jul 20.

Article in English | MEDLINE | ID: mdl-34465169

ABSTRACT

The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.

Subject(s)

COVID-19/epidemiology , Information Storage and Retrieval/methods , Natural Language Processing , Communication , Data Mining/methods , Datasets as Topic , Emotions , Humans , Knowledge Discovery , Pandemics , Periodicals as Topic , Software
