ABSTRACT
BACKGROUND: There has been considerable advancement in AI technologies such as large language models (LLMs) and machine learning to support biomedical knowledge discovery. MAIN BODY: We propose a novel biomedical neural search service called 'VAIV Bio-Discovery', which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions (chemical compound/drug-gene/protein interactions, including drug-target, drug-drug, and drug-disease). To provide comprehensive knowledge, the system offers four search options: basic search, entity search, interaction search, and natural language search. We employ T5slim_dec, which adapts the autoregressive generation task of T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block. The system also assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation (RAG). The search engine is built with a hybrid method that combines neural search with probabilistic BM25 retrieval. CONCLUSION: As a result, our system can better understand the context, semantics, and relationships between terms within a document, enhancing search accuracy. This research contributes to the rapidly evolving biomedical field by introducing a new service for accessing and discovering relevant knowledge.
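As an illustration of the hybrid retrieval idea described above, the following sketch fuses BM25 scores with dense (neural) similarity scores. The toy corpus, encoder checkpoint, and fusion weight alpha are illustrative assumptions rather than the service's actual configuration.

```python
# Minimal sketch of hybrid retrieval: BM25 + neural (dense) similarity.
# Library choices and the fusion weight `alpha` are assumptions for illustration.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

corpus = [
    "Imatinib inhibits the BCR-ABL tyrosine kinase in chronic myeloid leukemia.",
    "Aspirin irreversibly inactivates cyclooxygenase-1 and cyclooxygenase-2.",
    "Metformin lowers hepatic glucose production in type 2 diabetes.",
]

# Sparse (probabilistic) index
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Dense (neural) index -- any sentence encoder could be substituted here
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(corpus, convert_to_tensor=True)

def hybrid_search(query: str, alpha: float = 0.5):
    """Blend min-max normalized BM25 and cosine scores with weight alpha."""
    sparse = np.array(bm25.get_scores(query.lower().split()))
    dense = util.cos_sim(encoder.encode(query, convert_to_tensor=True), doc_vecs)[0].cpu().numpy()
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    fused = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    return sorted(zip(corpus, fused), key=lambda x: -x[1])

for doc, score in hybrid_search("which drug targets BCR-ABL?"):
    print(f"{score:.3f}  {doc}")
```

Min-max normalization before fusion is one common choice; reciprocal rank fusion is another frequently used alternative.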
Subjects
Natural Language Processing, Data Mining/methods, Knowledge Discovery/methods, PubMed, Search Engine, Machine Learning, Information Storage and Retrieval/methods, Neural Networks, Computer
ABSTRACT
INTRODUCTION: The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, handling term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of document topic and word link to disambiguate biomedical abbreviations. METHODS: We propose a link topic model inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics and each topic is characterized by a distribution over words. The most probable expansions of the abbreviations in a given abstract are thus determined by the word-topic, document-topic, and word-link distributions estimated from a document collection through the link topic model. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly between the long-form words of abbreviations and their sententially co-occurring words: a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link, and a new random link parameter is assigned to each word in addition to a topic parameter. Because the link status indicates whether the word forms a link with a given long form, it effectively determines whether a word constitutes a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we constrain the model so that a word has the same topic as a specific long form if it is generated in reference to that long form. Consequently, documents are generated from the two hidden parameters, i.e. topic and link, and the most probable expansion of a specific abbreviation is estimated from these parameters. RESULTS: Our model relaxes the bag-of-words assumption of the standard topic model, in which word order is neglected, and captures a richer structure of text than the standard topic model by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 MEDLINE abstracts with respect to 21 three-letter abbreviations and their 139 distinct long forms.
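To make the generative story concrete, here is a toy sketch in which each word receives a topic and a binary link indicator, and linked words are drawn from a long-form-conditioned distribution while inheriting the long form's topic. The dimensions and distributions are toy assumptions, not the estimated model or its inference procedure.

```python
# Toy generative sketch of the link topic model idea (not the paper's estimator).
# Each word gets a topic z and a link indicator l; if l=1 the word is drawn from a
# distribution conditioned on the abbreviation's long form and shares its topic.
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 50                        # topics, vocabulary size (toy values)
theta = rng.dirichlet(np.ones(K))   # document-topic distribution
phi = rng.dirichlet(np.ones(V), K)  # topic-word distributions
long_form_topic = rng.choice(K, p=theta)  # topic of the long form in this document
psi = rng.dirichlet(np.ones(V))           # word-link distribution for this long form
lam = 0.3                                  # probability that a word links to the long form

def generate_word():
    linked = rng.random() < lam
    if linked:
        # Linked word: same topic as the long form, drawn from the link distribution.
        return rng.choice(V, p=psi), long_form_topic, True
    z = rng.choice(K, p=theta)
    return rng.choice(V, p=phi[z]), z, False

doc = [generate_word() for _ in range(20)]
print(doc[:5])  # (word id, topic, link flag) triples
```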
Subjects
Algorithms, Natural Language Processing, Information Storage and Retrieval, Linguistics, MEDLINE, Medical Subject Headings, Models, Statistical, Probability, Reproducibility of Results, Software, Support Vector Machine, Unified Medical Language System
ABSTRACT
The identification of drug-drug and chemical-protein interactions is essential for understanding unpredictable changes in the pharmacological effects of drugs and the mechanisms of diseases, and for developing therapeutic drugs. In this study, we extract drug-related interactions from the DDI (Drug-Drug Interaction) Extraction-2013 Shared Task dataset and the BioCreative ChemProt (Chemical-Protein) dataset using various pre-trained transformers. We propose BERTGAT, which uses a graph attention network (GAT) to take into account the local structure of sentences and the embedding features of nodes under the self-attention scheme, and we investigate whether incorporating syntactic structure can help relation extraction. In addition, we suggest T5slim_dec, which adapts the autoregressive generation task of T5 (text-to-text transfer transformer) to the relation classification problem by removing the self-attention layer in the decoder block. Furthermore, we evaluate the potential of GPT-3 (Generative Pre-trained Transformer) variant models for biomedical relation extraction. As a result, T5slim_dec, a model with a decoder tailored for classification problems within the T5 architecture, demonstrated very promising performance on both tasks. We achieved an accuracy of 91.15% on the DDI dataset and an accuracy of 94.29% for the CPR (Chemical-Protein Relation) class group in the ChemProt dataset. However, BERTGAT did not show a significant performance improvement for relation extraction. We demonstrate that transformer-based approaches focused only on relationships between words can implicitly understand language well without additional knowledge such as structural information.
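For context on the generation-based framing that T5slim_dec builds on, the sketch below casts DDI-style relation classification as label-string generation with an off-the-shelf T5 checkpoint. It does not reproduce the decoder modification of T5slim_dec, and the entity markers and prompt format are illustrative assumptions.

```python
# Sketch: relation classification as text-to-text generation with T5.
# This shows only the standard generation-based framing; T5slim_dec additionally
# removes the decoder self-attention layer, which is not done here.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Entities are wrapped in marker tokens; after fine-tuning, the model is expected
# to emit one of the DDI relation labels as its output sequence.
sentence = ("ddi relation: <e1> warfarin </e1> plasma levels may be increased "
            "when co-administered with <e2> fluconazole </e2>.")
labels = ["mechanism", "effect", "advise", "int", "negative"]  # DDI-2013 label set

inputs = tokenizer(sentence, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4)
pred = tokenizer.decode(out[0], skip_special_tokens=True)
print("predicted label string:", pred)  # meaningful only after fine-tuning on DDI
```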
ABSTRACT
Accurate Medical Subject Headings (MeSH) annotation is an important issue for researchers in terms of effective information retrieval and knowledge discovery in the biomedical literature. We have developed a powerful dual triggered correspondence topic (DTCT) model for MeSH-annotated articles. In our model, two types of data are assumed to be generated by the same latent topic factors, and the words in abstracts and titles serve as descriptions of the other type, MeSH terms. Our model allows the generation of MeSH terms to be triggered either by general document topics or by document-specific "special" word distributions in a probabilistic manner, allowing a trade-off between the benefits of topic-based abstraction and specific word matching. To relax the topic influence of non-topical or domain-frequent words in the text description, we integrate the discriminative feature of Okapi BM25 into the word sampling probability. This allows the model to choose keywords that stand out from the others in order to generate MeSH terms. We further incorporate prior knowledge about relations between words and MeSH terms into DTCT using the phi coefficient to improve topic coherence. We demonstrate the model's usefulness in automatic MeSH annotation. Our model obtained a 0.62 F-score on a 150,00 MEDLINE test set and showed strength in recall rate. Notably, it yielded competitive performance in an integrated probabilistic environment without additional post-processing for filtering MeSH terms.
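For reference, the two quantities mentioned above have standard definitions; how exactly they enter the word-sampling probability is specific to the DTCT model and is not reproduced here. The standard Okapi BM25 term weight and the phi coefficient between a word w and a MeSH term m (computed from a 2x2 co-occurrence table with counts n_ij) are:

```latex
% Standard Okapi BM25 weight of term t in document d (k_1 and b are the usual free parameters)
\mathrm{BM25}(t,d) = \mathrm{IDF}(t)\cdot
  \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

% Phi coefficient between word w and MeSH term m, from co-occurrence counts n_{ij}
\phi(w,m) = \frac{n_{11}n_{00} - n_{10}n_{01}}
                 {\sqrt{(n_{11}+n_{10})(n_{01}+n_{00})(n_{11}+n_{01})(n_{10}+n_{00})}}
```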
Subjects
Information Storage and Retrieval, Medical Subject Headings, MEDLINE, Probability
ABSTRACT
BACKGROUND: The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, since many useful relations are missing from databases and remain hidden in raw text, the automatic extraction of interactions from text is an important problem in the bioinformatics field. RESULTS: Here, we suggest two kinds of kernel methods for genic interaction extraction that consider the structural aspects of sentences. First, we improve our prior dependency kernel by modifying the kernel function so that it can involve various substructures in terms of (1) e-walks, (2) partial matches, (3) non-contiguous paths, and (4) different significance of substructures. Second, we propose the walk-weighted subsequence kernel to parameterize non-contiguous syntactic structures as well as semantic roles and lexical features, which makes it effective to learn structural aspects from a small amount of training data. Furthermore, we distinguish the significance of parameters such as syntactic locality, semantic roles, and lexical features by varying their weights. CONCLUSIONS: We addressed the genic interaction problem with various dependency kernels and suggested several structural kernel scenarios based on the directed shortest dependency path connecting two entities. Consequently, we obtained promising results on genic interaction data sets with the walk-weighted subsequence kernel. The results are compared using automatically parsed third-party protein-protein interaction (PPI) data as well as PPI data with perfect syntactic labels.
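To illustrate the mechanism of gap-penalized subsequence matching that such kernels build on, here is a minimal implementation of the classic gap-weighted subsequence kernel (Lodhi et al.) applied to token sequences. It omits the walk-weighted variant's role- and position-specific weights described above, and the example path tokens are invented for illustration.

```python
def subsequence_kernel(s, t, n=2, lam=0.75):
    """Gap-weighted subsequence kernel over token sequences (Lodhi et al. style DP)."""
    # Kp[i][a][b] = K'_i(s[:a], t[:b]): weighted count of common length-i subsequences
    # penalized by lam per skipped position up to the end of each prefix.
    Kp = [[[0.0] * (len(t) + 1) for _ in range(len(s) + 1)] for _ in range(n)]
    for a in range(len(s) + 1):
        for b in range(len(t) + 1):
            Kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(1, len(s) + 1):
            Kpp = 0.0  # running K'' over prefixes of t
            for b in range(1, len(t) + 1):
                Kpp = lam * Kpp + (lam * lam * Kp[i - 1][a - 1][b - 1]
                                   if s[a - 1] == t[b - 1] else 0.0)
                Kp[i][a][b] = lam * Kp[i][a - 1][b] + Kpp
    # Final kernel value: sum contributions of every matching token pair.
    k = 0.0
    for a in range(1, len(s) + 1):
        for b in range(1, len(t) + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * Kp[n - 1][a - 1][b - 1]
    return k

# Toy usage on two invented shortest-path token sequences
p1 = "GerE nsubj stimulates dobj transcription".split()
p2 = "sigK nsubj activates dobj expression".split()
print(round(subsequence_kernel(p1, p2, n=2, lam=0.75), 4))
```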
Subjects
Computational Biology/methods, Protein Interaction Mapping/methods, Proteins/chemistry, Proteins/metabolism, Databases, Protein, Pattern Recognition, Automated/methods
ABSTRACT
MOTIVATION: Automatic knowledge discovery and efficient information access, such as named entity recognition and relation extraction between entities, have recently become critical issues in the biomedical literature. However, the inherent difficulty of the relation extraction task, caused mainly by the diversity of natural language, is further compounded in the biomedical domain because biomedical sentences are commonly long and complex. In addition, relation extraction often involves modeling long-range dependencies, discontiguous word patterns, and semantic relations, for which pattern-based methodology is not directly applicable. RESULTS: In this article, we shift the focus of biomedical relation extraction from the problem of pattern extraction to the problem of kernel construction. We suggest four kernels, predicate, walk, dependency, and hybrid kernels, to adequately encapsulate the information required for relation prediction based on the sentential structures involving the two entities. For this purpose, we view the dependency structure of a sentence as a graph, which allows the system to extract the essential structure from the complex syntax by finding the shortest path between entities. The kernels we suggest are augmented gradually from flat feature descriptions to structural descriptions of the shortest paths. As a result, we obtain a very promising result, a 77.5 F-score with the walk kernel on the Language Learning in Logic (LLL) 05 genic interaction shared task. AVAILABILITY: The algorithms are free for use in academic research and are available from our web site http://mllab.sogang.ac.kr/~shkim/LLL05.tar.gz.
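A minimal sketch of the shortest-dependency-path idea follows, using spaCy for parsing and networkx for the path search; both tools and the example sentence are assumptions for illustration and are not the components used in the original work.

```python
# Sketch: treat the dependency parse as a graph and take the shortest path
# between two entity tokens; the resulting walk of words and dependency labels
# is the kind of structure the kernels above operate on.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("GerE stimulates the transcription of cotD by sigma-K RNA polymerase.")

# Build an undirected graph over token indices, labeling edges with dependency relations.
graph = nx.Graph()
for token in doc:
    if token.head.i != token.i:  # skip the root's self-loop
        graph.add_edge(token.i, token.head.i, dep=token.dep_)

e1 = next(t.i for t in doc if t.text == "GerE")
e2 = next(t.i for t in doc if t.text == "cotD")

path = nx.shortest_path(graph, source=e1, target=e2)
print(" -> ".join(doc[i].text for i in path))                 # lexical walk
print([graph[a][b]["dep"] for a, b in zip(path, path[1:])])   # dependency labels along the path
```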