Results 1 - 20 of 20
1.
J Am Med Inform Assoc ; 31(9): 2010-2018, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38904416

ABSTRACT

OBJECTIVE: To investigate the use of demonstrations in large language models (LLMs) for biomedical relation extraction. This study introduces a framework comprising three types of adaptive tuning methods and assesses their impact and effectiveness. MATERIALS AND METHODS: Our study was conducted in two phases. Initially, we analyzed a range of demonstration components vital for LLMs' biomedical data capabilities, including task descriptions and examples, experimenting with various combinations. Subsequently, we introduced the LLM instruction-example adaptive prompting (LEAP) framework, comprising instruction adaptive tuning, example adaptive tuning, and instruction-example adaptive tuning methods. This framework systematically investigates both adaptive task descriptions and adaptive examples within the demonstration. We assessed the performance of the LEAP framework on the DDI, ChemProt, and BioRED datasets, employing LLMs such as Llama2-7b, Llama2-13b, and MedLLaMA_13B. RESULTS: Our findings indicated that the Instruction + Options + Example format and its expanded form substantially improved F1 scores over the standard Instruction + Options mode for zero-shot LLMs. The LEAP framework, particularly through its example adaptive prompting, demonstrated superior performance over conventional instruction tuning across all models. Notably, the MedLLaMA_13B model achieved an exceptional F1 score of 95.13 on the ChemProt dataset using this method. Significant improvements were also observed on the DDI 2013 and BioRED datasets, confirming the method's robustness in sophisticated data extraction scenarios. CONCLUSION: The LEAP framework offers a compelling strategy for enhancing LLM training, steering away from extensive fine-tuning towards more dynamic and contextually enriched prompting methodologies, as showcased here for biomedical relation extraction.


Subjects
Natural Language Processing, Data Mining/methods, Datasets as Topic
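The Instruction + Options + Example demonstration format described above can be illustrated with a small prompt-building sketch; the template wording, relation labels, and worked example below are assumptions for illustration, not the authors' exact LEAP prompts.

# Sketch: assembling an Instruction + Options + Example demonstration for
# zero-shot relation extraction with an LLM. Wording, labels, and the worked
# example are illustrative assumptions, not the exact LEAP templates.

def build_prompt(instruction, options, example, sentence, head, tail):
    parts = [
        f"Task: {instruction}",
        "Options: " + ", ".join(options),
        (f"Example: {example['sentence']}\n"
         f"Entities: {example['head']} / {example['tail']}\n"
         f"Relation: {example['label']}"),
        f"Sentence: {sentence}",
        f"Entities: {head} / {tail}",
        "Relation:",
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Classify the relation between the two marked entities.",
    options=["CPR:3", "CPR:4", "CPR:9", "false"],   # e.g. ChemProt-style labels
    example={"sentence": "Aspirin inhibits COX-1.",
             "head": "Aspirin", "tail": "COX-1", "label": "CPR:4"},
    sentence="Tamoxifen antagonizes the estrogen receptor.",
    head="Tamoxifen", tail="estrogen receptor",
)
print(prompt)   # pass this string to Llama2-7b/13b or MedLLaMA_13B
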
2.
J Biomed Inform ; 156: 104676, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38876451

ABSTRACT

Biomedical relation extraction has long been considered a challenging task due to the specialization and complexity of biomedical texts. Syntactic knowledge has been widely employed in existing research to enhance relation extraction, providing guidance for the semantic understanding and text representation of models. However, the utilization of syntactic knowledge in most studies is not exhaustive, and there is often a lack of fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that comprehensively considers both syntactic dependency type information and syntactic position information to distinguish the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations to introduce location-enhanced syntactic knowledge that guides our biomedical relation extraction. Experimental results on three widely used English benchmark datasets in the biomedical domain show that our approach consistently outperforms a range of baseline models, demonstrating that it not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.


Subjects
Natural Language Processing, Semantics, Data Mining/methods, Algorithms, Humans
3.
J Biomed Inform ; 155: 104658, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38782169

ABSTRACT

OBJECTIVE: Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach in order to eventually increase the size and utility of SemMedDB. METHODS: We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. RESULTS: Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on the 12K abstracts shows that the model could double the size of SemMedDB when applied to the whole of PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. CONCLUSION: These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining. Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.


Subjects
Data Mining, Natural Language Processing, Semantics, Data Mining/methods, MEDLINE, PubMed, Algorithms, Humans, Factual Databases
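The three entity representations compared above (mentions, semantic types, semantic groups) amount to different ways of marking the argument pair before classification; a rough sketch follows, with tag strings and type names assumed rather than taken from the paper.

# Sketch: preparing one sentence under the three entity representations
# mentioned above. Tag strings and type names are assumptions, not the
# authors' exact markup.

def mark_entities(sentence, subj, obj, mode="mention",
                  subj_type="GENE", obj_type="DISEASE",
                  subj_group="GENE", obj_group="DISO"):
    if mode == "mention":             # keep the surface mentions
        s, o = subj, obj
    elif mode == "semantic_type":     # replace mentions with semantic types
        s, o = f"@{subj_type}$", f"@{obj_type}$"
    elif mode == "semantic_group":    # replace mentions with coarser groups
        s, o = f"@{subj_group}$", f"@{obj_group}$"
    else:
        raise ValueError(mode)
    return sentence.replace(subj, s).replace(obj, o)

sent = "BRCA1 mutations increase the risk of breast cancer."
print(mark_entities(sent, "BRCA1", "breast cancer", mode="semantic_type"))
# -> "@GENE$ mutations increase the risk of @DISEASE$."
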
4.
J Healthc Inform Res ; 8(2): 206-224, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38681754

ABSTRACT

Biomedical relation extraction (RE) is critical in constructing high-quality knowledge graphs and databases as well as supporting many downstream text mining applications. This paper explores prompt tuning on biomedical RE and its few-shot scenarios, aiming to propose a simple yet effective model for this specific task. Prompt tuning reformulates natural language processing (NLP) downstream tasks into masked language problems by embedding specific text prompts into the original input, facilitating the adaptation of pre-trained language models (PLMs) to better address these tasks. This study presents a customized prompt tuning model designed explicitly for biomedical RE, including its applicability in few-shot learning contexts. The model's performance was rigorously assessed using the chemical-protein relation (CHEMPROT) dataset from BioCreative VI and the drug-drug interaction (DDI) dataset from SemEval-2013, showcasing its superior performance over conventional fine-tuned PLMs across both datasets, encompassing few-shot scenarios. This observation underscores the effectiveness of prompt tuning in enhancing the capabilities of conventional PLMs, though the extent of enhancement may vary by specific model. Additionally, the model demonstrated a harmonious balance between simplicity and efficiency, matching state-of-the-art performance without needing external knowledge or extra computational resources. The pivotal contribution of our study is the development of a suitably designed prompt tuning model, highlighting prompt tuning's effectiveness in biomedical RE. It offers a robust, efficient approach to the field's challenges and represents a significant advancement in extracting complex relations from biomedical texts. Supplementary Information: The online version contains supplementary material available at 10.1007/s41666-024-00162-9.
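The reformulation of relation extraction as a masked-language (cloze) problem, as described in this abstract, can be sketched as follows; the template and verbalizer words are illustrative assumptions rather than the paper's actual design.

# Sketch: prompt tuning recasts relation classification as a cloze problem.
# The template and verbalizer below are illustrative assumptions.

TEMPLATE = "{sentence} The relation between {head} and {tail} is [MASK]."

VERBALIZER = {              # label words scored at [MASK] -> relation classes
    "inhibitor": "CPR:4",
    "activator": "CPR:3",
    "substrate": "CPR:9",
    "unrelated": "false",
}

def build_cloze(sentence, head, tail):
    return TEMPLATE.format(sentence=sentence, head=head, tail=tail)

text = build_cloze("Aspirin inhibits COX-1.", "Aspirin", "COX-1")
print(text)
# A masked language model scores the verbalizer words at the [MASK] position;
# the highest-scoring word is mapped back to its relation class.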

5.
BMC Bioinformatics ; 24(1): 486, 2023 Dec 19.
Article in English | MEDLINE | ID: mdl-38114906

ABSTRACT

BACKGROUND: Automatic and accurate extraction of diverse biomedical relations from literature is a crucial component of biomedical text mining. Currently, stacking various classification networks on pre-trained language models and fine-tuning them is a common framework for solving the biomedical relation extraction (BioRE) problem end to end. However, sequence-based pre-trained language models underutilize the graphical topology of language to some extent. In addition, sequence-oriented deep neural networks have limitations in processing graphical features. RESULTS: In this paper, we propose a novel method for the sentence-level BioRE task, BioEGRE (BioELECTRA and Graph pointer neural network for Relation Extraction), aimed at leveraging linguistic topological features. First, the biomedical literature is preprocessed to retain sentences involving pre-defined entity pairs. Second, SciSpaCy is employed to conduct dependency parsing; sentences are modeled as graphs based on the parsing results; BioELECTRA is utilized to generate token-level representations, which are modeled as attributes of nodes in the sentence graphs; a graph pointer neural network layer is employed to select the most relevant multi-hop neighbors to optimize representations; and a fully connected neural network layer is employed to generate the sentence-level representation. Finally, the softmax function is employed to calculate the probabilities. Our proposed method is evaluated on three BioRE tasks: a multi-class task (CHEMPROT) and two binary tasks (GAD and EU-ADR). The results show that our method achieves F1-scores of 79.97% (CHEMPROT), 83.31% (GAD), and 83.51% (EU-ADR), surpassing the performance of existing state-of-the-art models. CONCLUSION: The experimental results on three biomedical benchmark datasets demonstrate the effectiveness and generalization of BioEGRE, indicating that linguistic topology and a graph pointer neural network layer explicitly improve performance on BioRE tasks.


Subjects
Language, Computer Neural Networks, Data Mining, Linguistics, Natural Language Processing
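The graph-construction step described above (dependency parsing with SciSpaCy, sentences modeled as graphs) might look roughly like the sketch below, assuming the en_core_sci_sm model is installed; the BioELECTRA token representations that the paper attaches to the nodes are omitted.

# Sketch: turning a sentence into a dependency graph with SciSpaCy, as in the
# preprocessing described above. Assumes scispacy and its en_core_sci_sm model
# are installed; BioELECTRA node features are omitted for brevity.
import spacy
import networkx as nx

nlp = spacy.load("en_core_sci_sm")

def sentence_graph(sentence):
    doc = nlp(sentence)
    g = nx.Graph()
    for tok in doc:
        g.add_node(tok.i, text=tok.text, pos=tok.pos_)
    for tok in doc:
        if tok.head.i != tok.i:                 # skip the root's self-loop
            g.add_edge(tok.head.i, tok.i, dep=tok.dep_)
    return g

g = sentence_graph("Aspirin inhibits COX-1 in platelets.")
print(g.number_of_nodes(), g.number_of_edges())
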
6.
J Med Internet Res ; 25: e48115, 2023 09 20.
Article in English | MEDLINE | ID: mdl-37632414

ABSTRACT

BACKGROUND: Biomedical relation extraction (RE) is of great importance for researchers to conduct systematic biomedical studies. It not only helps knowledge mining, such as knowledge graphs and novel knowledge discovery, but also promotes translational applications, such as clinical diagnosis, decision-making, and precision medicine. However, the relations between biomedical entities are complex and diverse, and comprehensive biomedical RE is not yet well established. OBJECTIVE: We aimed to investigate and improve large-scale RE with diverse relation types and conduct usability studies with application scenarios to optimize biomedical text mining. METHODS: Data sets containing 125 relation types with different entity semantic levels were constructed to evaluate the impact of entity semantic information on RE, and performance analysis was conducted on different model architectures and domain models. This study also proposed a continued pretraining strategy and integrated models with scripts into a tool. Furthermore, this study applied RE to the COVID-19 corpus with article topics and application scenarios of clinical interest to assess and demonstrate its biological interpretability and usability. RESULTS: The performance analysis revealed that RE achieves the best performance when the detailed semantic type is provided. For a single model, PubMedBERT with continued pretraining performed the best, with an F1-score of 0.8998. Usability studies on COVID-19 demonstrated the interpretability and usability of RE, and a relation graph database was constructed, which was used to reveal existing and novel drug paths with edge explanations. The models (including pretrained and fine-tuned models), integrated tool (Docker), and generated data (including the COVID-19 relation graph database and drug paths) have been made publicly available to the biomedical text mining community and clinical researchers. CONCLUSIONS: This study provided a comprehensive analysis of RE with diverse relation types. Optimized RE models and tools for diverse relation types were developed, which can be widely used in biomedical text mining. Our usability studies provided a proof-of-concept demonstration of how large-scale RE can be leveraged to facilitate novel research.


Subjects
COVID-19, Humans, Data Mining, Factual Databases, Knowledge, Precision Medicine
7.
J Biomed Inform ; 144: 104445, 2023 08.
Article in English | MEDLINE | ID: mdl-37467835

ABSTRACT

In the biomedical literature, cross-sentence passages often express rich knowledge, and extracting the interaction relations between entities from cross-sentence text is of great significance to biomedical research. However, compared with single sentences, cross-sentence text has a longer sequence length, so research on cross-sentence information extraction must focus on learning the structural information of context dependencies. Effectively handling the global dependencies and structural information of long sequences remains a challenge, and graph-oriented modeling methods have recently received increasing attention. In this paper, we propose a new graph attention network guided by syntactic dependency relationships (SR-GAT) for extracting biomedical relations from cross-sentence text. It allows each node to attend to other nodes in its neighborhood, regardless of the sequence length. The attention weight between nodes is given by a syntactic relation graph probability network (SR-GPR), which encodes the syntactic dependency between nodes and guides the graph attention mechanism to learn information about the dependency structure. The learned feature representation retains information about node-to-node syntactic dependencies and can further discover global dependencies effectively. The experimental results on a publicly available biomedical dataset demonstrate that our method achieves state-of-the-art performance while requiring significantly fewer computational resources. Specifically, in the "drug-mutation" relation extraction task, our method achieves an accuracy of 93.78% for binary classification and 92.14% for multi-class classification. In the "drug-gene-mutation" relation extraction task, our method achieves an accuracy of 93.22% for binary classification and 92.28% for multi-class classification. Across all relation extraction tasks, our method improves accuracy by an average of 0.49% compared to the existing best model. Furthermore, our method achieved an accuracy of 69.5% in text classification, surpassing most existing models and demonstrating its robustness in generalizing across different domains without additional fine-tuning.


Subjects
Biomedical Research, Language, Information Storage and Retrieval
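The core idea above, attention between token nodes that is guided by syntactic adjacency, can be illustrated with a single masked-attention step; this is a generic sketch under assumed dimensions, not the SR-GAT/SR-GPR architecture itself.

# Sketch: one attention step in which a dependency adjacency matrix gates
# which tokens may attend to which. A generic illustration, not the authors'
# SR-GAT/SR-GPR implementation.
import torch
import torch.nn.functional as F

def dependency_gated_attention(h, adj):
    # h:   (n_tokens, d) node representations
    # adj: (n_tokens, n_tokens), 1 where a dependency edge or self-loop exists
    scores = h @ h.t() / h.size(-1) ** 0.5           # pairwise attention scores
    scores = scores.masked_fill(adj == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)                 # attend only along edges
    return attn @ h                                  # updated node representations

h = torch.randn(5, 16)
adj = torch.eye(5)
adj[0, 1] = adj[1, 0] = 1.0                          # a single dependency edge
print(dependency_gated_attention(h, adj).shape)      # torch.Size([5, 16])
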
8.
BMC Bioinformatics ; 24(1): 144, 2023 Apr 12.
Article in English | MEDLINE | ID: mdl-37046202

ABSTRACT

Extraction of associations between single nucleotide polymorphisms (SNPs) and phenotypes from the biomedical literature is a vital task in BioNLP. Recently, some methods have been developed to extract mutation-disease associations. However, no accessible method for extracting SNP-phenotype associations from text considers their degree of certainty. In this paper, several machine learning methods were developed to extract ranked SNP-phenotype associations from biomedical abstracts and were then compared to each other. Shallow machine learning methods (random forest, logistic regression, and decision tree), two kernel-based methods (subtree and local context), a rule-based method, a deep CNN-LSTM-based method, and two BERT-based methods were developed in this study to extract the associations. The experiments indicated that although the linguistic features used could support an association extraction method that outperforms the kernel-based counterparts, the deep learning and BERT-based methods exhibited the best performance, with PubMedBERT-LSTM outperforming the other developed methods. Moreover, similar experiments were conducted to estimate the degree of certainty of the extracted associations, which can be used to assess the strength of a reported association. These experiments revealed that our proposed PubMedBERT-CNN-LSTM method outperformed the other sophisticated methods on this task.


Subjects
Machine Learning, Single Nucleotide Polymorphism, Phenotype, Random Forest Algorithm, Mutation
9.
JMIR Med Inform ; 10(10): e41136, 2022 Oct 20.
Article in English | MEDLINE | ID: mdl-36264604

ABSTRACT

BACKGROUND: With the rapid expansion of biomedical literature, biomedical information extraction has attracted increasing attention from researchers. In particular, relation extraction between 2 entities is a long-term research topic. OBJECTIVE: This study aimed to perform 2 multiclass relation extraction tasks from the Biomedical Natural Language Processing Workshop 2019 Open Shared Tasks: the Bacteria-Biotope relation extraction (BB-rel) task and the plant seed development binary relation extraction (SeeDev-binary) task. In essence, these 2 tasks aim to extract the relation between annotated entity pairs from biomedical texts, which is a challenging problem. METHODS: Traditional approaches adopted feature- or kernel-based methods and achieved good performance. For these tasks, we propose a deep learning model based on a combination of several distributed features, such as domain-specific word embedding, part-of-speech embedding, entity-type embedding, distance embedding, and position embedding. A multi-head attention mechanism is used to extract the global semantic features of an entire sentence. Meanwhile, we introduce a dependency-type feature and the shortest dependency path connecting the 2 candidate entities in the syntactic dependency graph to enrich the feature representation. RESULTS: Experiments show that our proposed model has excellent performance in biomedical relation extraction, achieving F1 scores of 65.56% and 38.04% on the test sets of the BB-rel and SeeDev-binary tasks. Especially in the SeeDev-binary task, the F1 score of our model is superior to that of other existing models and achieves state-of-the-art performance. CONCLUSIONS: We demonstrated that the multi-head attention mechanism can learn relevant syntactic and semantic features in different representation subspaces and at different positions to extract a comprehensive feature representation. Moreover, syntactic dependency features can improve the performance of the model by capturing dependency relations between the entities in biomedical texts.
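The feature combination described above, several embeddings concatenated per token and passed through multi-head self-attention, could be sketched as below; all dimensions and vocabulary sizes are placeholder assumptions.

# Sketch: concatenating the distributed features named above and applying
# multi-head self-attention. Dimensions and vocab sizes are placeholders.
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    def __init__(self, n_words=30000, n_pos=50, n_types=20, n_dist=200):
        super().__init__()
        self.word = nn.Embedding(n_words, 100)   # domain-specific word embedding
        self.pos = nn.Embedding(n_pos, 20)       # part-of-speech embedding
        self.etype = nn.Embedding(n_types, 20)   # entity-type embedding
        self.dist = nn.Embedding(n_dist, 20)     # distance-to-entity embedding
        self.attn = nn.MultiheadAttention(embed_dim=160, num_heads=8,
                                          batch_first=True)

    def forward(self, w, p, t, d):
        x = torch.cat([self.word(w), self.pos(p),
                       self.etype(t), self.dist(d)], dim=-1)
        out, _ = self.attn(x, x, x)              # global semantic features
        return out

enc = FeatureEncoder()
w = torch.randint(0, 30000, (2, 12)); p = torch.randint(0, 50, (2, 12))
t = torch.randint(0, 20, (2, 12));    d = torch.randint(0, 200, (2, 12))
print(enc(w, p, t, d).shape)                     # torch.Size([2, 12, 160])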

10.
J Biomed Inform ; 132: 104135, 2022 08.
Article in English | MEDLINE | ID: mdl-35842217

ABSTRACT

Certain categories in multi-category biomedical relation extraction are linguistically similar to some extent. Category-related keywords and the syntactic structures of samples in these categories exhibit notable features that are very useful for biomedical relation extraction. Pre-trained models have been widely used and have achieved great success in biomedical relation extraction, but they are still unable to mine this kind of information accurately. To solve this problem, we present a syntax-enhanced model based on category keywords. First, we prune syntactic dependency trees using category keywords obtained by the chi-square test. This reduces noisy information introduced by current syntactic parsing tools and retains useful information related to the categories. Next, to encode the category-related syntactic dependency trees, a syntactic transformer is presented, which enhances the ability of the pre-trained model to capture syntactic structures and to distinguish multiple categories. We evaluate our method on three biomedical datasets. Compared with state-of-the-art models, our method performs better on these datasets. We conduct further analysis to verify the effectiveness of our method.


Subjects
Linguistics
11.
BMC Bioinformatics ; 23(1): 111, 2022 Mar 31.
Article in English | MEDLINE | ID: mdl-35361129

ABSTRACT

BACKGROUND: Databases are fundamental to advancing biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most relevant BioRE tasks. Nevertheless, few resources have been developed to train models for GDA extraction, and these resources are all limited in size, preventing models from scaling effectively to large amounts of data. RESULTS: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the GDA was extracted, the corresponding GDA, and information about the gene-disease pair. CONCLUSIONS: TBGA is among the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We have made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.


Subjects
Data Mining, Research Design, Factual Databases, Humans
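An instance of the shape described above (the source sentence, the gene-disease pair, and the associated relation) could be represented as in the following sketch; the field names and label values are illustrative, not the actual TBGA schema.

# Sketch: one GDA-extraction instance of the shape described above
# (sentence, gene-disease pair, relation label). Field names and label
# values are illustrative, not the actual TBGA schema.
from dataclasses import dataclass

@dataclass
class GDAInstance:
    sentence: str        # sentence the association was extracted from
    gene: str            # gene mention / identifier
    disease: str         # disease mention / identifier
    relation: str        # e.g. "therapeutic", "biomarker", "NA"

ex = GDAInstance(
    sentence="BRCA1 mutations are associated with hereditary breast cancer.",
    gene="BRCA1",
    disease="hereditary breast cancer",
    relation="biomarker",
)
print(ex)
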
12.
BMC Bioinformatics ; 23(1): 120, 2022 Apr 04.
Article in English | MEDLINE | ID: mdl-35379166

ABSTRACT

BACKGROUND: Recently, automatically extracting biomedical relations has become a significant subject in biomedical research due to the rapid growth of the biomedical literature. Since their adaptation to the biomedical domain, transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we explore approaches to improving the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages of its applications. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. We also propose methods to incorporate knowledge that is otherwise ignored in the last layer of BERT to improve its fine-tuning. RESULTS: The experimental results demonstrate that our approaches for pre-training and fine-tuning can improve BERT model performance. After combining the two proposed techniques, our approach outperforms the original BERT models with an average F1 score improvement of 2.1% on relation extraction tasks. Moreover, our approach achieves state-of-the-art performance on three relation extraction benchmark datasets. CONCLUSIONS: The extra pre-training step on sub-domain data can help the BERT model generalize on specific tasks, and our proposed fine-tuning mechanism can utilize knowledge in the last layer of BERT to boost model performance. Furthermore, the combination of these two approaches further improves the performance of the BERT model on relation extraction tasks.


Subjects
Biomedical Research, Data Mining, Biomedical Research/methods, Data Mining/methods, Electric Power Supplies, Natural Language Processing, Publications
13.
BMC Bioinformatics ; 23(1): 20, 2022 Jan 06.
Article in English | MEDLINE | ID: mdl-34991458

ABSTRACT

BACKGROUND: In biomedical research, extracting chemical and disease relations from unstructured biomedical literature is an essential task. Effective context understanding and knowledge integration are two main research problems in this task. Most work on relation extraction focuses on classification of entity mention pairs. Inspired by the effectiveness of machine reading comprehension (RC) for context understanding, solving biomedical relation extraction within an RC framework, at both intra-sentential and inter-sentential levels, is a new topic worth exploring. Beyond unstructured biomedical text, many structured knowledge bases (KBs) provide valuable guidance for biomedical relation extraction, and utilizing such knowledge within the RC framework also warrants investigation. We propose a knowledge-enhanced reading comprehension (KRC) framework that leverages reading comprehension and prior knowledge for biomedical relation extraction. First, we generate questions for each relation, which reformulates the relation extraction task as a question answering task. Second, based on the RC framework, we integrate knowledge representation through an efficient knowledge-enhanced attention interaction mechanism to guide biomedical relation extraction. RESULTS: The proposed model was evaluated on the BioCreative V CDR dataset and the CHR dataset. Experiments show that our model achieved competitive document-level F1 scores of 71.18% and 93.3%, respectively, compared with other methods. CONCLUSION: Result analysis reveals that open-domain reading comprehension data and knowledge representation can help improve biomedical relation extraction within our proposed KRC framework. Our work can encourage more research on bridging reading comprehension and biomedical relation extraction and promote progress in biomedical relation extraction.


Subjects
Biomedical Research, Comprehension, Knowledge Bases, Language
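The first step described above, turning each candidate relation into a question so that extraction becomes question answering, might look roughly like this; the relation names and question templates are assumptions for illustration.

# Sketch: reformulating relation extraction as question answering by
# generating one question per candidate relation, as described above.
# Relation names and templates are illustrative assumptions.

QUESTION_TEMPLATES = {
    "chemical_induces_disease": "Does {chemical} induce {disease}?",
    "chemical_treats_disease": "Does {chemical} treat {disease}?",
}

def make_qa_examples(passage, chemical, disease):
    examples = []
    for relation, template in QUESTION_TEMPLATES.items():
        examples.append({
            "question": template.format(chemical=chemical, disease=disease),
            "context": passage,      # answered by a reading-comprehension model
            "relation": relation,
        })
    return examples

for ex in make_qa_examples(
        "Long-term use of drug X was followed by hepatotoxicity.",
        "drug X", "hepatotoxicity"):
    print(ex["question"])
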
14.
IEEE Int Conf Healthc Inform ; 2022: 608-609, 2022 Jun.
Article in English | MEDLINE | ID: mdl-37664001

ABSTRACT

Biomedical relation extraction plays a critical role in the construction of high-quality knowledge graphs and databases, which can in turn support many downstream applications. Pre-trained prompt tuning, as a new paradigm, has shown great potential on many natural language processing (NLP) tasks. By inserting a piece of text into the original input, prompt tuning converts NLP tasks into masked language problems, which can be better addressed by pre-trained language models (PLMs). In this study, we applied pre-trained prompt tuning to chemical-protein relation extraction using the BioCreative VI CHEMPROT dataset. The experimental results showed that pre-trained prompt tuning outperformed the baseline approach in chemical-protein interaction classification. We conclude that prompt tuning can improve the efficiency of PLMs on chemical-protein relation extraction tasks.

15.
J Am Med Inform Assoc ; 28(12): 2571-2581, 2021 11 25.
Article in English | MEDLINE | ID: mdl-34524450

ABSTRACT

OBJECTIVE: Various methods have been proposed to deal with erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to address the insufficient modeling of instance-label correlations for predicting biomedical relations using deep learning and reinforcement learning. MATERIALS AND METHODS: In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from the Unified Medical Language System with MEDLINE abstracts, as well as on benchmark datasets. In PACNN+RL, PACNN was introduced to encode the semantic information of biomedical text, and an RL method with a memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks. RESULTS: The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods with F1-scores of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the 2011 DDI corpus. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets. CONCLUSIONS: The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks, facilitating biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.


Subjects
Computer Neural Networks, Unified Medical Language System, Semantics
16.
J Biomed Semantics ; 12(1): 16, 2021 08 18.
Article in English | MEDLINE | ID: mdl-34407869

ABSTRACT

BACKGROUND: Transfer learning aims at enhancing machine learning performance on a problem by reusing labeled data originally designed for a related, but distinct, problem. In particular, domain adaptation consists, for a specific task, of reusing training data developed for the same task but in a distinct domain. This is particularly relevant to applications of deep learning in natural language processing, because they usually require large annotated corpora that may not exist for the targeted domain but do exist for side domains. RESULTS: In this paper, we experiment with transfer learning for the task of relation extraction from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation by obtaining better performances than the state of the art on two biomedical relation extraction tasks and equal performances on two others, for which little annotated data is available. Furthermore, we propose an analysis of the role that syntactic features may play in transfer learning for relation extraction. CONCLUSION: Given the difficulty of manually annotating corpora in the biomedical domain, the proposed transfer learning method offers a promising alternative for achieving good relation extraction performance in domains with scarce resources. Our analysis also illustrates the importance of syntax in transfer learning, underlining the value, in this domain, of favoring approaches that embed syntactic features.


Subjects
Machine Learning, Natural Language Processing
17.
BMC Bioinformatics ; 21(1): 188, 2020 May 14.
Article in English | MEDLINE | ID: mdl-32410573

ABSTRACT

BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.


Subjects
Algorithms, Information Storage and Retrieval, Semantics, Humans, Natural Language Processing, PubMed, Unified Medical Language System
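SemRep output is built around subject-predicate-object triples over UMLS concepts; a minimal container for such a triple might look as follows. The field names are illustrative rather than the actual SemMedDB column layout, and the CUIs in the example are placeholders.

# Sketch: a minimal container for the subject-predicate-object triples
# SemRep extracts. Field names are illustrative, not the SemMedDB schema.
from dataclasses import dataclass

@dataclass
class SemanticTriple:
    subject_cui: str     # UMLS concept of the subject argument
    predicate: str       # e.g. "TREATS", "CAUSES", "INTERACTS_WITH"
    object_cui: str      # UMLS concept of the object argument
    sentence: str        # source sentence in the PubMed abstract

t = SemanticTriple(
    subject_cui="C0000000",      # placeholder CUI for the subject concept
    predicate="TREATS",
    object_cui="C0000001",       # placeholder CUI for the object concept
    sentence="Aspirin is commonly used to treat headache.",
)
print(t.subject_cui, t.predicate, t.object_cui)
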
18.
JMIR Med Inform ; 8(7): e17638, 2020 Jul 31.
Article in English | MEDLINE | ID: mdl-32459636

ABSTRACT

BACKGROUND: Automatically extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting complex semantic relationships between entities in documents, comprising both intrasentence and intersentence relations. Most previous methods did not consider dependency syntactic information across sentences, which is very valuable for the relation extraction task, in particular for extracting intersentence relations accurately. OBJECTIVE: In this paper, we propose a novel end-to-end neural network based on a graph convolutional network (GCN) and multihead attention, which makes use of dependency syntactic information across sentences to improve the CDR extraction task. METHODS: To improve the performance of intersentence relation extraction, we constructed a document-level dependency graph to capture dependency syntactic information across sentences. A GCN is applied to capture the feature representation of the document-level dependency graph. The multihead attention mechanism is employed to learn the relatively important context features from different semantic subspaces. To enhance the input representation, deep contextual representations are used in our model instead of traditional word embeddings. RESULTS: We evaluated our method on the CDR corpus. The experimental results show that our method achieves an F-measure of 63.5%, which is superior to other state-of-the-art methods. At the intrasentence level, our method achieves a precision, recall, and F-measure of 59.1%, 81.5%, and 68.5%, respectively. At the intersentence level, our method achieves a precision, recall, and F-measure of 47.8%, 52.2%, and 49.9%, respectively. CONCLUSIONS: The GCN model can effectively exploit cross-sentence dependency information to improve the performance of intersentence CDR extraction. Both deep contextual representations and multihead attention are helpful in the CDR extraction task.
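A single graph-convolution step over a document-level dependency graph, the building block referred to above, can be sketched as follows; this is a generic GCN update under assumed dimensions, not the paper's full GCN-plus-multihead-attention architecture.

# Sketch: one graph-convolution step over a document-level dependency graph,
# the building block described above. Generic GCN update, not the paper's
# full architecture.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h:   (n_tokens, dim) contextual token representations
        # adj: (n_tokens, n_tokens) dependency adjacency with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.linear(adj @ h) / deg)   # mean over neighbors

layer = GCNLayer(dim=128)
h = torch.randn(30, 128)
adj = torch.eye(30)                 # self-loops; add dependency edges as needed
print(layer(h, adj).shape)          # torch.Size([30, 128])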

19.
BMC Bioinformatics ; 20(1): 403, 2019 Jul 22.
Article in English | MEDLINE | ID: mdl-31331263

ABSTRACT

BACKGROUND: Automatically understanding chemical-disease relations (CDRs) is crucial in various areas of biomedical research and health care. Supervised machine learning provides a feasible solution for automatically extracting relations between biomedical entities from the scientific literature; its success, however, depends heavily on large-scale biomedical corpora that must be manually annotated with intensive labor and tremendous investment. RESULTS: We present an attention-based distant supervision paradigm for the BioCreative-V CDR extraction task. Training examples at both the intra- and inter-sentence levels are generated automatically from the Comparative Toxicogenomics Database (CTD) without any human intervention. An attention-based neural network and a stacked auto-encoder network are applied, respectively, to induce learning models and extract relations at both levels. After merging the results of both levels, document-level CDRs can finally be extracted. The approach achieves a precision/recall/F1-score of 60.3%/73.8%/66.4%, outperforming state-of-the-art supervised learning systems without using any annotated corpus. CONCLUSION: Our experiments demonstrate that distant supervision is promising for extracting chemical-disease relations from the biomedical literature, and that capturing both local and global attention features simultaneously is effective in attention-based distantly supervised learning.


Subjects
Algorithms, Disease, Supervised Machine Learning, Toxicogenetics, Databases as Topic, Factual Databases, Humans, Computer Neural Networks, Workflow
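The label-generation idea above, aligning chemical-disease pairs found in text against a curated knowledge base so that no manual annotation is needed, can be sketched as follows; the lookup set is a toy stand-in for CTD.

# Sketch: distant supervision as described above - a sentence mentioning a
# chemical-disease pair is labeled positive if that pair is recorded in the
# knowledge base (here a toy stand-in for CTD), negative otherwise.

CTD_PAIRS = {                      # toy stand-in for CTD curated relations
    ("aspirin", "reye syndrome"),
    ("thalidomide", "phocomelia"),
}

def distant_label(sentence, chemical, disease):
    label = int((chemical.lower(), disease.lower()) in CTD_PAIRS)
    return {"sentence": sentence, "chemical": chemical,
            "disease": disease, "label": label}

print(distant_label("Aspirin use in children has been linked to Reye syndrome.",
                    "Aspirin", "Reye syndrome"))
# -> {'sentence': ..., 'chemical': 'Aspirin', 'disease': 'Reye syndrome', 'label': 1}
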
20.
Brief Bioinform ; 18(1): 160-178, 2017 01.
Article in English | MEDLINE | ID: mdl-26851224

ABSTRACT

Research on extracting biomedical relations has received growing attention recently, with numerous biological and clinical applications including those in pharmacogenomics, clinical trial screening and adverse drug reaction detection. The ability to accurately capture both semantic and syntactic structures in text expressing these relations becomes increasingly critical to enable deep understanding of scientific papers and clinical narratives. Shared task challenges have been organized by both bioinformatics and clinical informatics communities to assess and advance the state-of-the-art research. Significant progress has been made in algorithm development and resource construction. In particular, graph-based approaches bridge semantics and syntax, often achieving the best performance in shared tasks. However, a number of problems at the frontiers of biomedical relation extraction continue to pose interesting challenges and present opportunities for great improvement and fruitful research. In this article, we place biomedical relation extraction against the backdrop of its versatile applications, present a gentle introduction to its general pipeline and shared resources, review the current state-of-the-art in methodology advancement, discuss limitations and point out several promising future directions.


Subjects
Semantics, Algorithms, Computational Biology, Data Mining, Humans