ABSTRACT
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher-level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state of the art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Subject(s)
Pattern Recognition, Automated
ABSTRACT
Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
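The zero-shot idea described above — ask the LLM for strict JSON and validate the reply — can be sketched as follows. The field names, prompt wording, and the stubbed reply are illustrative assumptions, not the study's actual schema or GPT-4 output; a real run would replace the stub with an API call.

```python
import json

# Hypothetical report schema; the study's actual fields are not reproduced here.
FIELDS = ["diagnosis", "pT_stage", "pN_stage", "grade"]

def build_prompt(report_text: str) -> str:
    """Compose a zero-shot prompt asking the model for strict JSON output."""
    return (
        "Extract the following fields from the pathology report below and "
        f"answer with a single JSON object with exactly these keys: {FIELDS}. "
        "Use null when a field is not mentioned.\n\n"
        f"Report:\n{report_text}"
    )

def parse_reply(reply: str) -> dict:
    """Validate the model reply: parse JSON and cover every requested field."""
    data = json.loads(reply)
    return {k: data.get(k) for k in FIELDS}

# Stubbed model reply, standing in for an actual GPT-4 API call.
reply = ('{"diagnosis": "adenocarcinoma", "pT_stage": "pT3", '
         '"pN_stage": "pN1", "grade": "G2"}')
structured = parse_reply(reply)
print(structured["pT_stage"])  # pT3
```

Parsing into a fixed schema is what makes the output comparable against human-generated structured data, as the concordance evaluation requires.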
Subject(s)
Glioblastoma , Precision Medicine , Humans , Machine Learning , United Kingdom
ABSTRACT
Biomedical Named Entity Recognition (BioNER) is one of the most basic tasks in biomedical text mining, which aims to automatically identify and classify biomedical entities in text. Recently, deep learning-based methods have been applied to biomedical named entity recognition and have shown encouraging results. However, many biological entities are polysemous and ambiguous, which is one of the main obstacles to the task. Deep learning methods also require large amounts of training data, so the lack of data likewise affects model recognition performance. To address polysemy and insufficient data in biomedical named entity recognition, we propose a multi-task learning framework fused with a language model, based on the BiLSTM-CRF architecture. Our model uses a language model to design a differential encoding of the context, which yields dynamic word vectors that distinguish words across different datasets. Moreover, we use a multi-task learning method to collectively share the dynamic word vectors of different types of entities to improve the recognition performance for each entity type. Experimental results show that our model reduces the false positives caused by polysemous words through differentiated encoding, and improves the performance of each subtask by sharing information between different entity data. Compared with other state-of-the-art methods, our model achieved superior results on four typical datasets and the best results in F1 score.
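The CRF layer in a BiLSTM-CRF tagger picks the highest-scoring tag *sequence* rather than tagging each token independently, which is what prevents invalid label chains like O → I-GENE. A pure-Python sketch of Viterbi decoding over emission and transition scores (toy numbers, not trained weights):

```python
def viterbi(emissions, transitions, tags):
    """Return the best-scoring tag sequence for one sentence.

    emissions:   list of {tag: score} dicts, one per token (BiLSTM outputs)
    transitions: {(prev_tag, tag): score} CRF transition scores
    """
    # best[t] = score of the best path ending in tag t; back holds backpointers
    best = {t: emissions[0][t] for t in tags}
    back = []
    for emit in emissions[1:]:
        nxt, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, t)])
            nxt[t] = best[prev] + transitions[(prev, t)] + emit[t]
            ptr[t] = prev
        back.append(ptr)
        best = nxt
    # follow backpointers from the best final tag
    last = max(tags, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["O", "B-GENE", "I-GENE"]
# Transition table that penalizes I-GENE directly after O.
trans = {(p, t): -10.0 if (p, t) == ("O", "I-GENE") else 0.0
         for p in tags for t in tags}
emissions = [{"O": 0.1, "B-GENE": 2.0, "I-GENE": 0.5},
             {"O": 0.2, "B-GENE": 0.1, "I-GENE": 1.5},
             {"O": 1.8, "B-GENE": 0.0, "I-GENE": 0.3}]
print(viterbi(emissions, trans, tags))  # ['B-GENE', 'I-GENE', 'O']
```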
Subject(s)
Data Mining , Deep Learning , Data Mining/methods , Humans , Natural Language Processing , Neural Networks, Computer , Language
ABSTRACT
Biomedical event causal relation extraction (BECRE), as a subtask of biomedical information extraction, aims to extract event causal relation facts from unstructured biomedical texts and plays an essential role in many downstream tasks. Existing works have two main problems: (i) shallow features are of limited help in establishing potential relationships between biomedical events; (ii) using traditional oversampling to address the data imbalance of BECRE tasks ignores the need for data diversity. This paper proposes a novel biomedical event causal relation extraction method that solves the above problems using deep knowledge fusion and RoBERTa-based data augmentation. To address the first problem, we fuse deep knowledge, including structural event representations and entity relation paths, to establish potential semantic connections between biomedical events. We use a Graph Convolutional Network (GCN) and the predicated tensor model to acquire structural event representations, and entity relation paths are encoded based on external knowledge bases (GTD, CDR, CHR, GDA and UMLS). We introduce a triplet attention mechanism to fuse the structural event representations with the entity relation path information. To address the second problem, this paper proposes a RoBERTa-based data augmentation method: some words of the biomedical text, excluding biomedical events, are masked proportionally and at random, and the pre-trained RoBERTa model then generates data instances for the imbalanced BECRE dataset. Extensive experimental results on the Hahn-Powell and BioCause datasets confirm that the proposed method achieves state-of-the-art performance compared to current advances.
ABSTRACT
PURPOSE: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers. METHODS: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. This system is designed to discern and illustrate relationships between biomedical entities obtained through automated information extraction. RESULTS: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images. CONCLUSION: GPDMiner offers notable additional functionality among the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management.
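GPDMiner's own retrieval code is not shown in the abstract, but programmatic PubMed search of the kind it describes typically goes through NCBI's E-utilities. A minimal stdlib-only sketch that builds an ESearch URL (no network call is made here; fetching and parsing the response would be the next step):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(query: str, retmax: int = 20) -> str:
    """Build an NCBI E-utilities ESearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax,
              "retmode": "json"}
    return f"{EUTILS}?{urlencode(params)}"

url = esearch_url("BRCA1 AND breast neoplasms[MeSH]")
print(url)
```

The JSON response of this endpoint lists matching PMIDs, which a pipeline would then pass to EFetch to retrieve abstracts for entity and relation extraction.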
Subject(s)
Data Management , Data Mining , Databases, Factual , Knowledge Discovery , PubMed
ABSTRACT
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86), with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by the fine-tuned transformers in terms of F1-score and recall, demonstrating their generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
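The two-part annotation pipeline — dictionary matching plus rule-based keyword searching — can be sketched in a few lines. The tiny lexicon and the "-ase" suffix rule below are illustrative stand-ins for the study's much larger dictionary and rule set:

```python
import re

# Toy enzyme dictionary; the study's full lexicon is much larger.
ENZYME_DICT = {"trypsin", "pepsin", "dna polymerase"}
# Rule: words ending in "-ase" are enzyme candidates (e.g. kinase, lactase).
ASE_RULE = re.compile(r"\b[A-Za-z-]+ase\b")

def annotate(text: str):
    """Return sorted (start, end, surface) spans for enzyme mentions."""
    spans = set()
    lower = text.lower()
    for name in ENZYME_DICT:                    # dictionary matching
        for m in re.finditer(re.escape(name), lower):
            spans.add((m.start(), m.end(), text[m.start():m.end()]))
    for m in ASE_RULE.finditer(text):           # rule-based keyword search
        spans.add((m.start(), m.end(), m.group()))
    return sorted(spans)

print(annotate("Trypsin cleaves proteins; hexokinase phosphorylates glucose."))
```

Such a pipeline trades recall for precision (here the reported precision was 1.00 at recall 0.76), which is exactly the profile the fine-tuned transformers later improved on.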
Subject(s)
Algorithms , Deep Learning , Enzymes , Natural Language Processing , Molecular Sequence Annotation/methods , Humans , Data Mining/methods
ABSTRACT
Morphology, crystal phase, and its transformation are important structures that frequently determine electrocatalytic activity, but the correlations of intrinsic activity with them are not completely understood. Herein, using Co(OH)2 micro-platelets with well-defined structures (phase, thickness, area, and volume) as model electrocatalysts of the oxygen evolution reaction, multiple in situ microscopy techniques are combined to correlate the electrocatalytic activity with morphology, phase, and its transformation. Single-entity morphology and electrochemistry characterized by atomic force microscopy and scanning electrochemical cell microscopy reveal a thickness-dependent turnover frequency (TOF) of α-Co(OH)2. The TOF (≈9.5 s-1) of α-Co(OH)2 with ≈14 nm thickness is ≈95-fold higher than that (≈0.1 s-1) with ≈80 nm. Moreover, this thickness-dependent activity has a critical thickness of ≈30 nm, above which no thickness dependence is observed. In contrast, β-Co(OH)2 shows a lower TOF (≈0.1 s-1) with no significant correlation with thickness. Combining single-entity electrochemistry with in situ Raman microspectroscopy, this thickness-dependent activity is explained by more reversible Co3+/Co2+ kinetics and a larger ratio of active Co sites in thinner α-Co(OH)2, accompanied by faster phase transformation and more extensive surface restructuring. The findings highlight the interactions among thickness, ratio of active sites, kinetics of active sites, and phase transformation, and offer new insights into structure-activity relationships at the single-entity level.
ABSTRACT
Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of 'near-drug' properties is still a challenging task. In this work, we propose a multimodal chemical information reconstruction system to automatically process, extract and align heterogeneous information from the text descriptions and structural images of chemical patents. Our key innovation lies in a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units can then learn to both recognize heterogeneous chemical entities and simultaneously capture their correspondence. In particular, we collected chemical structures from the ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office using the keywords 'A61P, compound, structure' in the years from 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. Based on the reconstructed results and substituent replacement rules, structural libraries of a huge number of near-drug compounds can be generated automatically. In quantitative evaluations, our model correctly reconstructs 97% of the molecular images into structured format and achieves an F1-score of around 97-98% in the recognition of chemical entities, demonstrating its effectiveness in automatic information extraction from chemical patents and its promise for transforming them into a user-friendly, structured molecular database that enriches the near-drug space and enables intelligent retrieval of chemical knowledge.
Subject(s)
Data Mining , Databases, Chemical , Data Mining/methods , Databases, Factual , Drug Discovery
ABSTRACT
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Subject(s)
Algorithms , Data Mining , Proteins , PubMed
ABSTRACT
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data create serious challenges for biomedical researchers trying to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and easy redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction in biomedicine. The joint BioNER and BioRE models are discussed in light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Subject(s)
Deep Learning , Data Mining/methods
ABSTRACT
Despite global efforts on meeting sustainable development goals by 2030, persistent and widespread sanitation deficits in rural, underserved communities in high-income countries, including the United States (US), challenge achieving this target. The recent US federal infrastructure funding, coupled with research efforts to explore innovative, alternative decentralized wastewater systems, are unprecedented opportunities for addressing basic sanitation gaps in these communities. Yet, understanding how to best manage these systems for sustainable operations and maintenance (O&M) is still a national need. Here, we develop an integrated management approach for achieving such sustainable systems, taking into account the utility structure, operational aspects, and possible barriers impeding effective management of decentralized wastewater infrastructure. We demonstrate this approach through a binomial logistic regression of survey responses from 114 public and private management entities (e.g., water and sewer utilities) operating in 27 states in the US, targeting the rural Alabama Black Belt wastewater issues. Our assessment introduces policy areas that support sustainable decentralized wastewater systems management and operations, including privatizing water-wastewater infrastructure systems, incentivizing/mandating the consolidation of utility management of these systems, federally funding the O&M, and developing and retaining water-wastewater workforce in rural, underserved communities. Our discussions give rise to a holistic empirical understanding of effective management of decentralized wastewater infrastructure for rural, underserved communities in the US, thereby contributing to global conversations on sustainable development.
Subject(s)
Rural Population , Sustainable Development , Wastewater , Alabama , Humans , Sanitation
ABSTRACT
BACKGROUND: Tumor embolism is a very rare primary manifestation of cancers and the diagnosis is challenging, especially if located in the pulmonary arteries, where it can mimic nonmalignant pulmonary embolism. Intimal sarcoma is one of the least commonly reported primary tumors of vessels with only a few cases reported worldwide. A typical location of this malignancy is the pulmonary artery. Herein, we present a case report of an intimal sarcoma with primary manifestation in the pulmonary arteries. A 53-year-old male initially presented with dyspnea. On imaging, a pulmonary artery embolism was detected and was followed by thrombectomy of the right ventricular outflow tract, main pulmonary artery trunk, and right pulmonary artery after ineffective lysis therapy. Complementary imaging of the chest and abdomen including a PET-CT scan demonstrated no evidence of a primary tumor. Subsequent pathology assessment suggested an intimal sarcoma, further confirmed by DNA methylation-based molecular analysis. We initiated adjuvant chemotherapy with doxorubicin. Four months after the completion of adjuvant therapy a follow-up scan revealed a local recurrence without distant metastases. DISCUSSION: Primary pulmonary artery intimal sarcoma (PAS) is an exceedingly rare entity and pathological diagnosis remains challenging. Therefore, the detection of entity-specific molecular alterations is a supporting argument in the diagnostic spectrum. Complete surgical resection is the prognostically most important treatment for intimal cardiac sarcomas. Despite adjuvant chemotherapy, the prognosis of cardiac sarcomas remains very poor. This case of a PAS highlights the difficulty in establishing a diagnosis and the aggressive natural course of the disease. CONCLUSION: In case of atypical presentation of a pulmonary embolism, a tumor originating from the great vessels should be considered. Molecular pathology techniques help establish a reliable diagnosis.
Subject(s)
Pulmonary Artery , Sarcoma , Thrombosis , Humans , Male , Middle Aged , Pulmonary Artery/pathology , Sarcoma/diagnosis , Sarcoma/pathology , Tunica Intima/pathology , Vascular Neoplasms/diagnosis , Vascular Neoplasms/pathology , Pulmonary Embolism/diagnosis , Diagnosis, Differential
ABSTRACT
OBJECTIVE: Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Different from prior methods that mainly use original input sequences for sequence labeling, the model takes into account additional contexts to enhance the representation of entities in the original sequences, since additional contexts can provide enhanced information for the concept explanation of biomedical entities. METHODS: To exploit an additional context, given an original input sequence, the model first retrieves the relevant sentences from PubMed and then ranks the retrieved sentences to form the contexts. It next combines the context with the original input sequence to form a new enhanced sequence. The original and new enhanced sequences are fed into PubMedBERT for learning feature representation. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is done by using a CRF layer. The model is jointly trained in an end-to-end manner to take advantage of the additional context for NER of the original sequence. RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirms the contribution of additional contexts for bio NER. CONCLUSION: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of the recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information of bio NER. Finally, more relevant sentences from the context are more beneficial than irrelevant ones to provide enhanced information for the original input sequences. The model is flexible to integrate any additional context types for the NER task.
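The retrieve-then-rank step described above can be sketched with a simple lexical ranker. The real system retrieves sentences from PubMed and uses a learned ranking model; the token-overlap scorer and the `[SEP]` concatenation below are illustrative stand-ins:

```python
def rank_contexts(query: str, candidates: list[str], k: int = 2) -> list[str]:
    """Rank retrieved sentences by token overlap with the input sequence."""
    q = set(query.lower().split())
    def overlap(sent: str) -> int:
        return len(q & set(sent.lower().split()))
    return sorted(candidates, key=overlap, reverse=True)[:k]

query = "BRCA1 mutations increase breast cancer risk"
retrieved = [
    "BRCA1 is a tumour suppressor gene linked to breast cancer.",
    "The weather was mild in spring.",
    "Mutations in BRCA1 impair DNA repair.",
]
context = " ".join(rank_contexts(query, retrieved))
# The enhanced sequence pairs the original input with its ranked context,
# as fed to the encoder (PubMedBERT in the paper).
enhanced = f"{query} [SEP] {context}"
print(enhanced)
```

The point made in the conclusion — relevant context sentences help while irrelevant ones do not — is exactly why a ranking step precedes concatenation.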
Subject(s)
Natural Language Processing , PubMed , Humans , Algorithms , Data Mining/methods , Semantics , Medical Informatics/methods
ABSTRACT
OBJECTIVE: Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS: We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS: We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION: This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
ABSTRACT
OBJECTIVE: As new knowledge is produced at a rapid pace in the biomedical field, existing biomedical Knowledge Graphs (KGs) cannot be manually updated in a timely manner. Previous work in Natural Language Processing (NLP) has leveraged link prediction to infer the missing knowledge in general-purpose KGs. Inspired by this, we propose to apply link prediction to existing biomedical KGs to infer missing knowledge. Although Knowledge Graph Embedding (KGE) methods are effective in link prediction tasks, they are less capable of capturing relations between communities of entities with specific attributes (Fanourakis et al., 2023). METHODS: To address this challenge, we proposed an entity distance-based method for abstracting a Community Knowledge Graph (CKG) from a simplified version of the pre-existing PubMed Knowledge Graph (PKG) (Xu et al., 2020). For link prediction on the abstracted CKG, we proposed an extension approach for the existing KGE models by linking the information in the PKG to the abstracted CKG. The applicability of this extension was proved by employing six well-known KGE models: TransE, TransH, DistMult, ComplEx, SimplE, and RotatE. Evaluation metrics including Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@k were used to assess the link prediction performance. In addition, we presented a backtracking process that traces the results of CKG link prediction back to the PKG scale for further comparison. RESULTS: Six different CKGs were abstracted from the PKG by using embeddings of the six KGE methods. The results of link prediction in these abstracted CKGs indicate that our proposed extension can improve the existing KGE methods, achieving a top-10 accuracy of 0.69 compared to 0.5 for TransE, 0.7 compared to 0.54 for TransH, 0.67 compared to 0.6 for DistMult, 0.73 compared to 0.57 for ComplEx, 0.73 compared to 0.63 for SimplE, and 0.85 compared to 0.76 for RotatE on their CKGs, respectively. 
These improved performances also highlight the wide applicability of the extension approach. CONCLUSION: This study offered novel insights into abstracting CKGs from the PKG. The extension approach enhanced the performance of the existing KGE methods and demonstrated broad applicability. As an interesting future extension, we plan to conduct link prediction for entities that are newly introduced to the PKG.
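Of the KGE models evaluated above, TransE is the simplest: it embeds a triple (h, r, t) so that h + r ≈ t and scores candidates by distance. A toy pure-Python sketch of tail prediction by ranking (the 2-d embeddings and entity names are made up for illustration, not taken from the PKG):

```python
def score(h, r, t):
    """TransE score: negative L1 distance ||h + r - t||; higher is better."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

# Toy 2-d embeddings; real KGE models learn these from the graph.
entities = {"aspirin": [0.0, 1.0], "headache": [1.0, 2.0], "fever": [3.0, 0.0]}
relations = {"treats": [1.0, 1.0]}

def predict_tail(head, rel, k=2):
    """Rank all entities as candidate tails for the query (head, rel, ?)."""
    h, r = entities[head], relations[rel]
    ranked = sorted(entities,
                    key=lambda e: score(h, r, entities[e]), reverse=True)
    return ranked[:k]

print(predict_tail("aspirin", "treats"))
```

Metrics such as MR, MRR, and Hits@k used in the study are computed from exactly this kind of ranked candidate list.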
Subject(s)
Natural Language Processing , PubMed , Algorithms , Humans , Data Mining/methods , Knowledge Bases
ABSTRACT
Electronic health records (EHRs) have been widely used and are gradually replacing paper records. Therefore, extracting valuable information from EHRs has become a focus of current research. Clinical named entity recognition (CNER) is an important task in information extraction. Most current research methods use standard supervised learning approaches to fine-tune pre-trained language models (PLMs), which require a large amount of annotated data for model training. However, in realistic medical scenarios, annotated data are scarce, and the process of annotating data in real clinical settings is time-consuming and labour-intensive. In this paper, a language inference-based learning method (LANGIL) is proposed to study clinical NER tasks with limited annotated samples, i.e., in low-resource clinical scenarios. A method based on prompt learning is designed to reformulate the entity recognition task as a language inference-based task. Differing from the standard fine-tuning method, the approach introduced in this paper does not add network layers that must be trained from scratch. This narrows the gap between pre-training tasks and downstream tasks, allowing the comprehension capabilities of PLMs to be leveraged under the condition of limited training samples. Experiments on four Chinese clinical named entity recognition datasets showed that LANGIL achieves significant improvements in F1-score compared to previous methods.
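The reformulation idea — turn "is this span a TYPE entity?" into a language-inference query scored by the PLM itself — can be sketched as below. The English templates and the stubbed entailment scorer are hypothetical illustrations; LANGIL's actual (Chinese) templates and model are not reproduced here.

```python
# Hypothetical verbalizer templates, one hypothesis per entity type.
TEMPLATES = {
    "Disease": "{span} is a disease entity.",
    "Drug": "{span} is a drug entity.",
}

def build_pairs(sentence: str, span: str):
    """Premise/hypothesis pairs to be scored by a pre-trained LM."""
    return [(sentence, tmpl.format(span=span), label)
            for label, tmpl in TEMPLATES.items()]

def classify(sentence, span, entail_score):
    """Pick the entity type whose hypothesis the model rates most entailed."""
    pairs = build_pairs(sentence, span)
    best = max(pairs, key=lambda p: entail_score(p[0], p[1]))
    return best[2]

# Stub scorer standing in for a PLM's entailment probability.
stub = lambda premise, hyp: (0.9 if "disease" in hyp and "diabetes" in premise
                             else 0.1)
print(classify("The patient has diabetes.", "diabetes", stub))  # Disease
```

Because the PLM's existing inference head does the scoring, no new task-specific layers need to be trained from scratch, which is the key to the low-resource setting.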
Subject(s)
Information Storage and Retrieval , Language , Horses , Animals , Electronic Health Records , Natural Language Processing , China
ABSTRACT
Document-level relation triplet extraction is crucial in biomedical text mining, aiding in drug discovery and the construction of biomedical knowledge graphs. Current language models face challenges in generalizing to unseen datasets and relation types in biomedical relation triplet extraction, which limits their effectiveness in these crucial tasks. To address this challenge, our study optimizes models from two critical dimensions: data-task relevance and granularity of relations, aiming to enhance their generalization capabilities significantly. We introduce a novel progressive learning strategy to obtain the PLRTE model. This strategy not only enhances the model's capability to comprehend diverse relation types in the biomedical domain but also implements a structured four-level progressive learning process through semantic relation augmentation, compositional instruction and dual-axis level learning. Our experiments on the DDI and BC5CDR document-level biomedical relation triplet datasets demonstrate a significant performance improvement of 5% to 20% over the current state-of-the-art baselines. Furthermore, our model exhibits exceptional generalization capabilities on the unseen Chemprot and GDA datasets, further validating the effectiveness of optimizing data-task association and relation granularity for enhancing model generalizability.
ABSTRACT
BACKGROUND: An adverse drug event (ADE) is any unfavorable effect that occurs due to the use of a drug. Extracting ADEs from unstructured clinical notes is essential to biomedical text extraction research because it helps with pharmacovigilance and patient medication studies. OBJECTIVE: From the considerable amount of clinical narrative text, natural language processing (NLP) researchers have developed methods for extracting ADEs and their related attributes. This work presents a systematic review of current methods. METHODOLOGY: Two biomedical databases have been searched from June 2022 until December 2023 for relevant publications regarding this review, namely the databases PubMed and Medline. Similarly, we searched the multi-disciplinary databases IEEE Xplore, Scopus, ScienceDirect, and the ACL Anthology. We adopted the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement guidelines and recommendations for reporting systematic reviews in conducting this review. Initially, we obtained 5,537 articles from the search results from the various databases between 2015 and 2023. Based on predefined inclusion and exclusion criteria for article selection, 100 publications have undergone full-text review, of which we consider 82 for our analysis. RESULTS: We determined the general pattern for extracting ADEs from clinical notes, with named entity recognition (NER) and relation extraction (RE) being the dual tasks considered. Researchers that tackled both NER and RE simultaneously have approached ADE extraction as a "pipeline extraction" problem (n = 22), as a "joint task extraction" problem (n = 7), and as a "multi-task learning" problem (n = 6), while others have tackled only NER (n = 27) or RE (n = 20). 
We further grouped the reviewed studies based on the approaches for data extraction, namely rule-based (n = 8), machine learning (n = 11), deep learning (n = 32), comparison of two or more approaches (n = 11), hybrid (n = 12) and large language models (n = 8). The most used datasets are MADE 1.0, TAC 2017 and n2c2 2018. CONCLUSION: Extracting ADEs is crucial, especially for pharmacovigilance studies and patient medication research. This survey showcases advances in ADE extraction research, including approaches, datasets and state-of-the-art performance. Challenges and future research directions are highlighted. We hope this review will guide researchers in gaining background knowledge and developing more innovative ways to address the challenges.
Subject(s)
Drug-Related Side Effects and Adverse Reactions; Natural Language Processing; Humans; Data Mining/methods; Machine Learning; Pharmacovigilance; Electronic Health Records; Databases, Factual; Deep Learning
ABSTRACT
OBJECTIVE: The primary objective of this review is to investigate the effectiveness of machine learning and deep learning methodologies for extracting adverse drug events (ADEs) from clinical benchmark datasets. We conduct an in-depth analysis comparing the merits and drawbacks of machine learning and deep learning techniques, particularly within the framework of named-entity recognition (NER) and relation classification (RC) tasks for ADE extraction. Additionally, we examine specific features and their impact on the overall performance of these methodologies. More broadly, our review extends to ADE extraction from various sources, including biomedical literature, social media data, and drug labels, rather than limiting the scope exclusively to machine learning or deep learning methods. METHODS: We conducted an extensive literature review on PubMed using the query "(((machine learning [Medical Subject Headings (MeSH) Terms]) OR (deep learning [MeSH Terms])) AND (adverse drug event [MeSH Terms])) AND (extraction)", and supplemented this with a snowballing approach to review 275 references sourced from the retrieved articles. RESULTS: We included twelve articles in our analysis. For the NER task, deep learning models outperformed machine learning models. In the RC task, gradient boosting, multilayer perceptron, and random forest models excelled. The Bidirectional Encoder Representations from Transformers (BERT) model consistently achieved the best performance in the end-to-end task. Future efforts in the end-to-end task should prioritize improving NER accuracy, especially for the 'ADE' and 'Reason' entity types. CONCLUSION: These findings hold significant implications for advancing the field of ADE extraction and pharmacovigilance, ultimately contributing to improved drug safety monitoring and healthcare outcomes.
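The Boolean MeSH query in the METHODS section can be reproduced programmatically. The sketch below composes the query from its clauses and builds an NCBI E-utilities ESearch URL; no request is actually sent, and the uniform `[MeSH Terms]` qualifier and `retmax` value are assumptions for illustration rather than details from the review.

```python
from urllib.parse import urlencode

# The review's PubMed query, composed from its three MeSH clauses.
clauses = [
    "(machine learning[MeSH Terms]) OR (deep learning[MeSH Terms])",
    "adverse drug event[MeSH Terms]",
    "extraction",
]
term = f"(({clauses[0]}) AND ({clauses[1]})) AND ({clauses[2]})"

# Build an NCBI E-utilities ESearch URL (constructed only; no request is made).
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = base + "?" + urlencode({"db": "pubmed", "term": term, "retmax": 100})
```

Fetching `url` with any HTTP client would return PubMed IDs matching the query, which could then seed the snowballing step described above.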
Subject(s)
Deep Learning; Drug-Related Side Effects and Adverse Reactions; Humans; Artificial Intelligence; Pharmacovigilance; Benchmarking; Natural Language Processing
ABSTRACT
Health mindsets refer to beliefs about the malleability (growth mindset) versus stability (fixed mindset) of physical health and have gained traction as a predictor of health beliefs and behaviors. Across two studies, we tested whether health mindsets were associated with avoiding personalized health risk information. In Study 2, we also tested whether the conceptually related constructs of internal and chance health locus of control, health self-efficacy, fatalism, and genetic determinism were associated with information avoidance. Health mindsets were manipulated in Study 1 (college students, n = 284; 79.58% female; mean age = 19.74) and measured in Study 2 (participants recruited through MTurk, n = 735; 42.04% female; mean age = 35.78). In both studies, participants viewed a prediabetes infographic and were informed they could learn their prediabetes risk by completing an online risk calculator. Behavioral obligation was also manipulated in both studies to test whether an additional behavioral requirement associated with learning one's risk would exacerbate any negative impact of health mindsets on avoidance rates. All participants then indicated their interest in learning their prediabetes risk (avoidance intentions) and decided whether to complete the online risk calculator (avoidance behavior). In Study 1, health mindsets, behavioral obligation, and their interaction had no impact on avoidance intentions or behavior. Study 2 similarly provided no consistent evidence for an association of health mindsets, behavioral obligation, or their interaction with avoidance intentions or behavior. However, in Study 2, internal health locus of control was consistently associated with both intentions and behavior. Health information avoidance may be a barrier to prevention and early detection of disease.
To encourage individuals to learn potentially important health information, public health interventions might seek to strengthen people's beliefs that their own actions play a role in their health outcomes. Interventions might also build people's knowledge and skills for improving their health outcomes, which may in turn influence health locus of control beliefs.