Results 1 - 13 of 13
1.
Bioinformatics ; 39(4), 2023 04 03.
Article in English | MEDLINE | ID: mdl-37004189

ABSTRACT

MOTIVATION: This article describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian and a smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. RESULTS: NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO has the following specific features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results. AVAILABILITY AND IMPLEMENTATION: The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.


Subjects
Natural Language Processing , Semantics , PubMed , Language
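
For the NEREL-BIO record above, the idea of nested named entities can be illustrated with a minimal sketch. It assumes entities are stored as token-offset spans with an exclusive end; the labels, the example phrase, and the helper name `nested_pairs` are hypothetical and are not taken from the NEREL-BIO guidelines.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); labels are illustrative placeholders
Span = Tuple[int, int, str]


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies fully inside a longer outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if inner is outer:
                continue
            contains = outer[0] <= inner[0] and inner[1] <= outer[1]
            strictly_longer = (outer[1] - outer[0]) > (inner[1] - inner[0])
            if contains and strictly_longer:
                pairs.append((outer, inner))
    return pairs


# Hypothetical annotation of the phrase "lung cancer treatment"
tokens = ["lung", "cancer", "treatment"]
spans = [(0, 3, "PROCEDURE"), (0, 2, "DISEASE"), (0, 1, "ANATOMY")]
pairs = nested_pairs(spans)

for outer, inner in pairs:
    print(" ".join(tokens[outer[0]:outer[1]]), "contains", " ".join(tokens[inner[0]:inner[1]]))
```

A flat tagging scheme would keep only one of these three spans, which is one reason nested annotation calls for different models than standard sequence labeling.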
2.
J Biomed Inform ; 149: 104555, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38008241

ABSTRACT

The COVID-19 pandemic has sparked numerous discussions on social media platforms, with users sharing their views on topics such as mask-wearing and vaccination. To facilitate the evaluation of neural models for stance detection and premise classification, we organized the Social Media Mining for Health (SMM4H) 2022 Shared Task 2. This competition utilized manually annotated posts on three COVID-19-related topics: school closures, stay-at-home orders, and wearing masks. In this paper, we extend the previous work and present newly collected data on vaccination from Twitter to assess the performance of models on a different topic. To enhance the accuracy and effectiveness of our evaluation, we employed various strategies to aggregate tweet texts with claims, including models with feature-level (early) fusion and dual-view architectures from the SMM4H 2022 Task 2 leaderboard. Our primary objective was to create a valuable dataset and perform an extensive experimental evaluation to support future research in argument mining in the health domain.


Subjects
COVID-19 , Social Media , Humans , Pandemics , Data Mining , Data Collection
3.
Bioinformatics ; 37(21): 3856-3864, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34213526

ABSTRACT

MOTIVATION: Clinical trials are an essential stage of every drug development program before a treatment can become available to patients. Despite the importance of well-structured clinical trial databases and their tremendous value for drug discovery and development, such databases are very rare. Presently, large-scale information on clinical trials is stored in clinical trial registers which are relatively structured, but the mappings to external databases of drugs and diseases are increasingly lacking. The precise production of such links would enable us to interrogate richer harmonized datasets for invaluable insights. RESULTS: We present a neural approach for medical concept normalization of diseases and drugs. Our two-stage approach is based on Bidirectional Encoder Representations from Transformers (BERT). In the training stage, we optimize the relative similarity of mentions and concept names from a terminology via triplet loss. In the inference stage, we obtain the closest concept name representation in a common embedding space to a given mention representation. We performed a set of experiments on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. The latter includes mentions associated with one or more concepts (in-KB) or zero (out-of-KB, nil prediction). Experiments show that our approach significantly outperforms baseline and state-of-the-art architectures. Moreover, we demonstrate that our approach is effective in knowledge transfer from the scientific literature to clinical trial data. AVAILABILITY AND IMPLEMENTATION: We make code and data freely available at https://github.com/insilicomedicine/DILBERT.


Subjects
Drug Development , Publications , Humans , Drug Discovery
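
The two stages described in the record above (triplet-loss training, then nearest-concept lookup in a shared embedding space) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the margin of 1.0, the toy 2-D vectors, and the concept identifiers are all assumptions, and a real system would obtain embeddings from a BERT encoder.

```python
import math
from typing import Dict, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def nearest_concept(mention_vec: Sequence[float],
                    concept_vecs: Dict[str, Sequence[float]]) -> str:
    """Inference step: return the id of the concept whose vector is closest to the mention."""
    return min(concept_vecs, key=lambda cid: euclidean(mention_vec, concept_vecs[cid]))


# Toy embeddings with hypothetical concept ids
concepts = {"C_HEADACHE": [1.0, 0.0], "C_NAUSEA": [0.0, 1.0]}
mention = [0.9, 0.1]  # pretend embedding of the mention "head pain"

loss = triplet_loss(mention, concepts["C_HEADACHE"], concepts["C_NAUSEA"])
best = nearest_concept(mention, concepts)
print(best, loss)
```

The abstract does not say how out-of-KB (nil) mentions are decided; one common option, stated here only as an assumption, is to return nil when the nearest distance exceeds a tuned threshold.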
4.
Bioinformatics ; 37(2): 243-249, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32722774

ABSTRACT

MOTIVATION: Drugs and diseases play a central role in many areas of biomedical research and healthcare. Aggregating knowledge about these entities across a broader range of domains and languages is critical for information extraction (IE) applications. To facilitate text mining methods for analysis and comparison of patients' health conditions and adverse drug reactions reported on the Internet with traditional sources such as drug labels, we present a new corpus of Russian language health reviews. RESULTS: The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. The corpus itself consists of two parts, the raw one and the labeled one. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labeled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Labels for sentences include health-related issues or their absence. Sentences containing such issues are additionally labeled at the expression level for identification of fine-grained subtypes such as drug classes and drug forms, drug indications and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multilabel sentence classification tasks on this corpus. The macro F1 score of 74.85% in the NER task was achieved by our RuDR-BERT model. For the sentence classification task, our model achieves a macro F1 score of 68.82%, gaining 7.47% over the score of a BERT model trained on Russian data. AVAILABILITY AND IMPLEMENTATION: We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug-Related Side Effects and Adverse Reactions , Pharmaceutical Preparations , Data Mining , Humans , Language , Russia
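
The record above reports macro F1 scores. As a reminder of what that metric averages, here is a minimal label-level sketch: F1 is computed per class and the per-class values are averaged with equal weight. The labels and predictions are invented for illustration, and the paper's NER evaluation may be entity-level rather than label-level, so this shows the general metric only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted labels
gold = ["ADR", "ADR", "DI", "DI", "O", "O"]
pred = ["ADR", "DI", "DI", "DI", "O", "ADR"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because every class counts equally, a rare class with poor F1 lowers the macro average as much as a frequent one would, which is why it is often preferred on imbalanced label sets.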
5.
J Biomed Inform ; 135: 104182, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36184069

ABSTRACT

In this paper, we focus on the classification of tweets as sources of potential signals for adverse drug effects (ADEs) or drug reactions (ADRs). Following the intuition that text and drug structure representations are complementary, we introduce a multimodal model with two components. These components are state-of-the-art BERT-based models for language understanding and molecular property prediction. Experiments were carried out on multilingual benchmarks of the Social Media Mining for Health Research and Applications (#SMM4H) initiative. Our models obtained state-of-the-art results of 0.61 F1-measure and 0.57 F1-measure on #SMM4H 2021 Shared Tasks 1a and 2 in English and Russian, respectively. On the classification of French tweets from SMM4H 2020 Task 1, our approach pushes the state of the art by an absolute gain of 8% F1. Our experiments show that the molecular information obtained from neural networks is more beneficial for ADE classification than traditional molecular descriptors. The source code for our models is freely available at https://github.com/Andoree/smm4h_2021_classification.


Subjects
Drug-Related Side Effects and Adverse Reactions , Social Media , Humans , Data Mining/methods , Natural Language Processing , Neural Networks, Computer
6.
J Biomed Inform ; 103: 103382, 2020 03.
Article in English | MEDLINE | ID: mdl-32028051

ABSTRACT

Relation extraction aims to discover relational facts about entity mentions from plain texts. In this work, we focus on clinical relation extraction; namely, given a medical record with mentions of drugs and their attributes, we identify relations between these entities. We propose a machine learning model with a novel set of knowledge-based and BioSentVec embedding features. We systematically investigate the impact of these features with standard distance- and word-based features, conducting experiments on two benchmark datasets of clinical texts from MADE 2018 and n2c2 2018 shared tasks. For comparison with the feature-based model, we utilize state-of-the-art models and three BERT-based models, including BioBERT and Clinical BERT. Our results demonstrate that distance and word features provide significant benefits to the classifier. Knowledge-based features improve classification results only for particular types of relations. The sentence embedding feature provides the largest improvement in results, among other explored features on the MADE corpus. The classifier obtains state-of-the-art performance in clinical relation extraction with F-measure of 92.6%, improving F-measure by 3.5% on the MADE corpus.


Subjects
Knowledge Bases , Machine Learning , Language , Natural Language Processing
7.
J Biomed Inform ; 84: 93-102, 2018 08.
Article in English | MEDLINE | ID: mdl-29906585

ABSTRACT

Text mining of scientific libraries and social media has already proven itself as a reliable tool for drug repurposing and hypothesis generation. The task of mapping a disease mention to a concept in a controlled vocabulary, typically to the standard thesaurus in the Unified Medical Language System (UMLS), is known as medical concept normalization. This task is challenging due to the differences in the use of medical terminology between health care professionals and social media texts coming from the lay public. To bridge this gap, we use sequence learning with recurrent neural networks and semantic representation of one- or multi-word expressions: we develop end-to-end architectures directly tailored to the task, including bidirectional Long Short-Term Memory, Gated Recurrent Units with an attention mechanism, and additional semantic similarity features based on UMLS. Our evaluation against a standard benchmark shows that recurrent neural networks improve results over an effective baseline for classification based on convolutional neural networks. A qualitative examination of mentions discovered in a dataset of user reviews collected from popular online health information platforms as well as a quantitative evaluation both show improvements in the semantic representation of health-related expressions in social media.


Subjects
Data Mining/methods , Medical Informatics/methods , Natural Language Processing , Neural Networks, Computer , Social Media , Unified Medical Language System , Linguistics , Pharmaceutical Preparations , Probability , Semantics , Social Networking
8.
Chem Sci ; 15(22): 8380-8389, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38846388

ABSTRACT

Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.

9.
Clin Pharmacol Ther ; 114(5): 972-980, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37483175

ABSTRACT

Drug discovery and development is a notoriously risky process with high failure rates at every stage, including disease modeling, target discovery, hit discovery, lead optimization, preclinical development, human safety, and efficacy studies. Accurate prediction of clinical trial outcomes may help significantly improve the efficiency of this process by prioritizing therapeutic programs that are more likely to succeed in clinical trials and ultimately benefit patients. Here, we describe inClinico, a transformer-based artificial intelligence software platform designed to predict the outcome of phase II clinical trials. The platform combines an ensemble of clinical trial outcome prediction engines that leverage generative artificial intelligence and multimodal data, including omics, text, clinical trial design, and small molecule properties. inClinico was validated in retrospective, quasi-prospective, and prospective validation studies internally and with pharmaceutical companies and financial institutions. The platform achieved 0.88 receiver operating characteristic area under the curve in predicting the phase II to phase III transition on a quasi-prospective validation dataset. The first prospective predictions were made and placed on date-stamped preprint servers in 2016. To validate our model in a real-world setting, we published forecasted outcomes for several phase II clinical trials, achieving 79% accuracy for the trials that have read out. We also present an investment application of inClinico using a date-stamped virtual trading portfolio, demonstrating a 35% 9-month return on investment.
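
The abstract above reports a receiver operating characteristic area under the curve (ROC AUC) of 0.88. As a minimal sketch of what that number measures, ROC AUC equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, with ties counted as half. The labels and scores below are invented for illustration and have nothing to do with the inClinico data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical trial outcomes (1 = advanced to the next phase) and model scores
outcomes = [1, 1, 0, 0, 1]
predicted = [0.9, 0.4, 0.35, 0.6, 0.8]
auc = roc_auc(outcomes, predicted)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of examples; it is used here for clarity, while library implementations typically sort the scores instead.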

10.
J Am Med Inform Assoc ; 28(10): 2184-2192, 2021 09 18.
Article in English | MEDLINE | ID: mdl-34270701

ABSTRACT

OBJECTIVE: Research on pharmacovigilance from social media data has focused on mining adverse drug events (ADEs) using annotated datasets, with publications generally focusing on 1 of 3 tasks: ADE classification, named entity recognition for identifying the span of ADE mentions, and ADE mention normalization to standardized terminologies. While the common goal of such systems is to detect ADE signals that can be used to inform public policy, it has been impeded largely by limited end-to-end solutions for large-scale analysis of social media reports for different drugs. MATERIALS AND METHODS: We present a dataset for training and evaluation of ADE pipelines where the ADE distribution is closer to the average 'natural balance' with ADEs present in about 7% of the tweets. The deep learning architecture involves an ADE extraction pipeline with individual components for all 3 tasks. RESULTS: The system presented achieved state-of-the-art performance on comparable datasets and scored a classification performance of F1 = 0.63, span extraction performance of F1 = 0.44 and an end-to-end entity resolution performance of F1 = 0.34 on the presented dataset. DISCUSSION: The performance of the models continues to highlight multiple challenges when deploying pharmacovigilance systems that use social media data. We discuss the implications of such models in the downstream tasks of signal detection and suggest future enhancements. CONCLUSION: Mining ADEs from Twitter posts using a pipeline architecture requires the different components to be trained and tuned based on input data imbalance in order to ensure optimal performance on the end-to-end resolution task.


Subjects
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Social Media , Humans , Pharmacovigilance
11.
Epidemiologia (Basel) ; 2(3): 315-324, 2021 Aug 05.
Article in English | MEDLINE | ID: mdl-36417228

ABSTRACT

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This dataset provides a freely available additional data source for researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, emotional and mental responses to social distancing measures, the identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.

12.
ArXiv ; 2020 Nov 13.
Article in English | MEDLINE | ID: mdl-32550247

ABSTRACT

As the COVID-19 pandemic continues its march around the world, an unprecedented amount of open data is being generated for genetics and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated in the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique world-wide event into biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st to April 4th at the time of writing. This open dataset will allow researchers to conduct a number of research projects relating to the emotional and mental responses to social distancing measures, the identification of sources of misinformation, and the stratified measurement of sentiment towards the pandemic in near real time.

13.
J Healthc Eng ; 2017: 9451342, 2017.
Article in English | MEDLINE | ID: mdl-29177027

ABSTRACT

Adverse drug reactions (ADRs) are an essential part of the analysis of drug use, measuring drug use benefits, and making policy decisions. Traditional channels for identifying ADRs are reliable but very slow and only produce a small amount of data. Text reviews, either on specialized web sites or in general-purpose social networks, may lead to a data source of unprecedented size, but identifying ADRs in free-form text is a challenging natural language processing problem. In this work, we propose a novel model for this problem, uniting recurrent neural architectures and conditional random fields. We evaluate our model with a comprehensive experimental study, showing improvements over state-of-the-art methods of ADR extraction.


Subjects
Drug-Related Side Effects and Adverse Reactions , Information Storage and Retrieval/methods , Neural Networks, Computer , Adult , Algorithms , Female , Humans , Male , Middle Aged , Natural Language Processing , Pharmacovigilance , Young Adult