Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 120
Filtrar
Mais filtros

Base de dados
País/Região como assunto
Tipo de documento
Intervalo de ano de publicação
1.
Bioinformatics ; 38(3): 872-874, 2022 01 12.
Artigo em Inglês | MEDLINE | ID: mdl-34636886

RESUMO

SUMMARY: Large-scale pre-trained language models (PLMs) have advanced state-of-the-art (SOTA) performance on various biomedical text mining tasks. The power of such PLMs can be combined with the advantages of deep generative models. These are examples of these combinations. However, they are trained only on general domain text, and biomedical models are still missing. In this work, we describe BioVAE, the first large-scale pre-trained latent variable language model for the biomedical domain, which uses the OPTIMUS framework to train on large volumes of biomedical text. The model shows SOTA performance on several biomedical text mining tasks when compared to existing publicly available biomedical PLMs. In addition, our model can generate more accurate biomedical sentences than the original OPTIMUS output. AVAILABILITY AND IMPLEMENTATION: Our source code and pre-trained models are freely available: https://github.com/aistairc/BioVAE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Mineração de Dados , Idioma , Software , Processamento de Linguagem Natural
2.
J Biomed Inform ; 141: 104347, 2023 05.
Artigo em Inglês | MEDLINE | ID: mdl-37030658

RESUMO

Automatic extraction of patient medication histories from free-text clinical notes can increase the amount of relevant information to clinicians for developing treatment plans. In addition to detecting medication events, clinical text mining systems must also be able to predict event context, such as negation, uncertainty, and time of occurrence, in order to construct accurate patient timelines. Towards this goal, we introduce Levitated Context Markers (LCMs), a novel transformer-based model for contextualized event extraction. LCMs are an adaptation of levitated markers -originally developed for relation extraction- that allow pretrained transformer models to utilize global input representations while also focusing on event-related subspans using a sparse attention mechanism. In addition to outperforming a strong baseline model on the Contextualized Medication Event Dataset, we show that LCMs' sparse attention can provide interpretable predictions by detecting relevant context cues in an unsupervised manner.


Assuntos
Mineração de Dados , Registros , Humanos , Processamento de Linguagem Natural
3.
BMC Bioinformatics ; 23(1): 211, 2022 Jun 02.
Artigo em Inglês | MEDLINE | ID: mdl-35655127

RESUMO

BACKGROUND: Nested and overlapping events are particularly frequent and informative structures in biomedical event extraction. However, state-of-the-art neural models either neglect those structures during learning or use syntactic features and external tools to detect them. To overcome these limitations, this paper presents and compares two neural models: a novel EXhaustive Neural Network (EXNN) and a Search-Based Neural Network (SBNN) for detection of nested and overlapping events. RESULTS: We evaluate the proposed models as an event detection component in isolation and within a pipeline setting. Evaluation in several annotated biomedical event extraction datasets shows that both EXNN and SBNN achieve higher performance in detecting nested and overlapping events, compared to the state-of-the-art model Turku Event Extraction System (TEES). CONCLUSIONS: The experimental results reveal that both EXNN and SBNN are effective for biomedical event extraction. Furthermore, results on a pipeline setting indicate that our models improve detection of events compared to models that use either gold or predicted named entities.


Assuntos
Modelos Biológicos , Redes Neurais de Computação
4.
J Med Internet Res ; 24(10): e40323, 2022 10 05.
Artigo em Inglês | MEDLINE | ID: mdl-36150046

RESUMO

BACKGROUND: In recent years, the COVID-19 pandemic has brought great changes to public health, society, and the economy. Social media provide a platform for people to discuss health concerns, living conditions, and policies during the epidemic, allowing policymakers to use this content to analyze the public emotions and attitudes for decision-making. OBJECTIVE: The aim of this study was to use deep learning-based methods to understand public emotions on topics related to the COVID-19 pandemic in the United Kingdom through a comparative geolocation and text mining analysis on Twitter. METHODS: Over 500,000 tweets related to COVID-19 from 48 different cities in the United Kingdom were extracted, with the data covering the period of the last 2 years (from February 2020 to November 2021). We leveraged three advanced deep learning-based models for topic modeling to geospatially analyze the sentiment, emotion, and topics of tweets in the United Kingdom: SenticNet 6 for sentiment analysis, SpanEmo for emotion recognition, and combined topic modeling (CTM). RESULTS: We observed a significant change in the number of tweets as the epidemiological situation and vaccination situation shifted over the 2 years. There was a sharp increase in the number of tweets from January 2020 to February 2020 due to the outbreak of COVID-19 in the United Kingdom. Then, the number of tweets gradually declined as of February 2020. Moreover, with identification of the COVID-19 Omicron variant in the United Kingdom in November 2021, the number of tweets grew again. Our findings reveal people's attitudes and emotions toward topics related to COVID-19. For sentiment, approximately 60% of tweets were positive, 20% were neutral, and 20% were negative. For emotion, people tended to express highly positive emotions in the beginning of 2020, while expressing highly negative emotions over time toward the end of 2021. The topics also changed during the pandemic. CONCLUSIONS: Through large-scale text mining of Twitter, our study found meaningful differences in public emotions and topics regarding the COVID-19 pandemic among different UK cities. Furthermore, efficient location-based and time-based comparative analysis can be used to track people's thoughts and feelings, and to understand their behaviors. Based on our analysis, positive attitudes were common during the pandemic; optimism and anticipation were the dominant emotions. With the outbreak and epidemiological change, the government developed control measures and vaccination policies, and the topics also shifted over time. Overall, the proportion and expressions of emojis, sentiments, emotions, and topics varied geographically and temporally. Therefore, our approach of exploring public emotions and topics on the pandemic from Twitter can potentially lead to informing how public policies are received in a particular geographical area.


Assuntos
COVID-19 , Mídias Sociais , COVID-19/epidemiologia , Mineração de Dados , Emoções , Humanos , Pandemias , SARS-CoV-2
5.
Bioinformatics ; 36(19): 4910-4917, 2020 12 08.
Artigo em Inglês | MEDLINE | ID: mdl-33141147

RESUMO

MOTIVATION: Recent neural approaches on event extraction from text mainly focus on flat events in general domain, while there are less attempts to detect nested and overlapping events. These existing systems are built on given entities and they depend on external syntactic tools. RESULTS: We propose an end-to-end neural nested event extraction model named DeepEventMine that extracts multiple overlapping directed acyclic graph structures from a raw sentence. On the top of the bidirectional encoder representations from transformers model, our model detects nested entities and triggers, roles, nested events and their modifications in an end-to-end manner without any syntactic tools. Our DeepEventMine model achieves the new state-of-the-art performance on seven biomedical nested event extraction tasks. Even when gold entities are unavailable, our model can detect events from raw text with promising performance. AVAILABILITY AND IMPLEMENTATION: Our codes and models to reproduce the results are available at: https://github.com/aistairc/DeepEventMine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Idioma , Projetos de Pesquisa
6.
Bioinformatics ; 35(10): 1799-1801, 2019 05 15.
Artigo em Inglês | MEDLINE | ID: mdl-30329013

RESUMO

SUMMARY: Although the publication rate of the biomedical literature has been growing steadily during the last decades, the accessibility of pertinent research publications for biologist and medical practitioners remains a challenge. This article describes Thalia, which is a semantic search engine that can recognize eight different types of concepts occurring in biomedical abstracts. Thalia is available via a web-based interface or a RESTful API. A key aspect of our search engine is that it is updated from PubMed on a daily basis. We describe here the main building blocks of our tool as well as an evaluation of the retrieval capabilities of Thalia in the context of a precision medicine dataset. AVAILABILITY AND IMPLEMENTATION: Thalia is available at http://nactem.ac.uk/Thalia_BI/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Ferramenta de Busca , Internet , PubMed , Semântica
7.
Neurocomputing (Amst) ; 413: 431-443, 2020 Nov 06.
Artigo em Inglês | MEDLINE | ID: mdl-33162674

RESUMO

Most deep language understanding models depend only on word representations, which are mainly based on language modelling derived from a large amount of raw text. These models encode distributional knowledge without considering syntactic structural information, although several studies have shown benefits of including such information. Therefore, we propose new syntactically-informed word representations (SIWRs), which allow us to enrich the pre-trained word representations with syntactic information without training language models from scratch. To obtain SIWRs, a graph-based neural model is built on top of either static or contextualised word representations such as GloVe, ELMo and BERT. The model is first pre-trained with only a relatively modest amount of task-independent data that are automatically annotated using existing syntactic tools. SIWRs are then obtained by applying the model to downstream task data and extracting the intermediate word representations. We finally replace word representations in downstream models with SIWRs for applications. We evaluate SIWRs on three information extraction tasks, namely nested named entity recognition (NER), binary and n-ary relation extractions (REs). The results demonstrate that our SIWRs yield performance gains over the base representations in these NLP tasks with 3-9% relative error reduction. Our SIWRs also perform better than fine-tuning BERT in binary RE. We also conduct extensive experiments to analyse the proposed method.

8.
BMC Bioinformatics ; 20(1): 430, 2019 Aug 17.
Artigo em Inglês | MEDLINE | ID: mdl-31419946

RESUMO

*: Background Consisting of dictated free-text documents such as discharge summaries, medical narratives are widely used in medical natural language processing. Relationships between anatomical entities and human body parts are crucial for building medical text mining applications. To achieve this, we establish a mapping system consisting of a Wikipedia-based scoring algorithm and a named entity normalization method (NEN). The mapping system makes full use of information available on Wikipedia, which is a comprehensive Internet medical knowledge base. We also built a new ontology, Tree of Human Body Parts (THBP), from core anatomical parts by referring to anatomical experts and Unified Medical Language Systems (UMLS) to make the mapping system efficacious for clinical treatments. *: Result The gold standard is derived from 50 discharge summaries from our previous work, in which 2,224 anatomical entities are included. The F1-measure of the baseline system is 70.20%, while our algorithm based on Wikipedia achieves 86.67% with the assistance of NEN. *: Conclusions We construct a framework to map anatomical entities to THBP ontology using normalization and a scoring algorithm based on Wikipedia. The proposed framework is proven to be much more effective and efficient than the main baseline system.


Assuntos
Anatomia , Mineração de Dados , Corpo Humano , Bases de Conhecimento , Alta do Paciente , Algoritmos , Humanos
9.
Bioinformatics ; 34(8): 1389-1397, 2018 04 15.
Artigo em Inglês | MEDLINE | ID: mdl-29228271

RESUMO

Motivation: Pathway models are valuable resources that help us understand the various mechanisms underpinning complex biological processes. Their curation is typically carried out through manual inspection of published scientific literature to find information relevant to a model, which is a laborious and knowledge-intensive task. Furthermore, models curated manually cannot be easily updated and maintained with new evidence extracted from the literature without automated support. Results: We have developed LitPathExplorer, a visual text analytics tool that integrates advanced text mining, semi-supervised learning and interactive visualization, to facilitate the exploration and analysis of pathway models using statements (i.e. events) extracted automatically from the literature and organized according to levels of confidence. LitPathExplorer supports pathway modellers and curators alike by: (i) extracting events from the literature that corroborate existing models with evidence; (ii) discovering new events which can update models; and (iii) providing a confidence value for each event that is automatically computed based on linguistic features and article metadata. Our evaluation of event extraction showed a precision of 89% and a recall of 71%. Evaluation of our confidence measure, when used for ranking sampled events, showed an average precision ranging between 61 and 73%, which can be improved to 95% when the user is involved in the semi-supervised learning process. Qualitative evaluation using pair analytics based on the feedback of three domain experts confirmed the utility of our tool within the context of pathway model exploration. Availability and implementation: LitPathExplorer is available at http://nactem.ac.uk/LitPathExplorer_BI/. Contact: sophia.ananiadou@manchester.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Gráficos por Computador , Mineração de Dados/métodos , Aprendizado de Máquina Supervisionado , Publicações
10.
BMC Med Inform Decis Mak ; 19(Suppl 7): 273, 2019 12 23.
Artigo em Inglês | MEDLINE | ID: mdl-31865903

RESUMO

BACKGROUND: Clinical Named Entity Recognition is to find the name of diseases, body parts and other related terms from the given text. Because Chinese language is quite different with English language, the machine cannot simply get the graphical and phonetic information form Chinese characters. The method for Chinese should be different from that for English. Chinese characters present abundant information with the graphical features, recent research on Chinese word embedding tries to use graphical information as subword. This paper uses both graphical and phonetic features to improve Chinese Clinical Named Entity Recognition based on the presence of phono-semantic characters. METHODS: This paper proposed three different embedding models and tested them on the annotated data. The data have been divided into two sections for exploring the effect of the proportion of phono-semantic characters. RESULTS: The model using primary radical and pinyin can improve Clinical Named Entity Recognition in Chinese and get the F-measure of 0.712. More phono-semantic characters does not give a better result. CONCLUSIONS: The paper proves that the use of the combination of graphical and phonetic features can improve the Clinical Named Entity Recognition in Chinese.


Assuntos
Idioma , Aprendizado de Máquina , Processamento de Linguagem Natural , Fonética , Curadoria de Dados , Registros Eletrônicos de Saúde , Humanos , Semântica
11.
BMC Med Inform Decis Mak ; 19(1): 256, 2019 12 05.
Artigo em Inglês | MEDLINE | ID: mdl-31805934

RESUMO

BACKGROUND: Machine learning can assist with multiple tasks during systematic reviews to facilitate the rapid retrieval of relevant references during screening and to identify and extract information relevant to the study characteristics, which include the PICO elements of patient/population, intervention, comparator, and outcomes. The latter requires techniques for identifying and categorising fragments of text, known as named entity recognition. METHODS: A publicly available corpus of PICO annotations on biomedical abstracts is used to train a named entity recognition model, which is implemented as a recurrent neural network. This model is then applied to a separate collection of abstracts for references from systematic reviews within biomedical and health domains. The occurrences of words tagged in the context of specific PICO contexts are used as additional features for a relevancy classification model. Simulations of the machine learning-assisted screening are used to evaluate the work saved by the relevancy model with and without the PICO features. Chi-squared and statistical significance of positive predicted values are used to identify words that are more indicative of relevancy within PICO contexts. RESULTS: Inclusion of PICO features improves the performance metric on 15 of the 20 collections, with substantial gains on certain systematic reviews. Examples of words whose PICO context are more precise can explain this increase. CONCLUSIONS: Words within PICO tagged segments in abstracts are predictive features for determining inclusion. Combining PICO annotation model into the relevancy classification pipeline is a promising approach. The annotations may be useful on their own to aid users in pinpointing necessary information for data extraction, or to facilitate semantic search.


Assuntos
Bases de Dados Genéticas , Disseminação de Informação , Aprendizado de Máquina , Semântica , Revisões Sistemáticas como Assunto , Humanos
12.
Bioinformatics ; 33(23): 3784-3792, 2017 Dec 01.
Artigo em Inglês | MEDLINE | ID: mdl-29036627

RESUMO

MOTIVATION: In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. RESULTS: We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. AVAILABILITY AND IMPLEMENTATION: The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. CONTACT: sophia.ananiadou@manchester.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Mineração de Dados/métodos , Aprendizado de Máquina , Incerteza , Publicações
13.
Diabetes Obes Metab ; 20(7): 1787-1792, 2018 07.
Artigo em Inglês | MEDLINE | ID: mdl-29536603

RESUMO

Biosimilar insulins have expanded the treatment options for diabetes. We compared the clinical efficacy and safety of biosimilar insulins with those of originator insulins by conducting a meta-analysis. A random-effects meta-analysis was performed on randomized controlled trials comparing biosimilar and originator insulins in adults with diabetes. Studies were obtained by searching electronic databases up to December 2017. Ten trials, in a total of 4935 patients, were assessed (2 trials each on LY2963016, MK-1293, Mylan's insulin glargine and SAR342434, and 1 trial each on FFP-112 and Basalog). The meta-analysis found no differences between long-acting biosimilar and originator insulins with regard to reduction in glycated haemoglobin at 24 weeks (0.04%, 95% confidence interval [CI] -0.01, 0.08; P for efficacy = .14, I2 = 0%) or at 52 weeks (0.03%, 95% CI -0.04, 0.1), or reduction in fasting plasma glucose (0.08 mmol/L, 95% CI 0.36, 0.53), hypoglycaemia (odds ratio 0.99, 95% CI 0.96, 1.03), mortality, injection site reactions, insulin antibodies and allergic reactions. Analyses stratified by type of diabetes and prior insulin use yielded similar findings. Similarly, no significant differences were found between short-acting biosimilar and originator insulins. In summary, our meta-analysis showed no significant differences in clinical efficacy and safety, including immune reactions, between biosimilar and originator insulins. Biosimilar insulins can increase access to modern insulin therapy and reduce medical costs.


Assuntos
Medicamentos Biossimilares/uso terapêutico , Diabetes Mellitus/tratamento farmacológico , Hipoglicemiantes/uso terapêutico , Insulina Glargina/análogos & derivados , Insulina Glargina/uso terapêutico , Insulina Lispro/uso terapêutico , Glicemia/metabolismo , Diabetes Mellitus/metabolismo , Hemoglobinas Glicadas/metabolismo , Humanos , Hipoglicemia/induzido quimicamente , Insulina/uso terapêutico , Anticorpos Anti-Insulina/imunologia
14.
BMC Med Inform Decis Mak ; 18(1): 46, 2018 06 25.
Artigo em Inglês | MEDLINE | ID: mdl-29940927

RESUMO

BACKGROUND: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author's intended knowledge gain) and New Knowledge (an author's findings). The method incorporates various features, including a combination of simple MK dimensions. METHODS: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated. RESULTS: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method is able to improve upon the previously reported state of the art performance for an existing dimension, i.e., Knowledge Type. Secondly, we also demonstrate high F1-score in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR 0.802) and New Knowledge (GENIA: 0.829, EU-ADR 0.836). CONCLUSION: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications.


Assuntos
Pesquisa Biomédica/métodos , Mineração de Dados/métodos , Conhecimento , Projetos de Pesquisa , Máquina de Vetores de Suporte , Humanos
15.
Clin Sci (Lond) ; 131(20): 2525-2532, 2017 Oct 15.
Artigo em Inglês | MEDLINE | ID: mdl-29026002

RESUMO

BACKGROUND: Findings from in vivo research may be less reliable where studies do not report measures to reduce risks of bias. The experimental stroke community has been at the forefront of implementing changes to improve reporting, but it is not known whether these efforts are associated with continuous improvements. Our aims here were firstly to validate an automated tool to assess risks of bias in published works, and secondly to assess the reporting of measures taken to reduce the risk of bias within recent literature for two experimental models of stroke. METHODS: We developed and used text analytic approaches to automatically ascertain reporting of measures to reduce risk of bias from full-text articles describing animal experiments inducing middle cerebral artery occlusion (MCAO) or modelling lacunar stroke. RESULTS: Compared with previous assessments, there were improvements in the reporting of measures taken to reduce risks of bias in the MCAO literature but not in the lacunar stroke literature. Accuracy of automated annotation of risk of bias in the MCAO literature was 86% (randomization), 94% (blinding) and 100% (sample size calculation); and in the lacunar stroke literature accuracy was 67% (randomization), 91% (blinding) and 96% (sample size calculation). DISCUSSION: There remains substantial opportunity for improvement in the reporting of animal research modelling stroke, particularly in the lacunar stroke literature. Further, automated tools perform sufficiently well to identify whether studies report blinded assessment of outcome, but improvements are required in the tools to ascertain whether randomization and a sample size calculation were reported.


Assuntos
Isquemia Encefálica/complicações , Infarto da Artéria Cerebral Média/complicações , Acidente Vascular Cerebral/complicações , Animais , Viés , Modelos Animais de Doenças , Humanos , Risco
16.
J Biomed Inform ; 72: 67-76, 2017 08.
Artigo em Inglês | MEDLINE | ID: mdl-28648605

RESUMO

Citation screening, an integral process within systematic reviews that identifies citations relevant to the underlying research question, is a time-consuming and resource-intensive task. During the screening task, analysts manually assign a label to each citation, to designate whether a citation is eligible for inclusion in the review. Recently, several studies have explored the use of active learning in text classification to reduce the human workload involved in the screening task. However, existing approaches require a significant amount of manually labelled citations for the text classification to achieve a robust performance. In this paper, we propose a semi-supervised method that identifies relevant citations as early as possible in the screening process by exploiting the pairwise similarities between labelled and unlabelled citations to improve the classification performance without additional manual labelling effort. Our approach is based on the hypothesis that similar citations share the same label (e.g., if one citation should be included, then other similar citations should be included also). To calculate the similarity between labelled and unlabelled citations we investigate two different feature spaces, namely a bag-of-words and a spectral embedding based on the bag-of-words. The semi-supervised method propagates the classification codes of manually labelled citations to neighbouring unlabelled citations in the feature space. The automatically labelled citations are combined with the manually labelled citations to form an augmented training set. For evaluation purposes, we apply our method to reviews from clinical and public health. The results show that our semi-supervised method with label propagation achieves statistically significant improvements over two state-of-the-art active learning approaches across both clinical and public health reviews.


Assuntos
Literatura de Revisão como Assunto , Automação , Curadoria de Dados , Humanos , Processamento de Linguagem Natural
17.
J Biomed Inform ; 62: 59-65, 2016 08.
Artigo em Inglês | MEDLINE | ID: mdl-27293211

RESUMO

Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all relevant articles to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies, to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We firstly represent documents within the vector space, and cluster the documents into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). Results obtained demonstrate that our method is able to achieve a high sensitivity of eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/.


Assuntos
Aprendizado de Máquina , Semântica , Classificação , Humanos , Literatura de Revisão como Assunto , Máquina de Vetores de Suporte
18.
J Biomed Inform ; 62: 148-58, 2016 08.
Artigo em Inglês | MEDLINE | ID: mdl-27363901

RESUMO

OBJECTIVE: The abundance of text available in social media and health related forums along with the rich expression of public opinion have recently attracted the interest of the public health community to use these sources for pharmacovigilance. Based on the intuition that patients post about Adverse Drug Reactions (ADRs) expressing negative sentiments, we investigate the effect of sentiment analysis features in locating ADR mentions. METHODS: We enrich the feature space of a state-of-the-art ADR identification method with sentiment analysis features. Using a corpus of posts from the DailyStrength forum and tweets annotated for ADR and indication mentions, we evaluate the extent to which sentiment analysis features help in locating ADR mentions and distinguishing them from indication mentions. RESULTS: Evaluation results show that sentiment analysis features marginally improve ADR identification in tweets and health related forum posts. Adding sentiment analysis features achieved a statistically significant F-measure increase from 72.14% to 73.22% in the Twitter part of an existing corpus using its original train/test split. Using stratified 10×10-fold cross-validation, statistically significant F-measure increases were shown in the DailyStrength part of the corpus, from 79.57% to 80.14%, and in the Twitter part of the corpus, from 66.91% to 69.16%. Moreover, sentiment analysis features are shown to reduce the number of ADRs being recognized as indications. CONCLUSION: This study shows that adding sentiment analysis features can marginally improve the performance of even a state-of-the-art ADR identification method. This improvement can be of use to pharmacovigilance practice, due to the rapidly increasing popularity of social media and health forums.


Assuntos
Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos , Farmacovigilância , Mídias Sociais , Humanos , Internet , Saúde Pública
19.
BMC Bioinformatics ; 16 Suppl 10: S7, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-26201408

RESUMO

BACKGROUND: Biomedical event extraction has been a major focus of biomedical natural language processing (BioNLP) research since the first BioNLP shared task was held in 2009. Accordingly, a large number of event extraction systems have been developed. Most such systems, however, have been developed for specific tasks and/or incorporated task specific settings, making their application to new corpora and tasks problematic without modification of the systems themselves. There is thus a need for event extraction systems that can achieve high levels of accuracy when applied to corpora in new domains, without the need for exhaustive tuning or modification, whilst retaining competitive levels of performance. RESULTS: We have enhanced our state-of-the-art event extraction system, EventMine, to alleviate the need for task-specific tuning. Task-specific details are specified in a configuration file, while extensive task-specific parameter tuning is avoided through the integration of a weighting method, a covariate shift method, and their combination. The task-specific configuration and weighting method have been employed within the context of two different sub-tasks of BioNLP shared task 2013, i.e. Cancer Genetics (CG) and Pathway Curation (PC), removing the need to modify the system specifically for each task. With minimal task specific configuration and tuning, EventMine achieved the 1st place in the PC task, and 2nd in the CG, achieving the highest recall for both tasks. The system has been further enhanced following the shared task by incorporating the covariate shift method and entity generalisations based on the task definitions, leading to further performance improvements. CONCLUSIONS: We have shown that it is possible to apply a state-of-the-art event extraction system to new tasks with high levels of performance, without having to modify the system internally. Both covariate shift and weighting methods are useful in facilitating the production of high recall systems. These methods and their combination can adapt a model to the target data with no deep tuning and little manual configuration.


Assuntos
Redes Reguladoras de Genes , Genes , Armazenamento e Recuperação da Informação , Modelos Teóricos , Processamento de Linguagem Natural , Neoplasias/genética , Neoplasias/patologia , Humanos , Bases de Conhecimento
20.
BMC Bioinformatics ; 16: 149, 2015 May 09.
Artigo em Inglês | MEDLINE | ID: mdl-25956056

RESUMO

BACKGROUND: Electronic medical record (EMR) systems have become widely used throughout the world to improve the quality of healthcare and the efficiency of hospital services. A bilingual medical lexicon of Chinese and English is needed to meet the demand for the multi-lingual and multi-national treatment. We make efforts to extract a bilingual lexicon from English and Chinese discharge summaries with a small seed lexicon. The lexical terms can be classified into two categories: single-word terms (SWTs) and multi-word terms (MWTs). For SWTs, we use a label propagation (LP; context-based) method to extract candidates of translation pairs. For MWTs, which are pervasive in the medical domain, we propose a term alignment method, which firstly obtains translation candidates for each component word of a Chinese MWT, and then generates their combinations, from which the system selects a set of plausible translation candidates. RESULTS: We compare our LP method with a baseline method based on simple context-similarity. The LP based method outperforms the baseline with the accuracies: 4.44% Acc1, 24.44% Acc10, and 62.22% Acc100, where AccN means the top N accuracy. The accuracy of the LP method drops to 5.41% Acc10 and 8.11% Acc20 for MWTs. Our experiments show that the method based on term alignment improves the performance for MWTs to 16.22% Acc10 and 27.03% Acc20. CONCLUSIONS: We constructed a framework for building an English-Chinese term dictionary from discharge summaries in the two languages. Our experiments have shown that the LP-based method augmented with the term alignment method will contribute to reduction of manual work required to compile a bilingual sydictionary of clinical terms.


Assuntos
Multilinguismo , Processamento de Linguagem Natural , Alta do Paciente/normas , Software , Tradução , Povo Asiático , Inglaterra , Humanos , Armazenamento e Recuperação da Informação , Informática Médica
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA