Results 1 - 20 of 35
1.
J Biomed Inform ; 143: 104433, 2023 07.
Article in English | MEDLINE | ID: mdl-37385326

ABSTRACT

MOTIVATION: Entity linking is the task of linking entity mentions to their corresponding database entries. It enables superficially different but semantically identical mentions to be treated as the same entity. Since millions of concepts are listed in biomedical databases, selecting the correct database entry for each targeted entity is challenging. Simple string matching between a mention and each synonym in biomedical databases is insufficient to handle the wide variety of variants of biomedical entities appearing in the biomedical literature. Recent progress in neural approaches is promising for entity linking. Still, existing neural methods require sufficient training data, which is difficult to prepare for biomedical entity linking because it deals with millions of biomedical concepts. Therefore, we need to develop a new neural method to train entity-linking models over sparse training data covering a very limited part of the biomedical concepts. RESULTS: We have devised a pure neural model that classifies biomedical entity mentions into millions of biomedical concepts. The classifier employs (1) layer overwriting, which breaks through the performance ceiling during training, (2) training data augmentation using database entries, which compensates for insufficient training data, and (3) a cosine similarity-based loss function, which helps distinguish the millions of biomedical concepts. Our system using the proposed classifier was ranked first in the official run of the National NLP Clinical Challenges (n2c2) 2019 Track 3, which targeted linking medical/clinical entity mentions to 434,056 Concept Unique Identifier (CUI) entries. We also applied our system to the MedMentions dataset, which has 3.2M candidate concepts. Experimental results confirmed the same advantages of our proposed method. We further evaluated our system on the NLM-CHEM corpus with 350K candidate concepts, and our system achieved a new state-of-the-art performance on the corpus. AVAILABILITY: https://github.com/tti-coin/bio-linking CONTACT: makoto.miwa@toyota-ti.ac.jp.
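To illustrate what a cosine similarity-based loss over candidate concepts could look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy embeddings, and the `scale` factor are all assumptions made for illustration, and the abstract does not specify the exact form of the loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_softmax_loss(mention, concepts, gold, scale=10.0):
    """Cross-entropy over scaled cosine similarities.

    `scale` is a hypothetical temperature-like factor, not a value
    taken from the paper.
    """
    logits = [scale * cosine(mention, c) for c in concepts]
    top = max(logits)
    # Numerically stable log-sum-exp over all candidate concepts.
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[gold]

def link(mention, concepts):
    """Return the index of the concept most similar to the mention."""
    return max(range(len(concepts)), key=lambda i: cosine(mention, concepts[i]))

# Toy data: three concept embeddings and one mention embedding.
concepts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mention = [0.9, 0.1]
predicted = link(mention, concepts)
```

Because cosine similarity ignores vector length, the score depends only on direction, which is one plausible reason such a loss helps when a classifier must separate a very large number of concepts.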


Subjects
Data Mining , Semantics , Data Mining/methods , Factual Databases
2.
J Biomed Inform ; 144: 104416, 2023 08.
Article in English | MEDLINE | ID: mdl-37321443

ABSTRACT

This paper describes contextualized medication event extraction for automatically identifying medication change events and their contexts in clinical notes. The striding named entity recognition (NER) model extracts medication name spans from an input text sequence using a sliding-window approach. Specifically, the striding NER model separates the input sequence into a set of overlapping subsequences of 512 tokens with a stride of 128 tokens, processes each subsequence using a large pre-trained language model, and aggregates the outputs from the subsequences. Event and context classification is performed with multi-turn question-answering (QA) and span-based models. The span-based model classifies the span of each medication name using the span representation of the language model. In the QA model, event classification is augmented with questions about the change events of each medication name and the context of those events, while the model architecture remains a classification style identical to that of the span-based model. We evaluated our extraction system on the n2c2 2022 Track 1 dataset, which is annotated for medication extraction (ME), event classification (EC), and context classification (CC) from clinical notes. Our system is a pipeline of the striding NER model for ME and an ensemble of the span-based and QA-based models for EC and CC. Our system achieved a combined F-score of 66.47% for end-to-end contextualized medication event extraction (Release 1), which is the highest score among the participants of the n2c2 2022 Track 1.
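The sliding-window step can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that "stride" means the distance between the starts of consecutive windows, and it assumes that per-token outputs are aggregated by averaging over all windows covering a token. The abstract does not state either detail, and the function names are hypothetical.

```python
def striding_windows(tokens, window=512, stride=128):
    """Split `tokens` into overlapping windows.

    Assumption: consecutive windows start `stride` tokens apart.
    Returns a list of (start_index, window_tokens) pairs.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + window, len(tokens))
        windows.append((start, tokens[start:end]))
        if end == len(tokens):
            break  # the last window reaches the end of the sequence
        start += stride
    return windows

def aggregate(num_tokens, window_scores):
    """Average per-token scores over every window that covers the token.

    `window_scores` is a list of (start_index, scores) pairs, where
    `scores` holds one number per token in that window.
    """
    totals = [0.0] * num_tokens
    counts = [0] * num_tokens
    for start, scores in window_scores:
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Usage: a 1000-token note yields windows starting at 0, 128, 256, 384, 512.
tokens = list(range(1000))
windows = striding_windows(tokens)
# Stand-in for model outputs: every token gets a score of 1.0 in each window.
scores = aggregate(len(tokens), [(s, [1.0] * len(w)) for s, w in windows])
```

Overlapping windows let a token near a window boundary also appear with fuller context in a neighbouring window, which is the usual motivation for this design.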


Subjects
Medication Systems , Natural Language Processing , Humans , Language , Data Mining , Electronic Health Records
3.
Sci Rep ; 13(1): 5986, 2023 04 12.
Article in English | MEDLINE | ID: mdl-37045907

ABSTRACT

Idiopathic pulmonary fibrosis (IPF) is a severe and progressive chronic fibrosing interstitial lung disease whose causes have remained unclear to date. Development of effective treatments will require elucidation of the detailed pathogenetic mechanisms of IPF at both the molecular and cellular levels. With a biomedical corpus that includes IPF-related entities and events, text-mining systems can efficiently extract such mechanism-related information from the huge body of literature on the disease. To clarify IPF-related pathogenetic mechanisms, a novel corpus consisting of 150 abstracts with 9297 entities was constructed for training a text-mining system. For this corpus, entity information was annotated, as were relation and event information. To construct IPF-related networks, we also conducted entity normalization, assigning IDs to entities. This allowed us to identify mentions of the same entity that are expressed differently. Moreover, in contrast to existing corpora, IPF-related events are defined in this corpus. This corpus will be useful for extracting IPF-related information from scientific texts. Because many entities and events are related to lung diseases, this freely available corpus can also be used to extract information related to other lung diseases such as lung cancer and interstitial pneumonia caused by COVID-19.


Subjects
COVID-19 , Idiopathic Pulmonary Fibrosis , Interstitial Lung Diseases , Lung Neoplasms , Humans , Idiopathic Pulmonary Fibrosis/pathology , Data Mining
4.
J Biomed Inform ; 141: 104347, 2023 05.
Article in English | MEDLINE | ID: mdl-37030658

ABSTRACT

Automatic extraction of patient medication histories from free-text clinical notes can increase the amount of relevant information available to clinicians for developing treatment plans. In addition to detecting medication events, clinical text mining systems must also be able to predict event context, such as negation, uncertainty, and time of occurrence, in order to construct accurate patient timelines. Towards this goal, we introduce Levitated Context Markers (LCMs), a novel transformer-based model for contextualized event extraction. LCMs are an adaptation of levitated markers (originally developed for relation extraction) that allow pretrained transformer models to utilize global input representations while also focusing on event-related subspans using a sparse attention mechanism. In addition to outperforming a strong baseline model on the Contextualized Medication Event Dataset, we show that LCMs' sparse attention can provide interpretable predictions by detecting relevant context cues in an unsupervised manner.


Subjects
Data Mining , Records , Humans , Natural Language Processing
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36416141

ABSTRACT

MOTIVATION: Most of the conventional deep neural network-based methods for drug-drug interaction (DDI) extraction consider only context information around drug mentions in the text. However, human experts use heterogeneous background knowledge about drugs to comprehend pharmaceutical papers and extract relationships between drugs. Therefore, we propose a novel method that simultaneously considers various heterogeneous information for DDI extraction from the literature. RESULTS: We first construct drug representations by conducting the link prediction task on a heterogeneous pharmaceutical knowledge graph (KG) dataset. We then effectively combine the text information of input sentences in the corpus and the information on drugs in the heterogeneous KG (HKG) dataset. Finally, we evaluate our DDI extraction method on the DDIExtraction-2013 shared task dataset. In the experiment, integrating heterogeneous drug information significantly improves the DDI extraction performance, and we achieved an F-score of 85.40%, which results in state-of-the-art performance. We evaluated our method on the DrugProt dataset and improved the performance significantly, achieving an F-score of 77.9%. Further analysis showed that each type of node in the HKG contributes to the performance improvement of DDI extraction, indicating the importance of considering multiple pieces of information. AVAILABILITY AND IMPLEMENTATION: Our code is available at https://github.com/tticoin/HKG-DDIE.git.


Subjects
Data Mining , Automated Pattern Recognition , Humans , Automated Pattern Recognition/methods , Data Mining/methods , Drug Interactions , Computer Neural Networks , Pharmaceutical Preparations
6.
BMC Bioinformatics ; 23(1): 211, 2022 Jun 02.
Article in English | MEDLINE | ID: mdl-35655127

ABSTRACT

BACKGROUND: Nested and overlapping events are particularly frequent and informative structures in biomedical event extraction. However, state-of-the-art neural models either neglect those structures during learning or use syntactic features and external tools to detect them. To overcome these limitations, this paper presents and compares two neural models: a novel EXhaustive Neural Network (EXNN) and a Search-Based Neural Network (SBNN) for detection of nested and overlapping events. RESULTS: We evaluate the proposed models as an event detection component in isolation and within a pipeline setting. Evaluation in several annotated biomedical event extraction datasets shows that both EXNN and SBNN achieve higher performance in detecting nested and overlapping events, compared to the state-of-the-art model Turku Event Extraction System (TEES). CONCLUSIONS: The experimental results reveal that both EXNN and SBNN are effective for biomedical event extraction. Furthermore, results on a pipeline setting indicate that our models improve detection of events compared to models that use either gold or predicted named entities.


Subjects
Biological Models , Computer Neural Networks
7.
Bioinformatics ; 38(3): 872-874, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34636886

ABSTRACT

SUMMARY: Large-scale pre-trained language models (PLMs) have advanced state-of-the-art (SOTA) performance on various biomedical text mining tasks. The power of such PLMs can be combined with the advantages of deep generative models. Examples of such combinations exist, but they are trained only on general-domain text, and biomedical models are still missing. In this work, we describe BioVAE, the first large-scale pre-trained latent variable language model for the biomedical domain, which uses the OPTIMUS framework to train on large volumes of biomedical text. The model shows SOTA performance on several biomedical text mining tasks when compared to existing publicly available biomedical PLMs. In addition, our model can generate more accurate biomedical sentences than the original OPTIMUS output. AVAILABILITY AND IMPLEMENTATION: Our source code and pre-trained models are freely available: https://github.com/aistairc/BioVAE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Mining , Language , Software , Natural Language Processing
8.
Front Res Metr Anal ; 6: 670206, 2021.
Article in English | MEDLINE | ID: mdl-34278204

ABSTRACT

We deal with a heterogeneous pharmaceutical knowledge-graph containing textual information built from several databases. The knowledge graph is a heterogeneous graph that includes a wide variety of concepts and attributes, some of which are provided in the form of textual pieces of information which have not been targeted in the conventional graph completion tasks. To investigate the utility of textual information for knowledge graph completion, we generate embeddings from textual descriptions given to heterogeneous items, such as drugs and proteins, while learning knowledge graph embeddings. We evaluate the obtained graph embeddings on the link prediction task for knowledge graph completion, which can be used for drug discovery and repurposing. We also compare the results with existing methods and discuss the utility of the textual information.

9.
Bioinformatics ; 37(12): 1739-1746, 2021 07 19.
Article in English | MEDLINE | ID: mdl-33098410

ABSTRACT

MOTIVATION: Neural methods to extract drug-drug interactions (DDIs) from literature require a large number of annotations. In this study, we propose a novel method to effectively utilize external drug database information as well as information from large-scale plain text for DDI extraction. Specifically, we focus on drug description and molecular structure information as the drug database information. RESULTS: We evaluated our approach on the DDIExtraction 2013 shared task dataset and obtained the following results. First, large-scale raw text information can greatly improve the performance of DDI extraction when combined with the existing model, achieving state-of-the-art performance. Second, drug description information and molecular structure information each help to further improve DDI extraction performance for some specific DDI types. Finally, the simultaneous use of the drug description and molecular structure information can significantly improve the performance on all the DDI types. We showed that the plain text, the drug description information and the molecular structure information are complementary and that their effective combination is essential for the improvement. AVAILABILITY AND IMPLEMENTATION: Our code is available at https://github.com/tticoin/DESC_MOL-DDIE.


Subjects
Data Mining , Pharmaceutical Preparations , Drug Interactions , Molecular Structure , Publications
10.
Neurocomputing (Amst) ; 413: 431-443, 2020 Nov 06.
Article in English | MEDLINE | ID: mdl-33162674

ABSTRACT

Most deep language understanding models depend only on word representations, which are mainly based on language modelling derived from a large amount of raw text. These models encode distributional knowledge without considering syntactic structural information, although several studies have shown benefits of including such information. Therefore, we propose new syntactically-informed word representations (SIWRs), which allow us to enrich the pre-trained word representations with syntactic information without training language models from scratch. To obtain SIWRs, a graph-based neural model is built on top of either static or contextualised word representations such as GloVe, ELMo and BERT. The model is first pre-trained with only a relatively modest amount of task-independent data that are automatically annotated using existing syntactic tools. SIWRs are then obtained by applying the model to downstream task data and extracting the intermediate word representations. We finally replace word representations in downstream models with SIWRs for applications. We evaluate SIWRs on three information extraction tasks, namely nested named entity recognition (NER), binary and n-ary relation extractions (REs). The results demonstrate that our SIWRs yield performance gains over the base representations in these NLP tasks with 3-9% relative error reduction. Our SIWRs also perform better than fine-tuning BERT in binary RE. We also conduct extensive experiments to analyse the proposed method.

11.
Bioinformatics ; 36(19): 4910-4917, 2020 12 08.
Article in English | MEDLINE | ID: mdl-33141147

ABSTRACT

MOTIVATION: Recent neural approaches to event extraction from text mainly focus on flat events in the general domain, while there are fewer attempts to detect nested and overlapping events. These existing systems are built on given entities and depend on external syntactic tools. RESULTS: We propose an end-to-end neural nested event extraction model named DeepEventMine that extracts multiple overlapping directed acyclic graph structures from a raw sentence. On top of the bidirectional encoder representations from transformers model, our model detects nested entities and triggers, roles, nested events and their modifications in an end-to-end manner without any syntactic tools. Our DeepEventMine model achieves new state-of-the-art performance on seven biomedical nested event extraction tasks. Even when gold entities are unavailable, our model can detect events from raw text with promising performance. AVAILABILITY AND IMPLEMENTATION: Our code and models to reproduce the results are available at: https://github.com/aistairc/DeepEventMine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Language , Research Design
12.
J Am Med Inform Assoc ; 27(1): 39-46, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31390003

ABSTRACT

OBJECTIVE: Identification of drugs, associated medication entities, and interactions among them is crucial to prevent unwanted effects of drug therapy, known as adverse drug events. This article describes our participation in the n2c2 shared task on extracting relations between medication-related entities in electronic health records. MATERIALS AND METHODS: We proposed an ensemble approach for relation extraction and classification between drugs and medication-related entities. We incorporated state-of-the-art named-entity recognition (NER) models based on bidirectional long short-term memory (BiLSTM) networks and conditional random fields (CRF) for end-to-end extraction. We additionally developed separate models for intra- and inter-sentence relation extraction and combined them using an ensemble method. The intra-sentence models rely on BiLSTM networks and attention mechanisms and are able to capture dependencies between multiple related pairs in the same sentence. For the inter-sentence relations, we adopted a neural architecture that utilizes the Transformer network to improve performance in longer sequences. RESULTS: Our team ranked third with micro-averaged F1 scores of 94.72% and 87.65% for relation and end-to-end relation extraction, respectively (Tracks 2 and 3). Our ensemble effectively takes advantage of our proposed models. Analysis of the reported results indicated that our proposed approach is more generalizable than the top-performing system, which employs additional training data- and corpus-driven processing techniques. CONCLUSIONS: We proposed a relation extraction system to identify relations between drugs and medication-related entities. The proposed approach is independent of external syntactic tools. Analysis showed that by using latent Drug-Drug interactions we were able to significantly improve the performance of non-Drug-Drug pairs in EHRs.


Subjects
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Drug Interactions , Humans , Computer Neural Networks
13.
J Am Med Inform Assoc ; 27(1): 22-30, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31197355

ABSTRACT

OBJECTIVE: This article describes an ensembling system to automatically extract adverse drug events and drug-related entities from clinical narratives, which was developed for the 2018 n2c2 Shared Task Track 2. MATERIALS AND METHODS: We designed a neural model to tackle both nested entities (entities embedded in other entities) and polysemous entities (entities annotated with multiple semantic types) based on MIMIC III discharge summaries. To better represent rare and unknown words in entities, we further tokenized the MIMIC III data set by splitting the words into finer-grained subwords. We finally combined all the models to boost the performance. Additionally, we implemented a feature-based conditional random field model and created an ensemble to combine its predictions with those of the neural model. RESULTS: Our method achieved a lenient micro F1-score of 92.78%, with lenient precision of 95.99% and lenient recall of 89.79%. Experimental results showed that combining the predictions of either multiple models or a single model with different settings can improve performance. DISCUSSION: Analysis of the development set showed that our neural models can detect more informative text regions than feature-based conditional random field models. Furthermore, most entity types significantly benefit from subword representation, which also allows us to extract sparse entities, especially nested entities. CONCLUSION: The overall results have demonstrated that the ensemble method can accurately recognize entities, including nested and polysemous entities. Additionally, our method can recognize sparse entities by reconsidering the clinical narratives at a finer-grained subword level, rather than at the word level.
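As a consistency check on the reported figures, recall that the F1-score is the harmonic mean of precision P and recall R. The worked calculation below uses the rounded values quoted in the abstract; it is an illustration added here, not a computation taken from the paper.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 95.99 \times 89.79}{95.99 + 89.79}
    = \frac{17237.88}{185.78}
    \approx 92.79
```

This agrees with the reported 92.78% to within the rounding of the precision and recall inputs.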


Subjects
Drug-Related Side Effects and Adverse Reactions , Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Computer Neural Networks , Humans , Narration
14.
J Biomed Inform ; 62: 59-65, 2016 08.
Article in English | MEDLINE | ID: mdl-27293211

ABSTRACT

Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all articles relevant to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We first represent documents within the vector space and cluster them into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). The results demonstrate that our method achieves high sensitivity for eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/.
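The cluster-then-mix idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-dimensional vectors stand in for neural document embeddings, plain k-means is used as the clustering step, and the softmax over negative centroid distances is a hypothetical way to turn centroids into a topic mixture, since the abstract does not say how the mixture weights are computed.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means.

    Centroids are initialised from the first k points so that the
    sketch is deterministic (an assumption made for illustration).
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def topic_mixture(doc, centroids):
    """Softmax over negative distances: one weight per latent topic."""
    scores = [-dist(doc, c) for c in centroids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy document vectors forming two obvious groups.
docs = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids = kmeans(docs, k=2)          # latent topics
mixture = topic_mixture([0, 0], centroids)
```

The resulting mixture vector has one entry per topic and sums to 1, so it can be fed to a downstream classifier as a compact document representation.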


Subjects
Machine Learning , Semantics , Classification , Humans , Review Literature as Topic , Support Vector Machine
15.
Bull Environ Contam Toxicol ; 96(4): 524-9, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26728279

ABSTRACT

The potential for the formation of chlorinated polycyclic aromatic hydrocarbons via photochlorination of polycyclic aromatic hydrocarbons (PAHs) has been investigated in milli-Q water/synthetic water containing NaCl and PAHs with either UV or visible light. The photochlorination of pyrene occurred under acidic conditions in the presence of both UV and visible light, resulting in 1-chloropyrene as the main product. Benzo[a]pyrene yielded 6-chlorobenzo[a]pyrene following visible light irradiation; however, the reaction was dependent upon solution pH. The photochlorination of PAHs was proposed to proceed via a consecutive reaction model. The rate constants associated with the photochlorination and photodecay processes were determined; the observed and theoretical values displayed similar trends, although the observed values were approximately 50-1000 times lower than the theoretical values. The lower observed values could be due to PAHs undergoing photodecay rather than photochlorination. Therefore, as photochlorination of PAHs appears to be significantly affected by solution pH, this information may help minimize the impact on the environment.
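For reference, a consecutive reaction model in its textbook form is written out below. This sketch assumes two first-order steps: the PAH is converted to its chlorinated product with rate constant k_1, and that product is lost to photodecay with rate constant k_2. The symbols, the label ClPAH, the first-order assumption, and the zero initial product concentration are illustrative choices and are not taken from the abstract.

```latex
\mathrm{PAH} \xrightarrow{\;k_1\;} \mathrm{ClPAH} \xrightarrow{\;k_2\;} \text{decay products}

\frac{d[\mathrm{PAH}]}{dt} = -k_1 [\mathrm{PAH}], \qquad
\frac{d[\mathrm{ClPAH}]}{dt} = k_1 [\mathrm{PAH}] - k_2 [\mathrm{ClPAH}]

[\mathrm{PAH}](t) = [\mathrm{PAH}]_0 \, e^{-k_1 t}, \qquad
[\mathrm{ClPAH}](t) = [\mathrm{PAH}]_0 \, \frac{k_1}{k_2 - k_1}
  \left( e^{-k_1 t} - e^{-k_2 t} \right), \quad k_1 \neq k_2
```

Under these assumptions the chlorinated product first accumulates and then declines, so fitting its concentration over time yields both rate constants.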


Subjects
Chlorinated Hydrocarbons/analysis , Light , Polycyclic Aromatic Hydrocarbons/chemistry , Chemical Water Pollutants/chemistry , Hydrogen-Ion Concentration , Theoretical Models , Photolysis , Polycyclic Aromatic Hydrocarbons/radiation effects , Salts , Solutions , Ultraviolet Rays , Chemical Water Pollutants/radiation effects
16.
BMC Bioinformatics ; 16 Suppl 10: S7, 2015.
Article in English | MEDLINE | ID: mdl-26201408

ABSTRACT

BACKGROUND: Biomedical event extraction has been a major focus of biomedical natural language processing (BioNLP) research since the first BioNLP shared task was held in 2009. Accordingly, a large number of event extraction systems have been developed. Most such systems, however, have been developed for specific tasks and/or incorporate task-specific settings, making their application to new corpora and tasks problematic without modification of the systems themselves. There is thus a need for event extraction systems that can achieve high levels of accuracy when applied to corpora in new domains, without the need for exhaustive tuning or modification, whilst retaining competitive levels of performance. RESULTS: We have enhanced our state-of-the-art event extraction system, EventMine, to alleviate the need for task-specific tuning. Task-specific details are specified in a configuration file, while extensive task-specific parameter tuning is avoided through the integration of a weighting method, a covariate shift method, and their combination. The task-specific configuration and weighting method have been employed within the context of two different sub-tasks of the BioNLP shared task 2013, i.e. Cancer Genetics (CG) and Pathway Curation (PC), removing the need to modify the system specifically for each task. With minimal task-specific configuration and tuning, EventMine achieved 1st place in the PC task and 2nd place in the CG task, achieving the highest recall for both tasks. The system has been further enhanced following the shared task by incorporating the covariate shift method and entity generalisations based on the task definitions, leading to further performance improvements. CONCLUSIONS: We have shown that it is possible to apply a state-of-the-art event extraction system to new tasks with high levels of performance, without having to modify the system internally. Both covariate shift and weighting methods are useful in facilitating the production of high recall systems. These methods and their combination can adapt a model to the target data with no deep tuning and little manual configuration.


Subjects
Gene Regulatory Networks , Genes , Information Storage and Retrieval , Theoretical Models , Natural Language Processing , Neoplasms/genetics , Neoplasms/pathology , Humans , Knowledge Bases
18.
J Biomed Inform ; 56: 94-102, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26004792

ABSTRACT

Many text mining applications in the biomedical domain benefit from automatic clustering of relational phrases into synonymous groups, since it alleviates the problem of spurious mismatches caused by the diversity of natural language expressions. Most of the previous work that has addressed this task of synonymy resolution uses similarity metrics between relational phrases based on textual strings or dependency paths, which, for the most part, ignore the context around the relations. To overcome this shortcoming, we employ a word embedding technique to encode relational phrases. We then apply the k-means algorithm on top of the distributional representations to cluster the phrases. Our experimental results show that this approach outperforms state-of-the-art statistical models including latent Dirichlet allocation and Markov logic networks.


Subjects
Data Mining/methods , Natural Language Processing , Controlled Vocabulary , Algorithms , Cluster Analysis , Factual Databases , False Positive Reactions , Fuzzy Logic , MEDLINE , Markov Chains , Medical Informatics/methods , Statistical Models , Probability , Reproducibility of Results , Semantics
19.
BMC Bioinformatics ; 16: 107, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25887686

ABSTRACT

BACKGROUND: Relation extraction is a fundamental technology in biomedical text mining. Most of the previous studies on relation extraction from biomedical literature have focused on specific or predefined types of relations, which inherently limits the types of the extracted relations. With the aim of fully leveraging the knowledge described in the literature, we address much broader types of semantic relations using a single extraction framework. RESULTS: Our system, which we name PASMED, extracts diverse types of binary relations from biomedical literature using deep syntactic patterns. Our experimental results demonstrate that it achieves a level of recall considerably higher than the state of the art, while maintaining reasonable precision. We have then applied PASMED to the whole MEDLINE corpus and extracted more than 137 million semantic relations. The extracted relations provide a quantitative understanding of what kinds of semantic relations are actually described in MEDLINE and can be ultimately extracted by (possibly type-specific) relation extraction systems. CONCLUSION: PASMED extracts a large number of relations that have previously been missed by existing text mining systems. The entire collection of the relations extracted from MEDLINE is publicly available in machine-readable form, so that it can serve as a potential knowledge base for high-level text-mining applications.


Subjects
Data Mining/methods , MEDLINE , Semantics
20.
Syst Rev ; 4: 5, 2015 Jan 14.
Article in English | MEDLINE | ID: mdl-25588314

ABSTRACT

BACKGROUND: The large and growing number of published studies, and their increasing rate of publication, make the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities. METHODS: Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings. RESULTS: The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most studies suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall). CONCLUSIONS: Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in 'live' reviews. Text mining may also be used cautiously as a 'second screener'. The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven. In highly technical/clinical areas, it may be used with a high degree of confidence; but more developmental and evaluative work is needed in other disciplines.
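To make the trade-off between workload saving and recall concrete, here is a small sketch that computes both from screening counts. The definitions used (recall as the share of relevant studies found, workload saving as the share of citations not screened manually) are common ones, but the review itself notes that workload reduction has been evaluated in several ways; the function name and the example numbers are hypothetical.

```python
def screening_metrics(total, relevant, screened, relevant_found):
    """Return (recall, workload_saving) for a semi-automated screening run.

    total          -- number of citations retrieved by the search
    relevant       -- number of truly relevant citations among them
    screened       -- number of citations a human actually screened
    relevant_found -- number of relevant citations among those screened
    """
    if total <= 0 or relevant <= 0:
        raise ValueError("total and relevant must be positive")
    recall = relevant_found / relevant
    workload_saving = 1 - screened / total
    return recall, workload_saving

# Hypothetical review: 10,000 citations, 200 relevant; the reviewers stop
# after screening the 5,000 top-ranked citations and have found 190.
recall, saving = screening_metrics(10000, 200, 5000, 190)
```

In this made-up example the reviewers save half of the manual screening effort while missing 5% of the relevant studies, which corresponds to the kind of trade-off the abstract describes.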


Subjects
Computational Biology , Data Mining/methods , Computational Biology/methods , Computational Biology/trends , Data Mining/trends , Factual Databases , Evidence-Based Medicine , Humans , Information Storage and Retrieval/trends , Publications