Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 49
Filtrar
1.
Trends Biotechnol ; 42(4): 402-417, 2024 Apr.
Artículo en Inglés | MEDLINE | ID: mdl-37858386

RESUMEN

The surge in 'Big data' has significantly influenced biomaterials research and development, with vast data volumes emerging from clinical trials, scientific literature, electronic health records, and other sources. Biocompatibility is essential in developing safe medical devices and biomaterials to perform as intended without provoking adverse reactions. Therefore, establishing an artificial intelligence (AI)-driven biocompatibility definition has become decisive for automating data extraction and profiling safety effectiveness. This definition should both reflect the attributes related to biocompatibility and be compatible with computational data-mining methods. Here, we discuss the need for a comprehensive and contemporary definition of biocompatibility and the challenges in developing one. We also identify the key elements that comprise biocompatibility, and propose an integrated biocompatibility definition that enables data-mining approaches.


Asunto(s)
Inteligencia Artificial , Materiales Biocompatibles , Minería de Datos , Registros Electrónicos de Salud
2.
Database (Oxford) ; 20232023 11 28.
Artículo en Inglés | MEDLINE | ID: mdl-38015956

RESUMEN

It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug-gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However, their systematic, large-scale exploitation is key for developing tools, impacting knowledge fields as diverse as drug design or metabolic pathway research. Previous efforts in the extraction of drug-gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we have organized the DrugProt track at BioCreative VII. In the context of the track, we have released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts, manually annotated with granular drug-gene/protein interactions. We have proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to the range of millions of documents, and generate with their predictions a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use exceeds the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. Finally, we have created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems that may arise. Thirty teams from four continents, which involved 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types. Finally, some initial explorations of the applicability of the knowledge graph have shown its potential to explore the chemical-protein relations described in the literature, or chemical compound-enzyme interactions. Database URL:  https://doi.org/10.5281/zenodo.4955410.


Asunto(s)
Minería de Datos , Reconocimiento de Normas Patrones Automatizadas , Humanos , Bases de Datos Factuales , Minería de Datos/métodos , Proteínas/metabolismo
3.
Database (Oxford) ; 20222022 10 05.
Artículo en Inglés | MEDLINE | ID: mdl-36197453

RESUMEN

The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system's ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-4/.


Asunto(s)
COVID-19 , COVID-19/epidemiología , Minería de Datos/métodos , Bases de Datos Factuales , Documentación , Humanos , Procesamiento de Lenguaje Natural
4.
Database (Oxford) ; 20222022 09 02.
Artículo en Inglés | MEDLINE | ID: mdl-36050787

RESUMEN

Monitoring drug safety is a central concern throughout the drug life cycle. Information about toxicity and adverse events is generated at every stage of this life cycle, and stakeholders have a strong interest in applying text mining and artificial intelligence (AI) methods to manage the ever-increasing volume of this information. Recognizing the importance of these applications and the role of challenge evaluations to drive progress in text mining, the organizers of BioCreative VII (Critical Assessment of Information Extraction in Biology) convened a panel of experts to explore 'Challenges in Mining Drug Adverse Reactions'. This article is an outgrowth of the panel; each panelist has highlighted specific text mining application(s), based on their research and their experiences in organizing text mining challenge evaluations. While these highlighted applications only sample the complexity of this problem space, they reveal both opportunities and challenges for text mining to aid in the complex process of drug discovery, testing, marketing and post-market surveillance. Stakeholders are eager to embrace natural language processing and AI tools to help in this process, provided that these tools can be demonstrated to add value to stakeholder workflows. This creates an opportunity for the BioCreative community to work in partnership with regulatory agencies, pharma and the text mining community to identify next steps for future challenge evaluations.


Asunto(s)
Inteligencia Artificial , Biología Computacional , Biología Computacional/métodos , Minería de Datos/métodos , Personal de Salud , Humanos , Procesamiento de Lenguaje Natural
5.
Biochim Biophys Acta Gene Regul Mech ; 1865(1): 194778, 2022 01.
Artículo en Inglés | MEDLINE | ID: mdl-34875418

RESUMEN

The regulation of gene transcription by transcription factors is a fundamental biological process, yet the relations between transcription factors (TF) and their target genes (TG) are still only sparsely covered in databases. Text-mining tools can offer broad and complementary solutions to help locate and extract mentions of these biological relationships in articles. We have generated ExTRI, a knowledge graph of TF-TG relationships, by applying a high recall text-mining pipeline to MedLine abstracts identifying over 100,000 candidate sentences with TF-TG relations. Validation procedures indicated that about half of the candidate sentences contain true TF-TG relationships. Post-processing identified 53,000 high confidence sentences containing TF-TG relationships, with a cross-validation F1-score close to 75%. The resulting collection of TF-TG relationships covers 80% of the relations annotated in existing databases. It adds 11,000 other potential interactions, including relationships for ~100 TFs currently not in public TF-TG relation databases. The high confidence abstract sentences contribute 25,000 literature references not available from other resources and offer a wealth of direct pointers to functional aspects of the TF-TG interactions. Our compiled resource encompassing ExTRI together with publicly available resources delivers literature-derived TF-TG interactions for more than 900 of the 1500-1600 proteins considered to function as specific DNA binding TFs. The obtained result can be used by curators, for network analysis and modelling, for causal reasoning or knowledge graph mining approaches, or serve to benchmark text mining strategies.


Asunto(s)
Minería de Datos , Regulación de la Expresión Génica , Minería de Datos/métodos , Factores de Transcripción/metabolismo
6.
Front Res Metr Anal ; 6: 674205, 2021.
Artículo en Inglés | MEDLINE | ID: mdl-34327299

RESUMEN

Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.

7.
Genomics Inform ; 17(2): e15, 2019 Jun.
Artículo en Inglés | MEDLINE | ID: mdl-31307130

RESUMEN

Automatically detecting mentions of pharmaceutical drugs and chemical substances is key for the subsequent extraction of relations of chemicals with other biomedical entities such as genes, proteins, diseases, adverse reactions or symptoms. The identification of drug mentions is also a prior step for complex event types such as drug dosage recognition, duration of medical treatments or drug repurposing. Formally, this task is known as named entity recognition (NER), meaning automatically identifying mentions of predefined entities of interest in running text. In the domain of medical texts, for chemical entity recognition (CER), techniques based on hand-crafted rules and graph-based models can provide adequate performance. In the recent years, the field of natural language processing has mainly pivoted to deep learning and state-of-the-art results for most tasks involving natural language are usually obtained with artificial neural networks. Competitive resources for drug name recognition in English medical texts are already available and heavily used, while for other languages such as Spanish these tools, although clearly needed were missing. In this work, we adapt an existing neural NER system, NeuroNER, to the particular domain of Spanish clinical case texts, and extend the neural network to be able to take into account additional features apart from the plain text. NeuroNER can be considered a competitive baseline system for Spanish drug and CER promoted by the Spanish national plan for the advancement of language technologies (Plan TL). PharmacoNER Tagger can be accessed at https://github.com/PlanTL-SANIDAD/PharmacoNER.

8.
J Cheminform ; 11(1): 42, 2019 Jun 24.
Artículo en Inglés | MEDLINE | ID: mdl-31236786

RESUMEN

BACKGROUND: Shared tasks and community challenges represent key instruments to promote research, collaboration and determine the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks relied on the comparison of automatically generated results against a so-called Gold Standard dataset of manually labelled textual data, regardless of efficiency and robustness of the underlying implementations. Due to the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions to semantically enrich documents in real time. To address this pressing need, a novel track called "Technical interoperability and performance of annotation servers" was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically of online biomedical named entity recognition systems of interest for medicinal chemistry applications. RESULTS: A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions during a two-month period in predefined formats and were evaluated through the BeCalm evaluation platform, specifically developed for this track. The track encompassed three levels of evaluation, i.e. data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity, encompassing overall 4,092,502 requests from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period. CONCLUSIONS: The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems. It raised the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents.

9.
Sci Rep ; 7(1): 4474, 2017 06 30.
Artículo en Inglés | MEDLINE | ID: mdl-28667284

RESUMEN

Epidemiological studies indicate that patients suffering from Alzheimer's disease have a lower risk of developing lung cancer, and suggest a higher risk of developing glioblastoma. Here we explore the molecular scenarios that might underlie direct and inverse co-morbidities between these diseases. Transcriptomic meta-analyses reveal significant numbers of genes with inverse patterns of expression in Alzheimer's disease and lung cancer, and with similar patterns of expression in Alzheimer's disease and glioblastoma. These observations support the existence of molecular substrates that could at least partially account for these direct and inverse co-morbidity relationships. A functional analysis of the sets of deregulated genes points to the immune system, up-regulated in both Alzheimer's disease and glioblastoma, as a potential link between these two diseases. Mitochondrial metabolism is regulated oppositely in Alzheimer's disease and lung cancer, indicating that it may be involved in the inverse co-morbidity between these diseases. Finally, oxidative phosphorylation is a good candidate to play a dual role by decreasing or increasing the risk of lung cancer and glioblastoma in Alzheimer's disease.


Asunto(s)
Enfermedad de Alzheimer/epidemiología , Enfermedad de Alzheimer/genética , Glioblastoma/epidemiología , Glioblastoma/genética , Neoplasias Pulmonares/epidemiología , Neoplasias Pulmonares/genética , Enfermedad de Alzheimer/metabolismo , Biomarcadores , Comorbilidad , Susceptibilidad a Enfermedades , Expresión Génica , Variación Genética , Glioblastoma/metabolismo , Humanos , Neoplasias Pulmonares/metabolismo , Modelos Biológicos , Especies Reactivas de Oxígeno/metabolismo , Transducción de Señal
10.
Nucleic Acids Res ; 45(W1): W484-W489, 2017 07 03.
Artículo en Inglés | MEDLINE | ID: mdl-28531339

RESUMEN

A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes-CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es.


Asunto(s)
Efectos Colaterales y Reacciones Adversas Relacionados con Medicamentos , Programas Informáticos , Sistema Enzimático del Citocromo P-450 , Minería de Datos , Genes , Internet , Hígado/efectos de los fármacos
11.
Chem Rev ; 117(12): 7673-7761, 2017 Jun 28.
Artículo en Inglés | MEDLINE | ID: mdl-28475312

RESUMEN

Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.


Asunto(s)
Química/métodos , Minería de Datos/métodos , Tecnología/métodos , Documentación
12.
Artículo en Inglés | MEDLINE | ID: mdl-27542845

RESUMEN

Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and proposing solutions to problems of practical interest. Most notably, community-oriented initiatives such as the BioCreative challenge have enabled controlled environments for the comparison of automatic systems while pursuing practical biomedical tasks. Under this scenario, the present work describes the Markyt Web-based document curation platform, which has been implemented to support the visualisation, prediction and benchmark of chemical and gene mention annotations at BioCreative/CHEMDNER challenge. Creating this platform is an important step for the systematic and public evaluation of automatic prediction systems and the reusability of the knowledge compiled for the challenge. Markyt was not only critical to support the manual annotation and annotation revision process but also facilitated the comparative visualisation of automated results against the manually generated Gold Standard annotations and comparative assessment of generated results. We expect that future biomedical text mining challenges and the text mining community may benefit from the Markyt platform to better explore and interpret annotations and improve automatic system predictions.Database URL: http://www.markyt.org, https://github.com/sing-group/Markyt.


Asunto(s)
Minería de Datos/métodos , Genes , Procesamiento de Lenguaje Natural , Preparaciones Farmacéuticas , Programas Informáticos , Animales , Humanos
13.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S1, 2015.
Artículo en Inglés | MEDLINE | ID: mdl-25810766

RESUMEN

Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improve the access and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). 27 teams (23 academic and 4 commercial, a total of 87 researchers) returned results for the CHEMDNER tasks: 26 teams for CEM and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact in future developments of chemical text mining applications and will form the basis to find related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.

14.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S2, 2015.
Artículo en Inglés | MEDLINE | ID: mdl-25810773

RESUMEN

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.

15.
Bioinformatics ; 31(12): 2052-3, 2015 Jun 15.
Artículo en Inglés | MEDLINE | ID: mdl-25667547

RESUMEN

MOTIVATION: Most biological processes remain only partially characterized with many components still to be identified. Given that a whole genome can usually not be tested in a functional assay, identifying the genes most likely to be of interest is of critical importance to avoid wasting resources. RESULTS: Given a set of known functionally related genes and using a state-of-the-art approach to data integration and mining, our Functional Lists (FUN-L) method provides a ranked list of candidate genes for testing. Validation of predictions from FUN-L with independent RNAi screens confirms that FUN-L-produced lists are enriched in genes with the expected phenotypes. In this article, we describe a website front end to FUN-L. AVAILABILITY AND IMPLEMENTATION: The website is freely available to use at http://funl.org


Asunto(s)
Algoritmos , Biología Computacional/métodos , Minería de Datos/métodos , Redes Reguladoras de Genes , Interferencia de ARN , Programas Informáticos , Humanos , Fenotipo
16.
Mol Biol Cell ; 25(16): 2522-36, 2014 Aug 15.
Artículo en Inglés | MEDLINE | ID: mdl-24943848

RESUMEN

The advent of genome-wide RNA interference (RNAi)-based screens puts us in the position to identify genes for all functions human cells carry out. However, for many functions, assay complexity and cost make genome-scale knockdown experiments impossible. Methods to predict genes required for cell functions are therefore needed to focus RNAi screens from the whole genome on the most likely candidates. Although different bioinformatics tools for gene function prediction exist, they lack experimental validation and are therefore rarely used by experimentalists. To address this, we developed an effective computational gene selection strategy that represents public data about genes as graphs and then analyzes these graphs using kernels on graph nodes to predict functional relationships. To demonstrate its performance, we predicted human genes required for a poorly understood cellular function-mitotic chromosome condensation-and experimentally validated the top 100 candidates with a focused RNAi screen by automated microscopy. Quantitative analysis of the images demonstrated that the candidates were indeed strongly enriched in condensation genes, including the discovery of several new factors. By combining bioinformatics prediction with experimental validation, our study shows that kernels on graph nodes are powerful tools to integrate public biological data and predict genes involved in cellular functions of interest.


Asunto(s)
Segregación Cromosómica/genética , Cromosomas/genética , Biología Computacional/métodos , Genoma , Células HeLa , Humanos , Microscopía Confocal , Mitosis , Fenotipo , Pronóstico , Interferencia de ARN , ARN Interferente Pequeño/genética , Programas Informáticos
18.
Database (Oxford) ; 2013: bat064, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-24048470

RESUMEN

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/


Asunto(s)
Investigación Biomédica , Minería de Datos , Procesamiento de Lenguaje Natural , Programas Informáticos , Humanos
19.
Database (Oxford) ; 2013: bas056, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-23327936

RESUMEN

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.


Asunto(s)
Minería de Datos , Educación , Bases de Datos como Asunto , Documentación , Humanos , Programas Informáticos , Factores de Tiempo
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA
...