1.
BMC Bioinformatics ; 15 Suppl 14: S6, 2014.
Article in English | MEDLINE | ID: mdl-25472638

ABSTRACT

Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them.
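
To make the idea of BioC as an interchange format concrete, the following is a minimal sketch of how a client might read annotations out of a BioC-style XML response. The sample document, its identifiers, and the function name `read_annotations` are invented for illustration; the element layout is a simplified rendering of BioC and is not taken from the OntoGene services themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified BioC-style collection (all values are placeholders).
BIOC_XML = """<collection>
  <source>example</source>
  <date>placeholder</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>OxyR regulates katG expression.</text>
      <annotation id="T1">
        <infon key="type">gene</infon>
        <location offset="15" length="4"/>
        <text>katG</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, type, start, end, surface text) per annotation."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            # Annotation offsets are document-level; subtract the passage offset.
            base = int(passage.findtext("offset"))
            text = passage.findtext("text")
            for ann in passage.iter("annotation"):
                loc = ann.find("location")
                start = int(loc.get("offset")) - base
                end = start + int(loc.get("length"))
                ann_type = ann.findtext("infon[@key='type']")
                rows.append((doc_id, ann_type, start, end, text[start:end]))
    return rows


print(read_annotations(BIOC_XML))
```

Recovering the surface string from the offsets, rather than trusting the annotation's own text element, is a cheap consistency check when exchanging documents between services.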


Subject(s)
Data Mining; Drug Discovery; Knowledge Discovery; Software
2.
J Biomed Inform ; 45(5): 851-61, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22580177

ABSTRACT

The mutual interactions among genes, diseases, and drugs are at the heart of biomedical research, and are especially important for the pharmacological industry. The recent trend towards personalized medicine makes it increasingly relevant to be able to tailor drugs to specific genetic makeups. The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) aims at capturing relevant information about such interactions from several sources, including curation of the biomedical literature. Advanced text mining tools which can support the process of manual curation are increasingly necessary in order to cope with the deluge of new published results. However, effective evaluation of those tools requires the availability of manually curated data as gold standard. In this paper we discuss how the existing PharmGKB database can be used for such an evaluation task in a way similar to the usage of gold standard data derived from protein-protein interaction databases in one of the recent BioCreative shared tasks. Additionally, we present our own considerations and results on the feasibility and difficulty of such a task.


Subject(s)
Computational Biology/methods; Data Mining/methods; Knowledge Bases; Pharmacogenetics/methods; Abstracting and Indexing; Animals; Databases, Factual; Humans
3.
BMC Bioinformatics ; 12 Suppl 8: S13, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151872

ABSTRACT

BACKGROUND: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT). RESULTS: Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature, and listing them with a competitive ranking (AUC iP/R > 0.5). CONCLUSIONS: The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have become by now an integral part of advanced text mining approaches.


Subject(s)
Data Mining/methods; Proteins/metabolism; Computational Biology/methods; Natural Language Processing; Protein Interaction Mapping/methods
4.
BMC Bioinformatics ; 12 Suppl 8: S4, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151968

ABSTRACT

BACKGROUND: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested. RESULTS: A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of the correct gene and its identifier, such as contextual information to assist in disambiguation. DISCUSSION: The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
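
Since task performance here is measured by precision and recall over gene identifiers, a small worked example may help. This is an illustrative sketch only: the identifiers are placeholders and the function `precision_recall` is a hypothetical helper, not part of any BioCreative scoring tool.

```python
def precision_recall(predicted_ids, gold_ids):
    """Set-based precision and recall over normalized gene identifiers."""
    predicted, gold = set(predicted_ids), set(gold_ids)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers: what a system proposed vs. what a curator accepted.
system_output = ["GENE:1", "GENE:2", "GENE:3", "GENE:4"]
curated = ["GENE:2", "GENE:4", "GENE:5"]

p, r = precision_recall(system_output, curated)
print(f"precision={p:.2f} recall={r:.2f}")
```

A long candidate list of identifiers, as described in the abstract, tends to lower precision even when recall stays acceptable, which is why the two figures are reported together.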


Subject(s)
Data Mining/methods; Genes; Animals; Computational Biology/methods; Periodicals as Topic; Plants/genetics; Plants/metabolism
5.
Article in English | MEDLINE | ID: mdl-27402677

ABSTRACT

Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.
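
The idea of giving credit at several levels (entities, functions, relations) can be sketched as follows. This is a deliberately simplified illustration and not the official track 4 scoring: the regular expressions, the short relation list, and the helper names `levels` and `partial_credit` are assumptions made for the example, and the two BEL statements are invented.

```python
import re

# Simplified patterns (assumptions): namespace:name entities and function calls.
ENTITY = re.compile(r"[A-Za-z]+:[A-Za-z0-9]+")
FUNCTION = re.compile(r"([A-Za-z]+)\(")
RELATIONS = {"increases", "decreases", "directlyIncreases", "directlyDecreases"}


def levels(statement):
    """Split a BEL-like statement into sets of entities, functions, relations."""
    return {
        "entity": set(ENTITY.findall(statement)),
        "function": set(FUNCTION.findall(statement)),
        "relation": {tok for tok in statement.split() if tok in RELATIONS},
    }


def f_score(predicted, gold):
    """Balanced F-score between two sets (0.0 when there is no overlap)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    return 2 * true_positives / (len(predicted) + len(gold))


def partial_credit(predicted_stmt, gold_stmt):
    """Score a predicted statement separately at each level."""
    pred, gold = levels(predicted_stmt), levels(gold_stmt)
    return {level: f_score(pred[level], gold[level]) for level in gold}


gold = "p(HGNC:AKT1) increases act(p(HGNC:GSK3B))"
predicted = "p(HGNC:AKT1) decreases p(HGNC:GSK3B)"
scores = partial_credit(predicted, gold)
print(scores)
```

In this toy case the prediction finds both entities, misses one function, and gets the relation wrong, so it earns full credit at the entity level and none at the relation level instead of being counted as a single failure.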


Subject(s)
Data Mining/methods; Databases, Factual; Programming Languages; Humans
6.
J Am Med Inform Assoc ; 22(5): 948-56, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25948699

ABSTRACT

OBJECTIVE: To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. RESULTS: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. DISCUSSION: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. CONCLUSION: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated.
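
Inter-annotator agreement expressed as an F-score can be computed pairwise and then summarized by the median, as in the sketch below. The annotation tuples and annotator labels are invented, and exact matching on (start, end, concept) is an assumption; the abstract does not state the matching criterion used.

```python
from itertools import combinations
from statistics import median


def f_score(a, b):
    """Symmetric F-score between two annotation sets: 2*|A∩B| / (|A|+|B|)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_agreement(annotations_by_annotator):
    """Return the F-score for every annotator pair and the median of those scores."""
    scores = {
        (x, y): f_score(annotations_by_annotator[x], annotations_by_annotator[y])
        for x, y in combinations(sorted(annotations_by_annotator), 2)
    }
    return scores, median(scores.values())


# Hypothetical annotations: (start offset, end offset, concept identifier).
annotations = {
    "A": {(0, 5, "C1"), (10, 15, "C2"), (20, 25, "C3")},
    "B": {(0, 5, "C1"), (10, 15, "C2")},
    "C": {(0, 5, "C1"), (20, 25, "C3"), (30, 35, "C4")},
}

pair_scores, median_f = pairwise_agreement(annotations)
print(pair_scores, median_f)
```

The same function can score each annotator against a gold standard by passing the gold set as one side of the pair, which is how the two kinds of figures mentioned in the abstract become comparable.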


Subject(s)
Information Storage and Retrieval/methods; Multilingualism; Natural Language Processing; Terminology as Topic; Semantics; Unified Medical Language System
7.
Article in English | MEDLINE | ID: mdl-24903516

ABSTRACT

Given the current explosion of data within original publications generated in the field of genomics, a recognized bottleneck is the transfer of such knowledge into comprehensive databases. We have for years organized knowledge on transcriptional regulation reported in the original literature of Escherichia coli K-12 into RegulonDB (http://regulondb.ccg.unam.mx), our database that is currently supported by >5000 papers. Here, we report a first step towards the automatic biocuration of growth conditions in this corpus. Using the OntoGene text-mining system (http://www.ontogene.org), we extracted and manually validated regulatory interactions and growth conditions in a new approach based on filters that enable the curator to select informative sentences from preprocessed full papers. Based on a set of 48 papers dealing with oxidative stress by OxyR, we were able to retrieve 100% of the OxyR regulatory interactions present in RegulonDB, including the transcription factors and their effect on target genes. Our strategy was designed to extract, as we did, their growth conditions. This result provides a proof of concept for a more direct and efficient curation process, and enables us to define the strategy of the subsequent steps to be implemented for a semi-automatic curation of original literature dealing with regulation of gene expression in bacteria. This project will enhance the efficiency and quality of the curation of knowledge present in the literature of gene regulation, and contribute to a significant increase in the encoding of the regulatory network of E. coli. RegulonDB Database URL: http://regulondb.ccg.unam.mx OntoGene URL: http://www.ontogene.org.


Subject(s)
Data Mining; Escherichia coli K12/growth & development; Escherichia coli K12/genetics; Escherichia coli Proteins/metabolism; Gene Expression Regulation, Bacterial; Gene Regulatory Networks; Repressor Proteins/metabolism; Databases, Genetic; Escherichia coli K12/metabolism; Regulon/genetics; Semantics; Terminology as Topic
8.
Database (Oxford) ; 2013: bas053, 2013.
Article in English | MEDLINE | ID: mdl-23396322

ABSTRACT

In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.
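
One simple way to combine a baseline retrieval score with evidence from a relation mining component is a weighted sum of normalized scores. The abstract does not say how the two signals were actually combined, so everything below (min-max normalization, the `weight` parameter, the function names, and the scores) is an assumption chosen only to illustrate the general idea of re-ranking.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def combined_ranking(baseline, evidence, weight=0.5):
    """Rank documents by a weighted sum of baseline and evidence scores."""
    base_n, ev_n = min_max(baseline), min_max(evidence)
    combined = {
        doc: (1 - weight) * base_n[doc] + weight * ev_n.get(doc, 0.0)
        for doc in base_n
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores: retrieval baseline and relation-mining confidence.
baseline_scores = {"d1": 2.0, "d2": 1.5, "d3": 0.5}
evidence_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.2}

print(combined_ranking(baseline_scores, evidence_scores, weight=0.0))
print(combined_ranking(baseline_scores, evidence_scores, weight=0.5))
```

With `weight=0.0` the baseline order is returned unchanged, which makes it easy to check that any gain comes from the added evidence rather than from the normalization step.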


Subject(s)
Data Mining/methods; Molecular Sequence Annotation/methods; Databases, Genetic; Information Storage and Retrieval; Toxicogenetics; User-Computer Interface
9.
J Biomed Semantics ; 3 Suppl 3: S5, 2012 Oct 05.
Article in English | MEDLINE | ID: mdl-23046495

ABSTRACT

BACKGROUND: Among the key pieces of information which biomedical text mining systems are expected to extract from the literature are interactions among different types of biomedical entities (proteins, genes, diseases, drugs, etc.). Several large resources of curated relations between biomedical entities are currently available, such as the Pharmacogenomics Knowledge Base (PharmGKB) or the Comparative Toxicogenomics Database (CTD). Biomedical text mining systems, and in particular those which deal with the extraction of relationships among entities, could make better use of the wealth of already curated material. RESULTS: We propose a simple and effective method based on logistic regression (also known as maximum entropy modeling) for an optimized ranking of relation candidates utilizing curated abstracts. Furthermore, we examine the effects and difficulties of using widely available metadata (i.e. MeSH terms and chemical substance index terms) for relation extraction. Cross-validation experiments result in an improvement of the ranking quality in terms of AUCiP/R by 39% (PharmGKB) and 116% (CTD) against a frequency-based baseline of 0.39 (PharmGKB) and 0.21 (CTD). For the TAP-10 metrics, we achieve an improvement of 53% (PharmGKB) and 134% (CTD) against the same baseline system (0.21 PharmGKB and 0.15 CTD). CONCLUSIONS: Our experiments with the PharmGKB and the CTD databases show a strong positive effect for the ranking of relation candidates utilizing the vast amount of curated relations covered by currently available knowledge databases. The tasks of concept identification and candidate relation generation profit from the adaptation to previously curated material. This presents an effective and practical method suitable for conservative extension and re-validation of biomedical relations from texts that has been successfully used for curation experiments with the PharmGKB and CTD databases.
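
Ranking relation candidates with logistic regression can be sketched in a few lines: a model is trained on candidates labeled by whether a curated database contains them, and new candidates are then sorted by predicted probability. The two features (a relative frequency and a flag for prior curation evidence), the training data, and the candidate names below are all invented; the paper's actual feature set is not given in the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, weights, bias):
    """Predicted probability that a candidate relation is correct."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train(features, labels, epochs=200, rate=0.5):
    """Fit logistic regression by plain stochastic gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = score(x, weights, bias) - y
            bias -= rate * error
            weights = [w - rate * error * v for w, v in zip(weights, x)]
    return weights, bias


def rank(candidates, weights, bias):
    """Sort candidate names by descending predicted probability."""
    scored = [(score(x, weights, bias), name) for name, x in candidates]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical features: [relative frequency, pair seen in curated material].
X = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 1], [0.3, 0], [0.1, 0]]
y = [1, 1, 0, 1, 0, 0]  # 1 = relation present in the curated database

weights, bias = train(X, y)
candidates = [("A-B", [0.9, 0]), ("C-D", [0.2, 1]), ("E-F", [0.5, 1])]
ranking = rank(candidates, weights, bias)
print(ranking)
```

In this toy data a frequency-only ranking would put the pair A-B first, whereas the trained model ranks it last because the curation feature carries most of the signal; this mirrors, in miniature, the comparison against a frequency-based baseline.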

10.
Database (Oxford) ; 2012: bas021, 2012.
Article in English | MEDLINE | ID: mdl-22529178

ABSTRACT

The need for efficient text-mining tools that support curation of the biomedical literature is ever increasing. In this article, we describe an experiment aimed at verifying whether a text-mining tool capable of extracting meaningful relationships among domain entities can be successfully integrated into the curation workflow of a major biological database. We evaluate in particular (i) the usability of the system's interface, as perceived by users, and (ii) the correlation of the ranking of interactions, as provided by the text-mining system, with the choices of the curators.


Subject(s)
Data Mining/methods; Database Management Systems; Databases, Factual; Pharmacogenetics; Abstracting and Indexing; Biomedical Research; Reproducibility of Results; User-Computer Interface
11.
Article in English | MEDLINE | ID: mdl-20671319

ABSTRACT

We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation in the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.


Subject(s)
Computational Biology/methods; Data Mining/methods; Protein Interaction Mapping/methods; Databases, Bibliographic; Natural Language Processing; Societies, Scientific
12.
Genome Biol ; 9 Suppl 2: S13, 2008.
Article in English | MEDLINE | ID: mdl-18834491

ABSTRACT

BACKGROUND: Research scientists and companies working in the domains of biomedicine and genomics are increasingly faced with the problem of efficiently locating, within the vast body of published scientific findings, the critical pieces of information that are needed to direct current and future research investment. RESULTS: In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation. CONCLUSION: Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.
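
A pattern matching approach to detecting experimental methods can be illustrated with a handful of regular expressions. The three patterns, the method labels, and the example sentence are assumptions made for this sketch; they are not the patterns used in the BioCreative submission.

```python
import re

# Illustrative patterns only: method label -> regular expression.
METHOD_PATTERNS = {
    "two hybrid": re.compile(r"\b(?:yeast\s+)?two[-\s]hybrid\b", re.I),
    "coimmunoprecipitation": re.compile(r"\bco-?immunoprecipitat\w*", re.I),
    "pull down": re.compile(r"\bpull[-\s]?down\b", re.I),
}


def detect_methods(text):
    """Return (method, mention count) pairs, most frequently mentioned first."""
    counts = {
        name: len(pattern.findall(text))
        for name, pattern in METHOD_PATTERNS.items()
    }
    found = [(name, count) for name, count in counts.items() if count]
    return sorted(found, key=lambda item: (-item[1], item[0]))


sentence = (
    "The interaction was confirmed by yeast two-hybrid screening and by "
    "co-immunoprecipitation; a second two-hybrid assay gave the same result."
)
methods = detect_methods(sentence)
print(methods)
```

Ranking by mention count is only one possible ordering; it is used here because it needs no training data and keeps the sketch short.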


Subject(s)
Computational Biology/methods; Genes; Societies, Scientific; Abstracting and Indexing; Internet; Protein Interaction Mapping; Reproducibility of Results