Results 1 - 11 of 11
1.
Bioinformatics ; 32(2): 276-82, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26428294

ABSTRACT

MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. RESULTS: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. AVAILABILITY AND IMPLEMENTATION: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. CONTACT: sukaew@utu.fi.
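The F-scores reported above combine precision and recall into a single number. The following is a minimal sketch of that calculation, assuming exact-span matching and the balanced F1 variant, neither of which the abstract specifies. The mention tuples are hypothetical and are not drawn from Gellus or CLL.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1) from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cell line mentions as (document id, start offset, end offset).
gold = {("doc1", 0, 4), ("doc1", 20, 26), ("doc2", 5, 9)}
predicted = {("doc1", 0, 4), ("doc2", 5, 9), ("doc2", 30, 35)}

tp = len(gold & predicted)   # mentions found with exactly matching spans
fp = len(predicted - gold)   # predicted mentions absent from the gold standard
fn = len(gold - predicted)   # gold mentions the tagger missed
score = f_score(tp, fp, fn)
print(f"precision and recall combined: F1 = {score:.4f}")
```

A looser matching criterion (for example overlapping spans) would change the counts but not the formula.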


Subject(s)
Data Mining/methods , Databases, Factual , Genes, Lethal , Neoplasms/pathology , Terminology as Topic , Artificial Intelligence , Cell Line , Computational Biology/methods , Humans , Information Storage and Retrieval , Machine Learning , Neoplasms/genetics , Publications , Semantics
2.
BMC Bioinformatics ; 17: 18, 2016 Jan 05.
Article in English | MEDLINE | ID: mdl-26729218

ABSTRACT

BACKGROUND: Differential networks have recently been introduced as a powerful way to study the dynamic rewiring capabilities of an interactome in response to changing environmental conditions or stimuli. Currently, such differential networks are generated and visualised using ad hoc methods, and are often limited to the analysis of only one condition-specific response or one interaction type at a time. RESULTS: In this work, we present a generic, ontology-driven framework to infer, visualise and analyse an arbitrary set of condition-specific responses against one reference network. To this end, we have implemented novel ontology-based algorithms that can process highly heterogeneous networks, accounting for both physical interactions and regulatory associations, symmetric and directed edges, edge weights and negation. We propose this integrative framework as a standardised methodology that allows a unified view on differential networks and promotes comparability between differential network studies. As an illustrative application, we demonstrate its usefulness in a plant abiotic stress study and experimentally confirm a predicted regulator. AVAILABILITY: Diffany is freely available as an open-source Java library and Cytoscape plugin from http://bioinformatics.psb.ugent.be/supplementary_data/solan/diffany/.
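The basic idea of a differential network, comparing a condition-specific network against a reference, can be illustrated with a deliberately simplified sketch. This is not Diffany's ontology-driven algorithm: it assumes a single interaction type, directed weighted edges, and treats a missing edge as weight 0. The node names and weights are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # directed edge: (source node, target node)


def differential_network(
    reference: Dict[Edge, float],
    condition: Dict[Edge, float],
    tolerance: float = 1e-9,
) -> Dict[Edge, float]:
    """Return the edges whose weight differs between condition and reference.

    A positive value means the edge is stronger (or new) under the condition;
    a negative value means it is weaker (or lost).
    """
    diff: Dict[Edge, float] = {}
    for edge in sorted(set(reference) | set(condition)):
        delta = condition.get(edge, 0.0) - reference.get(edge, 0.0)
        if abs(delta) > tolerance:
            diff[edge] = delta
    return diff


# Hypothetical reference and stress-condition networks.
reference = {("A", "B"): 1.0, ("B", "C"): 0.5}
condition = {("A", "B"): 1.0, ("B", "C"): 0.8, ("C", "D"): 0.4}

rewiring = differential_network(reference, condition)
for edge, delta in rewiring.items():
    print(edge, f"{delta:+.2f}")
```

Handling several interaction types, symmetric edges and negation, as the abstract describes, would need a richer edge representation than a single float.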


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Models, Molecular , Algorithms , Arabidopsis/chemistry , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Calcium-Binding Proteins/genetics , Calcium-Binding Proteins/metabolism , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Gene Expression Regulation, Plant , Systems Biology , Transcription Factors/genetics , Transcription Factors/metabolism
3.
Plant Cell ; 25(3): 794-807, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23532071

ABSTRACT

Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies.


Subject(s)
Arabidopsis/genetics , Data Mining/methods , Genes, Plant , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Cell Cycle , Computational Biology/methods , Multigene Family , Protein Interaction Mapping , Protein Interaction Maps , PubMed , Reproducibility of Results
4.
BMC Bioinformatics ; 16 Suppl 16: S3, 2015.
Article in English | MEDLINE | ID: mdl-26551766

ABSTRACT

BACKGROUND: Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents, can be utilized to improve the performance of state-of-the-art event extraction systems. RESULTS: In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. CONCLUSIONS: For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates.


Subject(s)
Data Mining , Gene Regulatory Networks , Molecular Sequence Annotation , Natural Language Processing
5.
Plant Physiol ; 165(2): 519-527, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24710067

ABSTRACT

In vitro stress assays are commonly used to study the responses of plants to abiotic stress and to assess stress tolerance. A literature review reveals that most studies use very high stress levels and measure criteria such as germination, plant survival, or the development of visual symptoms such as bleaching. However, we show that these parameters are indicators of very severe stress, and such studies thus only provide incomplete information about stress sensitivity in Arabidopsis (Arabidopsis thaliana). Similarly, transcript analysis revealed that typical stress markers are only induced at high stress levels in young seedlings. Therefore, tools are needed to study the effects of mild stress. We found that the commonly used stress-inducing agents mannitol, sorbitol, NaCl, and hydrogen peroxide impact shoot growth in a highly specific and dose-dependent way. Therefore, shoot growth is a sensitive, relevant, and easily measured phenotype to assess stress tolerance over a wide range of stress levels. Finally, our data suggest that care should be taken when using mannitol as an osmoticum.

6.
Environ Microbiol ; 15(8): 2147-53, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23826978

ABSTRACT

With the advent of next-generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. To illustrate the versatility of the platform, different case studies are presented demonstrating how pico-PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains.


Subject(s)
Databases, Genetic/standards , Databases, Genetic/trends , Eukaryota/genetics , Genomics , Chlorophyta/genetics , DNA Barcoding, Taxonomic , Diatoms/genetics , Genetic Variation , Genome, Plant/genetics
7.
BMC Bioinformatics ; 13 Suppl 11: S6, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759460

ABSTRACT

BACKGROUND: Text mining tools have gained popularity to process the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. RESULTS: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving high-precision and high-recall results, respectively. Finally, we highlight extremely high-performance results (F-score > 90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. CONCLUSIONS: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
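The hybrid system's intersection and union combinations can be illustrated with plain set operations. This is only a sketch of the general principle, assuming each system outputs a set of (gene, term) relation pairs. The pairs and the gold standard below are hypothetical, and the real frameworks are far more involved.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # (gene symbol, domain term)


def precision_recall(predicted: Set[Relation], gold: Set[Relation]) -> Tuple[float, float]:
    """Precision and recall of a set of predicted relations against a gold standard."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and outputs of two relation extraction systems.
gold = {("g1", "t1"), ("g2", "t2"), ("g3", "t3"), ("g4", "t4")}
system_a = {("g1", "t1"), ("g2", "t2"), ("g5", "t5")}
system_b = {("g1", "t1"), ("g3", "t3"), ("g6", "t6")}

# Intersection keeps only relations both systems agree on: fewer errors, more misses.
intersection = system_a & system_b
# Union keeps every relation either system proposes: fewer misses, more errors.
union = system_a | system_b

print("intersection:", precision_recall(intersection, gold))
print("union:       ", precision_recall(union, gold))
```

In this toy data the intersection reaches higher precision and the union higher recall, which matches the trade-off described in the abstract.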


Subject(s)
Data Mining , Genes , Artificial Intelligence , Databases, Factual , PubMed
8.
BMC Bioinformatics ; 12: 481, 2011 Dec 18.
Article in English | MEDLINE | ID: mdl-22177292

ABSTRACT

BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.


Subject(s)
Data Mining , Computer Systems , Periodicals as Topic , Software
9.
Bioinformatics ; 26(18): i554-60, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20823321

ABSTRACT

MOTIVATION: In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. RESULTS: We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding of the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. AVAILABILITY: The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).


Subject(s)
Algorithms , Data Mining/methods , Artificial Intelligence , Gene Expression , Metabolism , Phosphorylation , Protein Binding , Transcription, Genetic
10.
PLoS One ; 8(4): e55814, 2013.
Article in English | MEDLINE | ID: mdl-23613707

ABSTRACT

Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons - Attribution - Share Alike (CC BY-SA) license.


Subject(s)
Data Mining , Genes , Publications , Algorithms , Multigene Family , Reference Standards , Signal Transduction/genetics , Statistics as Topic
11.
Adv Bioinformatics ; 2012: 582765, 2012.
Article in English | MEDLINE | ID: mdl-22719757

ABSTRACT

Technological advancements in the field of genetics have not only led to an abundance of experimental data, but also caused an exponential increase in the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
