
Rapid identification of PAX2/5/8 direct downstream targets in the otic vesicle by combinatorial use of bioinformatics tools.

Ramialison, Mirana; Bajoghli, Baubak; Aghaallaei, Narges; Ettwiller, Laurence; Gaudan, Sylvain; Wittbrodt, Beate; Czerny, Thomas; Wittbrodt, Joachim.

Genome Biol ; 9(10): R145, 2008 Oct 01.

Article En | MEDLINE | ID: mdl-18828907

BACKGROUND: The pax2/5/8 genes belonging to the PAX family of transcription factors are key developmental regulators that are involved in the patterning of various embryonic tissues. More particularly, their function in inner ear specification has been widely described. However, little is known about the direct downstream targets and, so far, no global approaches have been performed to identify these target genes in this particular tissue. RESULTS: Here we present an original bioinformatics pipeline composed of comparative genomics, database querying and text mining tools, which is designed to rapidly and specifically discover PAX2/5/8 direct downstream targets involved in inner ear development. We provide evidence supported by experimental validation in medaka fish that brain 2 (POU domain, class 3, transcription factor 2), claudin-7, secretory pathway component sec31-like and meteorin-like precursor are novel direct downstream targets of PAX2/5/8. CONCLUSIONS: This study illustrates the power of extensive mining of public data repositories using bioinformatics methods to provide answers for a specific biological question. It furthermore demonstrates how the usage of such a combinatorial approach is advantageous for the biologist in terms of experimentation time and costs.

Computational Biology/methods , Ear, Inner/metabolism , PAX2 Transcription Factor/metabolism , PAX5 Transcription Factor/metabolism , Paired Box Transcription Factors/metabolism , Animals , Data Mining , Databases, Genetic , Ear, Inner/embryology , Fish Proteins/genetics , Fish Proteins/metabolism , Fishes/metabolism , Genomics , Humans , PAX2 Transcription Factor/genetics , PAX5 Transcription Factor/genetics , Paired Box Transcription Factors/genetics

Integrating protein-protein interactions and text mining for protein function prediction.

Jaeger, Samira; Gaudan, Sylvain; Leser, Ulf; Rebholz-Schuhmann, Dietrich.

BMC Bioinformatics ; 9 Suppl 8: S2, 2008 Jul 22.

Article En | MEDLINE | ID: mdl-18673526

BACKGROUND: Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source for yet uncurated functional annotations, but curation work is slow and expensive. Automatic techniques that support this work still lack reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature. RESULTS: Using this procedure, more than 80% of the GO annotations for proteins with highly conserved orthologs that are available in UniProtKB/Swiss-Prot could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKB/Swiss-Prot. All predictions were correct (100% precision) according to the verifications from a trained curator. CONCLUSION: Our method of integrating conserved and connected subgraphs (CCSs) and literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.

Computational Biology/methods , Data Mining/methods , Databases, Protein , Proteins/chemistry , Proteins/physiology , Algorithms , Reproducibility of Results , Terminology as Topic

Assessment of disease named entity recognition on a corpus of annotated sentences.

Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich.

BMC Bioinformatics ; 9 Suppl 3: S3, 2008 Apr 11.

Article En | MEDLINE | ID: mdl-18426548

BACKGROUND: In recent years, the recognition of semantic types from the biomedical scientific literature has focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit), other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap, which is provided by the National Library of Medicine (NLM), is the state-of-the-art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions. RESULTS: As part of our research work, we have taken a corpus that has been delivered in the past for the identification of associations of genes to diseases based on the UMLS Metathesaurus and we have reprocessed and re-annotated the corpus. We have gathered annotations for disease entities from two curators, analyzed their disagreement (0.51 in the kappa-statistic) and composed a single annotated corpus for public use. Thereafter, three solutions for disease named entity recognition including MetaMap have been applied to the corpus to automatically annotate it with UMLS Metathesaurus concepts. The resulting annotations have been benchmarked to compare their performance. CONCLUSIONS: The annotated corpus is publicly available at ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases and can serve as a benchmark to other systems. In addition, we found that dictionary look-up already provides competitive results, indicating that the use of disease terminology is highly standardized throughout the terminologies and the literature. MetaMap generates precise results at the expense of insufficient recall, while our statistical method obtains better recall at a lower precision rate. Even better results in terms of precision are achieved by combining at least two of the three methods, but this approach again lowers recall. Altogether, our analysis gives a better understanding of the complexity of disease annotations in the literature. MetaMap and the dictionary based approach are available through the Whatizit web service infrastructure (Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008, 24:296-298).
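
The precision/recall trade-off described in the conclusions can be illustrated with a minimal sketch. It assumes annotations are represented as (sentence id, concept) pairs and that "combining at least two of the three methods" means keeping an annotation only when at least two methods propose it. The data and the names `method_a`, `method_b`, `method_c`, `precision_recall` and `vote` are invented for illustration; they are not output of MetaMap, the dictionary look-up or the statistical method, and the paper's actual combination procedure may differ.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted annotation set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def vote(method_outputs, min_votes=2):
    """Keep only the annotations proposed by at least `min_votes` methods."""
    counts = Counter(a for output in method_outputs for a in set(output))
    return {a for a, n in counts.items() if n >= min_votes}


# Toy gold standard and three hypothetical method outputs (invented data).
gold = {(1, "asthma"), (2, "diabetes"), (3, "glioma"), (4, "anemia")}
method_a = {(1, "asthma"), (2, "diabetes"), (3, "glioma"), (7, "fever")}
method_b = {(1, "asthma"), (2, "diabetes")}
method_c = {(1, "asthma"), (2, "diabetes"), (3, "glioma"),
            (4, "anemia"), (5, "cold"), (6, "pain")}

combined = vote([method_a, method_b, method_c], min_votes=2)

for name, output in [("a", method_a), ("b", method_b),
                     ("c", method_c), ("combined", combined)]:
    p, r = precision_recall(output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting the voted set drops every false positive, because no two methods agree on one, but it also loses the one gold annotation that only a single method found. That is the same direction of trade-off the abstract reports.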

Algorithms , Artificial Intelligence , Disease/classification , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Unified Medical Language System , Dictionaries as Topic , Semantics , Vocabulary, Controlled

Text processing through Web services: calling Whatizit.

Rebholz-Schuhmann, Dietrich; Arregui, Miguel; Gaudan, Sylvain; Kirsch, Harald; Jimeno, Antonio.

Bioinformatics ; 24(2): 296-8, 2008 Jan 15.

Article En | MEDLINE | ID: mdl-18006544

MOTIVATION: Text-mining (TM) solutions are developing into efficient services to researchers in the biomedical research community. Such solutions have to scale with the growing number and size of resources (e.g. available controlled vocabularies), with the amount of literature to be processed (e.g. about 17 million documents in PubMed) and with the demands of the user community (e.g. different methods for fact extraction). These demands motivated the development of a server-based solution for literature analysis. Whatizit is a suite of modules that analyse text for contained information, e.g. any scientific publication or Medline abstracts. Special modules identify terms and then link them to the corresponding entries in bioinformatics databases such as UniProtKB/Swiss-Prot data entries and gene ontology concepts. Other modules identify a set of selected annotation types like the set produced by the EBIMed analysis pipeline for proteins. In the case of Medline abstracts, Whatizit offers access to EBI's in-house installation via PMID or term query. For large quantities of the user's own text, the server can be operated in a streaming mode (http://www.ebi.ac.uk/webservices/whatizit).

Database Management Systems , Internet , MEDLINE , Natural Language Processing , Periodicals as Topic , Software , User-Computer Interface , Artificial Intelligence , Information Storage and Retrieval/methods , Vocabulary, Controlled

EBIMed--text crunching to gather facts for proteins from Medline.

Rebholz-Schuhmann, Dietrich; Kirsch, Harald; Arregui, Miguel; Gaudan, Sylvain; Riethoven, Mark; Stoehr, Peter.

Bioinformatics ; 23(2): e237-44, 2007 Jan 15.

Article En | MEDLINE | ID: mdl-17237098

UNLABELLED: To allow efficient and systematic retrieval of statements from Medline we have developed EBIMed, a service that combines document retrieval with co-occurrence-based analysis of Medline abstracts. Upon keyword query, EBIMed retrieves the abstracts from EMBL-EBI's installation of Medline and filters for sentences that contain biomedical terminology maintained in public bioinformatics resources. The extracted sentences and terminology are used to generate an overview table on proteins, Gene Ontology (GO) annotations, drugs and species used in the same biological context. All terms in retrieved abstracts and extracted sentences are linked to their entries in biomedical databases. We assessed the quality of the identification of terms and relations in the retrieved sentences. More than 90% of the protein names found indeed represented a protein. According to the analysis of four protein-protein pairs from the Wnt pathway we estimated that 37% of the statements containing such a pair mentioned a meaningful interaction and clarified the interaction of Dkk with LRP. We conclude that EBIMed improves access to information where proteins and drugs are involved in the same biological process, e.g. statements with GO annotations of proteins, protein-protein interactions and effects of drugs on proteins. AVAILABILITY: Available at http://www.ebi.ac.uk/Rebholz-srv/ebimed

Abstracting and Indexing/methods , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Proteins/classification , Software , Terminology as Topic , Algorithms , Artificial Intelligence , Database Management Systems , Proteins/chemistry , Proteins/genetics , Proteins/metabolism

Protein annotation by EBIMed.

Rebholz-Schuhmann, Dietrich; Kirsch, Harald; Arregui, Miguel; Gaudan, Sylvain; Rynbeek, Mark; Stoehr, Peter.

Nat Biotechnol ; 24(8): 902-3, 2006 Aug.

Article En | MEDLINE | ID: mdl-16900125

Abstracting and Indexing/methods , Artificial Intelligence , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Periodicals as Topic , User-Computer Interface , Database Management Systems , Vocabulary, Controlled

Distributed modules for text annotation and IE applied to the biomedical domain.

Kirsch, Harald; Gaudan, Sylvain; Rebholz-Schuhmann, Dietrich.

Int J Med Inform ; 75(6): 496-500, 2006 Jun.

Article En | MEDLINE | ID: mdl-16085453

Biological databases contain facts from scientific literature that have been curated by hand to ensure high quality. Curation is time-consuming and can be supported by information extraction methods. We present a server software infrastructure into which modules can easily be plugged to identify biologically interesting pieces of text, which are then presented to the curator in a web interface. There are modules which identify UniProt, UMLS and GO terminology, gene and protein names, mutations and protein-protein interactions. UniProt, UMLS and GO concepts are automatically linked to the original source. The module for mutations is based on syntax patterns and the one for protein-protein interactions relies on chunk parsing. All modules work as separate servers, possibly distributed on different machines, and can be combined into processing pipelines as necessary. Communication is based on XML annotated text streams, each server processing the XML elements it is designed for, and possibly adding more information in the form of XML annotation. The server and the underlying software are available to the public.
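
The idea of chaining modules that each add XML annotation to a text stream can be sketched as follows. This is a simplified, single-process illustration, not the described server software: the modules here are plain functions rather than separate servers, the tag names `protein` and `interaction` and the term lists are hypothetical, and the matching is a naive term look-up rather than syntax patterns or chunk parsing.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def make_tagger(tag, terms):
    """Build a module that wraps each known term in an XML element named `tag`."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    def module(stream):
        # Add annotation; everything else in the stream passes through unchanged.
        return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", stream)

    return module


def run_pipeline(text, modules):
    """Wrap the escaped text in a root element and pass it through each module in turn."""
    stream = f"<doc>{escape(text)}</doc>"
    for module in modules:
        stream = module(stream)
    return stream


# Hypothetical term lists; a real module would draw on curated terminology.
protein_module = make_tagger("protein", ["p53", "MDM2"])
interaction_module = make_tagger("interaction", ["binds"])

annotated = run_pipeline("p53 binds MDM2 in the nucleus.",
                         [protein_module, interaction_module])

# Parsing the result checks that the annotated stream is still well-formed XML.
elements = [(el.tag, el.text) for el in ET.fromstring(annotated)]
print(annotated)
print(elements)
```

Because every module reads and writes the same annotated stream, modules can be reordered or dropped without changing the others. The sketch assumes no term collides with a tag name, which a production system would have to guard against.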

Abstracting and Indexing/methods , Database Management Systems , Information Storage and Retrieval/methods , Natural Language Processing , Periodicals as Topic , Software , User-Computer Interface , Artificial Intelligence , Biology , Documentation/methods , Medicine , Semantics , Terminology as Topic , Vocabulary, Controlled

Computer-assisted generation of a protein-interaction database for nuclear receptors.

Albert, Sylvie; Gaudan, Sylvain; Knigge, Heidrun; Raetsch, Andreas; Delgado, Asuncion; Huhse, Bettina; Kirsch, Harald; Albers, Michael; Rebholz-Schuhmann, Dietrich; Koegl, Manfred.

Mol Endocrinol ; 17(8): 1555-67, 2003 Aug.

Article En | MEDLINE | ID: mdl-12738764

With the increasing amount of biological data available, automated methods for information retrieval become necessary. We employed computer-assisted text mining to retrieve all protein-protein interactions for nuclear receptors from MEDLINE in a systematic way. A dictionary of protein names and of terms denoting interactions was generated, and trioccurrences of two protein names and one interaction term in one sentence were retrieved. Abstracts containing at least one such trioccurrence were manually checked by biologists to select the relevant interactions out of the automatically extracted data. In total, 4360 abstracts were retrieved containing data on protein interactions for nuclear receptors. The resulting database contains all reported protein interactions involving nuclear receptors from 1966 to September 2001. Remarkably, the annual increase in number of reported interactors for nuclear receptors has been following an exponential growth curve in the years 1991 to 2001. Apparent in the data set is the high complexity of protein interactions for nuclear receptors. The number of interactions correlates with the number of published papers for a given receptor, suggesting that the number of reported interactors is a reflection of the intensity of research dedicated to a given receptor. Indeed, comparison of the retrieved data to a systematic yeast two-hybrid-based interaction analysis suggests that most NRs are similar with respect to the number of interacting proteins. The data set obtained serves as a source for information on NR interactions, as well as a reference data set for the improvement of advanced text-mining methods.
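
The trioccurrence criterion (two protein names and one interaction term in the same sentence) can be sketched as below. The dictionaries `PROTEINS` and `INTERACTION_TERMS`, the sentence splitter and the example abstract are all invented for illustration; the original dictionaries and matching rules are not given in the abstract, and the retrieved hits were checked manually by biologists, which this sketch does not model.

```python
import re
from itertools import combinations

# Hypothetical toy dictionaries standing in for the generated ones.
PROTEINS = {"RXR", "RAR", "SRC-1"}
INTERACTION_TERMS = {"binds", "interacts", "heterodimerizes"}


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def trioccurrences(abstract):
    """Return (protein, interaction term, protein) triples found within one sentence."""
    hits = []
    for sentence in split_sentences(abstract):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        proteins = sorted(tokens & PROTEINS)
        terms = sorted(tokens & INTERACTION_TERMS)
        for first, second in combinations(proteins, 2):
            for term in terms:
                hits.append((first, term, second))
    return hits


example = ("RXR heterodimerizes with RAR. SRC-1 is a coactivator. "
           "RAR binds SRC-1 in vitro.")
hits = trioccurrences(example)
print(hits)
```

The second sentence yields no hit because it names only one protein and no interaction term. Exact token matching of this kind would miss synonyms and spelling variants, which is one reason a manual check of the retrieved abstracts remains useful.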

Databases, Protein , MEDLINE , Protein Interaction Mapping , Receptors, Cytoplasmic and Nuclear , Receptors, Cytoplasmic and Nuclear/metabolism , Computers , Information Storage and Retrieval , Receptors, Cytoplasmic and Nuclear/genetics , Two-Hybrid System Techniques