Results 1 - 11 of 11
1.
Mol Cell Proteomics ; 10(11): M111.010629, 2011 Nov.
Article in English | MEDLINE | ID: mdl-21836163

ABSTRACT

Information about the physical association of proteins is extensively used for studying cellular processes and disease mechanisms. However, complete experimental mapping of the human interactome will remain prohibitively difficult in the near future. Here we present a map of predicted human protein interactions that distinguishes functional association from physical binding. Our network classifies more than 5 million protein pairs predicting 94,009 new interactions with high confidence. We experimentally tested a subset of these predictions using yeast two-hybrid analysis and affinity purification followed by quantitative mass spectrometry. Thus we identified 462 new protein-protein interactions and confirmed the predictive power of the network. These independent experiments address potential issues of circular reasoning and are a distinctive feature of this work. Analysis of the physical interactome unravels subnetworks mediating between different functional and physical subunits of the cell. Finally, we demonstrate the utility of the network for the analysis of molecular mechanisms of complex diseases by applying it to genome-wide association studies of neurodegenerative diseases. This analysis provides new evidence implying TOMM40 as a factor involved in Alzheimer's disease. The network provides a high-quality resource for the analysis of genomic data sets and genetic association studies in particular. Our interactome is available via the hPRINT web server at: www.print-db.org.


Subject(s)
Computer Simulation , Models, Molecular , Protein Interaction Mapping/methods , Algorithms , Animals , Bayes Theorem , HeLa Cells , Humans , Mice , Neurodegenerative Diseases/genetics , Neurodegenerative Diseases/metabolism , Protein Interaction Domains and Motifs , Protein Interaction Maps , Proteome/genetics , Proteome/metabolism , ROC Curve , Recombinant Proteins/metabolism , Statistics, Nonparametric
2.
Bioinformatics ; 27(19): 2769-71, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21813477

ABSTRACT

SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.


Subject(s)
Data Mining , Gene Library , Electronic Data Processing , Genes , Internet , Proteins , Publishing , Terminology as Topic
3.
Nucleic Acids Res ; 37(Web Server issue): W300-4, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19465383

ABSTRACT

High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.


Subject(s)
Genes , Software , Animals , Bone Resorption/genetics , Gene Expression Profiling , Humans , Mice , Mutation , Oligonucleotide Array Sequence Analysis , Osteoporosis/genetics , Pancreatic Neoplasms/genetics , Porphyria, Hepatoerythropoietic/genetics , PubMed , Rats , Vocabulary, Controlled
4.
Brief Bioinform ; 9(6): 466-78, 2008 Nov.
Article in English | MEDLINE | ID: mdl-19060303

ABSTRACT

The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time consuming, and does not scale with the ever increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale-up high-quality manual curation.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genes , Information Storage and Retrieval/methods , Abstracting and Indexing , Animals , Databases, Bibliographic , Humans , Knowledge , Semantics
5.
BMC Bioinformatics ; 10 Suppl 8: S3, 2009 Aug 27.
Article in English | MEDLINE | ID: mdl-19758467

ABSTRACT

BACKGROUND: The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets. RESULTS: We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Vice versa, we could show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, of which none had been annotated in the UniProt or PDB database. In 71% of cases, the reported phenotypes were in compliance with the model predictions, supporting a relation between mutations and stability issues in membrane proteins. CONCLUSION: We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.
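
As an illustration only, regular-expression-based retrieval of point mutations, followed by a sequence check, might look roughly like the sketch below. The pattern, the function names `find_point_mutations` and `consistent_with_sequence`, and the toy sequence are assumptions made for this example; they are not the rules used in the published pipeline, which the abstract does not spell out.

```python
import re

# Standard three-letter to one-letter amino acid abbreviations.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern: wild-type residue, position, mutant residue,
# written in one-letter ("A123T") or three-letter ("Ala123Thr") notation.
MUTATION = re.compile(
    rf"\b(?:(?P<w1>[{ONE}])(?P<p1>[1-9]\d*)(?P<m1>[{ONE}])"
    rf"|(?P<w3>{THREE})(?P<p3>[1-9]\d*)(?P<m3>{THREE}))\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    hits = []
    for m in MUTATION.finditer(text):
        if m.group("w1"):
            hits.append((m.group("w1"), int(m.group("p1")), m.group("m1")))
        else:
            hits.append((THREE_TO_ONE[m.group("w3")], int(m.group("p3")),
                         THREE_TO_ONE[m.group("m3")]))
    return hits


def consistent_with_sequence(mutation, sequence):
    """Sequence check: is the wild-type residue found at that position?"""
    wild_type, position, _ = mutation
    return position <= len(sequence) and sequence[position - 1] == wild_type


# Toy example with a made-up four-residue sequence.
mutations = find_point_mutations("The A2T and Gly4Asp variants were studied.")
linked = [m for m in mutations if consistent_with_sequence(m, "MAGG")]
```

A bare pattern like this also matches unrelated strings of the same shape (for example, some cell line or strain names), which is one reason a check against the candidate protein's sequence is useful for linking a mutation to its protein.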


Subject(s)
Computational Biology/methods , Information Storage and Retrieval/methods , Membrane Proteins/genetics , Mutation , Algorithms , Amino Acid Substitution , Animals , Databases, Genetic , Genes , Genomics , Humans , Membrane Proteins/chemistry , Models, Genetic , Pattern Recognition, Automated , Periodicals as Topic , Phenotype , Point Mutation , Protein Stability , PubMed , Sequence Analysis
6.
Bioinformatics ; 24(16): i126-132, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18689813

ABSTRACT

MOTIVATION: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. RESULTS: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. AVAILABILITY: A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. SUPPLEMENTARY INFORMATION: The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/
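
The F-measure quoted in this abstract can be checked against the stated precision and recall. The short sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, beta = 1), which the abstract does not state explicitly; the function name `f_measure` is chosen for this example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall quoted in the abstract for the 13-species benchmark.
score = f_measure(0.908, 0.738)
print(f"{score:.1%}")  # 81.4%
```

Under that assumption, 90.8% precision at 73.8% recall gives 81.4%, which agrees with the reported figure.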


Subject(s)
Algorithms , Artificial Intelligence , Databases, Genetic , Genes/genetics , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Periodicals as Topic , Software , Species Specificity
7.
Bioinformatics ; 22(19): 2444-5, 2006 Oct 01.
Article in English | MEDLINE | ID: mdl-16870931

ABSTRACT

UNLABELLED: The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein-protein interactions, gene-disease associations and subcellular locations of proteins. When searching such information using conventional search engines, e.g. PubMed, users see the data only one abstract at a time and 'hidden' in natural language text. AliBaba is an interactive tool for graphical summarization of search results. It parses the set of abstracts that fit a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. AliBaba extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp complex networks described in various articles at a glance. AVAILABILITY: http://alibaba.informatik.hu-berlin.de/


Subject(s)
Abstracting and Indexing/methods , Algorithms , Computer Graphics , Natural Language Processing , Periodicals as Topic , Software , User-Computer Interface , Database Management Systems , Information Storage and Retrieval/methods , Vocabulary, Controlled
8.
BMC Bioinformatics ; 6 Suppl 1: S9, 2005.
Article in English | MEDLINE | ID: mdl-15960843

ABSTRACT

In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
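
To make the feature types named in this abstract more concrete, the sketch below builds sliding-window features for a single token: prefixes and suffixes, character n-grams, and a capitalization pattern. The feature names, the window width, the `<PAD>` marker and all function names are assumptions made for illustration; the sketch does not reproduce the authors' feature set, the Support Vector Machine, or the recursive feature elimination step.

```python
def capitalization_pattern(token):
    """Collapse a token to its shape: A upper, a lower, 0 digit, - other."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "A"
        elif ch.islower():
            symbol = "a"
        elif ch.isdigit():
            symbol = "0"
        else:
            symbol = "-"
        # Collapse runs of the same symbol, so "BRCA1" becomes "A0".
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def char_ngrams(token, n=3):
    """All contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def window_features(tokens, index, width=1):
    """Features for tokens[index] and its neighbours within the window."""
    features = {}
    for offset in range(-width, width + 1):
        pos = index + offset
        # Positions outside the sentence are filled with a padding marker.
        tok = tokens[pos] if 0 <= pos < len(tokens) else "<PAD>"
        features[f"{offset}:shape"] = capitalization_pattern(tok)
        features[f"{offset}:prefix2"] = tok[:2]
        features[f"{offset}:suffix2"] = tok[-2:]
    for gram in char_ngrams(tokens[index]):
        features[f"0:ngram={gram}"] = 1
    return features


features = window_features(["The", "BRCA1", "gene"], index=1)
```

Each token then corresponds to one feature dictionary, which a word classifier could consume after vectorization; a feature-elimination procedure would operate on the resulting feature columns.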


Subject(s)
Computational Biology/methods , Genes , Pattern Recognition, Automated/classification , Pattern Recognition, Automated/methods , Recognition, Psychology , Terminology as Topic
9.
Curr Pharm Biotechnol ; 12(3): 449-57, 2011 Mar 01.
Article in English | MEDLINE | ID: mdl-21133848

ABSTRACT

Huge volumes of data, produced by microarrays and next-generation sequencing, are now at the fingertips of scientists and allow the scope to be expanded beyond conventional drug design. New promiscuous drugs directed at multiple targets promise increased therapeutic efficacy for the treatment of multifactorial diseases. At the same time, more systematic tests for unwanted side effects are now possible. In this paper, we focus on the application of text mining and ontologies to support experimental drug discovery. Text mining is a high-throughput technique to extract information from millions of scientific documents and web pages. By exploiting the vast number of extracted facts as well as the indirect links between them, text mining and ontologies help to generate new hypotheses on drug-target interactions. We review the latest applications of text mining and ontologies suitable for target and drug-target interaction discovery in addition to conventional approaches. We conclude that mining the literature on drugs and proteins offers unique opportunities to support the laborious and expensive process of drug development.


Subject(s)
Data Mining/methods , Databases, Factual , Drug Discovery , High-Throughput Screening Assays , Information Storage and Retrieval , Computational Biology , Databases, Bibliographic , Drug-Related Side Effects and Adverse Reactions , Humans , Internet , Pharmaceutical Preparations/metabolism
10.
Genome Biol ; 9 Suppl 2: S14, 2008.
Article in English | MEDLINE | ID: mdl-18834492

ABSTRACT

BACKGROUND: The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins. RESULTS: We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask. CONCLUSION: For gene mention normalization, our approach outperforms strategies that utilize only the matching of gene names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages. AVAILABILITY: Our methods for gene, protein, and species identification, and extraction of protein-protein interactions are available as part of the BioCreative Meta Services (BCMS), see http://bcms.bioinfo.cnio.es/.


Subject(s)
Genes , Information Storage and Retrieval , Models, Biological , Protein Interaction Mapping , Semantics , Animals , Drosophila , Mice , Sequence Alignment
11.
Genome Biol ; 9 Suppl 2: S6, 2008.
Article in English | MEDLINE | ID: mdl-18834497

ABSTRACT

We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.


Subject(s)
Biomedical Research/methods , Computational Biology/methods , Information Storage and Retrieval , Internet , Humans