1.
Mol Cell Proteomics ; 10(11): M111.010629, 2011 Nov.
Article in English | MEDLINE | ID: mdl-21836163

ABSTRACT

Information about the physical association of proteins is extensively used for studying cellular processes and disease mechanisms. However, complete experimental mapping of the human interactome will remain prohibitively difficult in the near future. Here we present a map of predicted human protein interactions that distinguishes functional association from physical binding. Our network classifies more than 5 million protein pairs predicting 94,009 new interactions with high confidence. We experimentally tested a subset of these predictions using yeast two-hybrid analysis and affinity purification followed by quantitative mass spectrometry. Thus we identified 462 new protein-protein interactions and confirmed the predictive power of the network. These independent experiments address potential issues of circular reasoning and are a distinctive feature of this work. Analysis of the physical interactome unravels subnetworks mediating between different functional and physical subunits of the cell. Finally, we demonstrate the utility of the network for the analysis of molecular mechanisms of complex diseases by applying it to genome-wide association studies of neurodegenerative diseases. This analysis provides new evidence implying TOMM40 as a factor involved in Alzheimer's disease. The network provides a high-quality resource for the analysis of genomic data sets and genetic association studies in particular. Our interactome is available via the hPRINT web server at: www.print-db.org.


Subjects
Computer Simulation; Models, Molecular; Protein Interaction Mapping/methods; Algorithms; Animals; Bayes Theorem; HeLa Cells; Humans; Mice; Neurodegenerative Diseases/genetics; Neurodegenerative Diseases/metabolism; Protein Interaction Domains and Motifs; Protein Interaction Maps; Proteome/genetics; Proteome/metabolism; ROC Curve; Recombinant Proteins/metabolism; Statistics, Nonparametric
2.
Bioinformatics ; 27(19): 2769-71, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21813477

ABSTRACT

SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.


Subjects
Data Mining; Gene Library; Electronic Data Processing; Genes; Internet; Proteins; Publishing; Terminology as Topic
3.
Nucleic Acids Res ; 37(Web Server issue): W300-4, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19465383

ABSTRACT

High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It covers not only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.


Subjects
Genes; Software; Animals; Bone Resorption/genetics; Gene Expression Profiling; Humans; Mice; Mutation; Oligonucleotide Array Sequence Analysis; Osteoporosis/genetics; Pancreatic Neoplasms/genetics; Porphyria, Hepatoerythropoietic/genetics; PubMed; Rats; Vocabulary, Controlled
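The GoGene abstract above describes text mining that extracts co-occurrences of genes and ontology terms from the literature. A minimal sketch of that general idea follows. It is not GoGene's implementation: the function name `count_cooccurrences`, the naive case-insensitive substring matching, and the toy sentences and names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(abstracts, genes, terms):
    """Count in how many abstracts each (gene, term) pair appears together.

    Matching is a naive case-insensitive substring test (an assumption);
    a real system would use proper named entity recognition.
    """
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        found_genes = [g for g in genes if g.lower() in lowered]
        found_terms = [t for t in terms if t.lower() in lowered]
        # Every gene found in this abstract is paired with every term found in it.
        counts.update(product(found_genes, found_terms))
    return counts


# Toy input with hypothetical gene and term names, not real data.
toy_abstracts = [
    "GENEA expression is altered in bone resorption.",
    "Loss of GENEB was observed in osteoporosis.",
    "GENEA regulates bone resorption in mice.",
]
pair_counts = count_cooccurrences(
    toy_abstracts, ["GENEA", "GENEB"], ["bone resorption", "osteoporosis"]
)
```

Counting abstracts rather than individual mentions keeps a single long abstract from dominating a pair's score; ranking by novelty or importance, as the abstract mentions, would need additional signals not sketched here.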
4.
Brief Bioinform ; 9(6): 466-78, 2008 Nov.
Article in English | MEDLINE | ID: mdl-19060303

ABSTRACT

The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time consuming, and does not scale with the ever increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale-up high-quality manual curation.


Subjects
Computational Biology/methods; Databases, Genetic; Genes; Information Storage and Retrieval/methods; Abstracting and Indexing; Animals; Databases, Bibliographic; Humans; Knowledge; Semantics
5.
BMC Bioinformatics ; 10 Suppl 8: S3, 2009 Aug 27.
Article in English | MEDLINE | ID: mdl-19758467

ABSTRACT

BACKGROUND: The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets. RESULTS: We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Vice versa, we could show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, of which none had been annotated in the UniProt or PDB database. In 71% of cases, the reported phenotypes were consistent with the model predictions, supporting a relation between mutations and stability issues in membrane proteins. CONCLUSION: We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.


Subjects
Computational Biology/methods; Information Storage and Retrieval/methods; Membrane Proteins/genetics; Mutation; Algorithms; Amino Acid Substitution; Animals; Databases, Genetic; Genes; Genomics; Humans; Membrane Proteins/chemistry; Models, Genetic; Pattern Recognition, Automated; Periodicals as Topic; Phenotype; Point Mutation; Protein Stability; PubMed; Sequence Analysis
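The abstract above mentions a regular expression-based retrieval of protein point mutations and a sequence check for linking each mutation to a protein. The following is a minimal sketch of how such a step could look, not the authors' pipeline: the two patterns (one-letter mentions such as `E113Q`, three-letter mentions such as `Ala292Glu`), the function names, and the example sequence are assumptions for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "".join(sorted(THREE_TO_ONE.values()))
THREE_LETTER = "|".join(THREE_TO_ONE)

# Wild-type residue, position, mutant residue.
ONE_LETTER_RE = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")
THREE_LETTER_RE = re.compile(rf"\b({THREE_LETTER})(\d+)({THREE_LETTER})\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    hits = []
    for m in ONE_LETTER_RE.finditer(text):
        hits.append((m.start(), m.group(1), int(m.group(2)), m.group(3)))
    for m in THREE_LETTER_RE.finditer(text):
        wild, mutant = THREE_TO_ONE[m.group(1)], THREE_TO_ONE[m.group(3)]
        hits.append((m.start(), wild, int(m.group(2)), mutant))
    hits.sort()
    # A substitution must change the residue.
    return [(w, p, mt) for _, w, p, mt in hits if w != mt]


def matches_sequence(sequence, wild_type, position):
    """Sequence check: is the wild-type residue at this 1-based position?"""
    return 0 < position <= len(sequence) and sequence[position - 1] == wild_type


mutations = find_point_mutations("The E113Q and Ala292Glu substitutions were tested.")
```

Patterns this simple produce false positives (a cell line name such as T47D has the same shape as a mutation), which is one reason a sequence check against candidate proteins is useful before a link is accepted.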
6.
Bioinformatics ; 24(16): i126-132, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18689813

ABSTRACT

MOTIVATION: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. RESULTS: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. AVAILABILITY: A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. SUPPLEMENTARY INFORMATION: The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/.


Subjects
Algorithms; Artificial Intelligence; Databases, Genetic; Genes/genetics; Information Storage and Retrieval/methods; MEDLINE; Natural Language Processing; Periodicals as Topic; Software; Species Specificity
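Several records in this list report an F-measure alongside precision and recall. As a worked check, the balanced F-measure is the harmonic mean of the two, and the sketch below reproduces the 81.4% reported in the GNAT abstract from its stated 90.8% precision and 73.8% recall. The function name `f_measure` and the `beta` parameter are illustrative choices; that the abstract uses the balanced (beta = 1) form is an assumption, supported by the numbers agreeing.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both inputs are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Precision and recall as stated in the GNAT abstract.
gnat_f = f_measure(0.908, 0.738)
print(f"{gnat_f:.3f}")  # prints 0.814
```

Because the harmonic mean is pulled toward the smaller value, the score sits closer to the 73.8% recall than the arithmetic mean (82.3%) would.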
7.
Bioinformatics ; 22(19): 2444-5, 2006 Oct 01.
Article in English | MEDLINE | ID: mdl-16870931

ABSTRACT

UNLABELLED: The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein-protein interactions, gene-disease associations and subcellular locations of proteins. When searching such information using conventional search engines, e.g. PubMed, users see the data only one abstract at a time and 'hidden' in natural language text. AliBaba is an interactive tool for graphical summarization of search results. It parses the set of abstracts that fit a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. AliBaba extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp complex networks described in various articles at a glance. AVAILABILITY: http://alibaba.informatik.hu-berlin.de/


Subjects
Abstracting and Indexing/methods; Algorithms; Computer Graphics; Natural Language Processing; Periodicals as Topic; Software; User-Computer Interface; Database Management Systems; Information Storage and Retrieval/methods; Vocabulary, Controlled
8.
BMC Bioinformatics ; 6 Suppl 1: S9, 2005.
Article in English | MEDLINE | ID: mdl-15960843

ABSTRACT

In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.


Subjects
Computational Biology/methods; Genes; Pattern Recognition, Automated/classification; Pattern Recognition, Automated/methods; Recognition, Psychology; Terminology as Topic
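The abstract above lists the kinds of features fed to a sliding-window word classifier: pre- and postfixes, character n-grams, capitalization patterns, and neighbouring words. A minimal sketch of such a feature extractor is shown below. The function names, the feature naming scheme, the affix length of 2 and the trigram size are assumptions for illustration, not the feature set used in the paper, and the classifier itself is left out.

```python
def shape(word):
    """Collapse a word to a capitalization pattern, e.g. 'BRCA1' -> 'A0'."""
    out = []
    for ch in word:
        if ch.isupper():
            code = "A"
        elif ch.islower():
            code = "a"
        elif ch.isdigit():
            code = "0"
        else:
            code = "-"
        if not out or out[-1] != code:  # merge runs of the same class
            out.append(code)
    return "".join(out)


def char_ngrams(word, n=3):
    """Character n-grams over the word padded with start/end markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def window_features(tokens, i, size=1):
    """Binary features for tokens[i] and its neighbours within the window."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats[f"{offset}:lower={word.lower()}"] = 1
        feats[f"{offset}:shape={shape(word)}"] = 1
    word = tokens[i]
    feats[f"prefix={word[:2]}"] = 1
    feats[f"suffix={word[-2:]}"] = 1
    for gram in char_ngrams(word):
        feats[f"ngram={gram}"] = 1
    return feats


features = window_features(["the", "p53", "protein"], 1)
```

Each dictionary would be turned into a sparse vector before training a classifier such as a linear SVM. Recursive feature elimination, as described in the abstract, then repeatedly retrains and removes the features that contribute least, which is how the impact of each feature group can be quantified.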
9.
Curr Pharm Biotechnol ; 12(3): 449-57, 2011 Mar 01.
Article in English | MEDLINE | ID: mdl-21133848

ABSTRACT

Huge volumes of data, produced by microarrays and next-generation sequencing, are now at the fingertips of scientists and allow the scope to be expanded beyond conventional drug design. New promiscuous drugs directed at multiple targets promise increased therapeutic efficacy for treatment of multifactorial diseases. At the same time, more systematic tests for unwanted side effects are now possible. In this paper, we focus on the application of text mining and ontologies to support experimental drug discovery. Text mining is a high-throughput technique to extract information from millions of scientific documents and web pages. By exploiting the vast number of extracted facts as well as the indirect links between them, text mining and ontologies help to generate new hypotheses on drug target interactions. We review the latest applications of text mining and ontologies suitable for target and drug-target interaction discovery in addition to conventional approaches. We conclude that mining the literature on drugs and proteins offers unique opportunities to support the laborious and expensive process of drug development.


Subjects
Data Mining/methods; Databases, Factual; Drug Discovery; High-Throughput Screening Assays; Information Storage and Retrieval; Computational Biology; Databases, Bibliographic; Drug-Related Side Effects and Adverse Reactions; Humans; Internet; Pharmaceutical Preparations/metabolism
10.
Genome Biol ; 9 Suppl 2: S14, 2008.
Article in English | MEDLINE | ID: mdl-18834492

ABSTRACT

BACKGROUND: The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins. RESULTS: We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask. CONCLUSION: For gene mention normalization, our approach outperforms strategies that utilize only the matching of gene names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages. AVAILABILITY: Our methods for gene, protein, and species identification, and extraction of protein-protein interactions are available as part of the BioCreative Meta Services (BCMS), see http://bcms.bioinfo.cnio.es/.


Subjects
Genes; Information Storage and Retrieval; Models, Biological; Protein Interaction Mapping; Semantics; Animals; Drosophila; Mice; Sequence Alignment
11.
Genome Biol ; 9 Suppl 2: S6, 2008.
Article in English | MEDLINE | ID: mdl-18834497

ABSTRACT

We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.


Subjects
Biomedical Research/methods; Computational Biology/methods; Information Storage and Retrieval; Internet; Humans