2.
Bioinformatics ; 24(17): 1935-41, 2008 Sep 01.
Article in English | MEDLINE | ID: mdl-18593717

ABSTRACT

MOTIVATION: Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. There are numerous efforts that attempt to exploit this information by using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed. METHODS: PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources, available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D. RESULTS: The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while also providing additional insights into the clusters and the corresponding documents. AVAILABILITY: Source code in Perl and R is available from http://tartara.csd.auth.gr/~theodos/


Subjects
Artificial Intelligence , Cluster Analysis , Information Storage and Retrieval/methods , Natural Language Processing , Pattern Recognition, Automated/methods , PubMed , Software , Algorithms , Database Management Systems
3.
Nucleic Acids Res ; 30(7): 1575-84, 2002 Apr 01.
Article in English | MEDLINE | ID: mdl-11917018

ABSTRACT

Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.


Subjects
Algorithms , Databases, Protein , Proteins/genetics , Amino Acid Sequence , Genome, Human , Humans , Internet , Molecular Sequence Data , Sequence Alignment , Sequence Homology, Amino Acid , Transcription Factor TFIIB , Transcription Factors/genetics
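
The Markov cluster (MCL) step named in the abstract above can be illustrated with a minimal sketch. It is not the TRIBE-MCL implementation: the function names, the default inflation value of 2.0, the added self-loops, the convergence tolerance and the toy similarity matrix are all assumptions chosen for illustration, and a real run would start from precomputed sequence similarity scores.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster node indices by alternating expansion and inflation."""
    n = len(similarity)
    # Assumption: add a self-loop of weight 1 to every node before normalizing.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walks
        # Inflation: raise entries to a power and renormalize the columns.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < eps:
            break
    # Every row that still carries flow is an attractor; the columns it
    # reaches form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-3)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy similarity matrix: proteins 0-2 resemble each other, as do 3 and 4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(similarity))  # prints [[0, 1, 2], [3, 4]]
```

Higher inflation values generally yield finer-grained clusters, which is why the parameter is exposed here.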
4.
Nucleic Acids Res ; 29(21): 4395-404, 2001 Nov 01.
Article in English | MEDLINE | ID: mdl-11691927

ABSTRACT

Whole-genome clustering of the two available genome sequences of Helicobacter pylori strains 26695 and J99 allows the detection of 110 and 52 strain-specific genes, respectively. This set of strain-specific genes was compared with the sets obtained with other computational approaches of direct genome comparison as well as experimental data from microarray analysis. A considerable number of novel function assignments is possible using database-driven sequence annotation, although the function of the majority of the identified genes remains unknown. Using whole-genome clustering, it is also possible to detect species-specific genes by comparing the two H.pylori strains against the genome sequence of Campylobacter jejuni. It is interesting that the majority of strain-specific genes appear to be species specific. Finally, we introduce a novel approach to gene position analysis by employing measures from directional statistics. We show that although the two strains exhibit differences with respect to strain-specific gene distributions, this is due to the extensive genome rearrangements. If these are taken into account, a common pattern for the genome dynamics of the two Helicobacter strains emerges, suggestive of certain spatial constraints that may act as control mechanisms of gene flux.


Subjects
Evolution, Molecular , Genes, Bacterial/genetics , Genome, Bacterial , Genomics , Helicobacter pylori/classification , Helicobacter pylori/genetics , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/classification , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Campylobacter jejuni/genetics , Computational Biology , Databases, Protein , Gene Order/genetics , Internet , Models, Genetic , Molecular Sequence Data , Sequence Alignment , Species Specificity
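
The abstract above mentions measures from directional statistics for gene positions on a circular genome. The sketch below shows two standard measures of that kind, the circular mean and the mean resultant length. Which measures the authors actually used is not stated in the abstract, so the choice, the function name and the toy coordinates are assumptions.

```python
import math


def circular_summary(positions, genome_length):
    """Return (mean position, mean resultant length) for loci on a circle.

    Positions are mapped to angles so that coordinate 0 and coordinate
    genome_length coincide, as they do on a circular chromosome.
    """
    angles = [2 * math.pi * p / genome_length for p in positions]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    resultant = math.hypot(c, s)  # near 1: tightly clustered, near 0: spread out
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    mean_position = mean_angle / (2 * math.pi) * genome_length
    return mean_position, resultant


# Two genes on either side of the origin of a toy genome of length 100.
mean_pos, r = circular_summary([95, 5], 100)
```

An ordinary arithmetic mean would place these two genes at coordinate 50, the opposite side of the chromosome, whereas the circular mean lands at the origin (reported as approximately 0 or approximately 100, which are the same point).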
5.
Bioinformatics ; 17(9): 853-4, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11590107

ABSTRACT

UNLABELLED: Graph layout is extensively used in mathematics and computer science; however, these ideas and methods have not been extended in a general fashion to the construction of graphs for biological data. To this end, we have implemented a version of the Fruchterman-Reingold graph layout algorithm, extensively modified for the purpose of similarity analysis in biology. This algorithm rapidly and effectively generates clear two-dimensional (2D) or three-dimensional (3D) graphs representing similarity relationships such as protein sequence similarity. The implementation of the algorithm is general and applicable to most types of similarity information for biological data. AVAILABILITY: BioLayout is available for most UNIX platforms at the following website: http://www.ebi.ac.uk/research/cgg/services/layout.


Subjects
Algorithms , Computer Graphics , Amino Acid Sequence , Computer Graphics/statistics & numerical data , Computer Graphics/trends , Databases, Protein/statistics & numerical data , Databases, Protein/trends , Image Processing, Computer-Assisted/statistics & numerical data , Image Processing, Computer-Assisted/trends , Imaging, Three-Dimensional/statistics & numerical data , Imaging, Three-Dimensional/trends , Software/statistics & numerical data , Software/trends
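
For readers unfamiliar with the layout method named in the abstract above, here is a minimal 2D sketch of the basic Fruchterman-Reingold scheme. It does not reproduce the extensive modifications made for BioLayout (for example, it ignores similarity weights and has no 3D mode); the function name, frame size, cooling schedule and fixed random seed are assumptions for illustration only.

```python
import math
import random


def fruchterman_reingold(nodes, edges, width=1.0, height=1.0,
                         iterations=100, seed=0):
    """Return {node: (x, y)} for an undirected graph (basic 2D layout)."""
    nodes = list(nodes)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pos = {v: [rng.uniform(0, width), rng.uniform(0, height)] for v in nodes}
    k = math.sqrt(width * height / len(nodes))  # ideal node spacing
    temperature = width / 10.0                  # largest allowed step
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsive force k^2 / d between every pair of nodes.
        for i, v in enumerate(nodes):
            for u in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                push = k * k / dist
                disp[v][0] += dx / dist * push
                disp[v][1] += dy / dist * push
                disp[u][0] -= dx / dist * push
                disp[u][1] -= dy / dist * push
        # Attractive force d^2 / k along every edge.
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            pull = dist * dist / k
            disp[v][0] -= dx / dist * pull
            disp[v][1] -= dy / dist * pull
            disp[u][0] += dx / dist * pull
            disp[u][1] += dy / dist * pull
        # Move each node, capped by the temperature and clamped to the frame.
        for v in nodes:
            dx, dy = disp[v]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[v][0] = min(width, max(0.0, pos[v][0] + dx / length * step))
            pos[v][1] = min(height, max(0.0, pos[v][1] + dy / length * step))
        temperature -= cooling
    return {v: (p[0], p[1]) for v, p in pos.items()}


# Toy graph: a, b and c form a chain of similar sequences; d is unrelated.
layout = fruchterman_reingold(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
```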
6.
Genome Res ; 11(9): 1503-10, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11544193

ABSTRACT

We have analyzed the known metabolic enzymes of Escherichia coli in relation to their biochemical reaction properties and their involvement in biochemical pathways. All enzymes involved in small-molecule metabolism and their corresponding protein sequences have been extracted from the EcoCyc database. These 548 metabolic enzymes are clustered into 405 protein families according to sequence similarity. In this study, we examine the functional versatility within enzyme families in terms of their reaction capabilities and pathway participation. In addition, we examine the molecular diversity of reactions and pathways according to their presence across enzyme families. These complex, many-to-many relationships between protein sequence and biochemical function reveal a significant degree of correlation between enzyme families and reactions. Pathways, however, appear to require more than one enzyme type to perform their complex biochemical transformations. Finally, the distribution of enzyme family members across different pathways provides support for the "recruitment" hypothesis of biochemical pathway evolution.


Subjects
Enzymes/physiology , Escherichia coli/enzymology , Escherichia coli/genetics , Multigene Family , Amino Acid Sequence , Computational Biology , Databases, Factual , Enzymes/genetics , Enzymes/metabolism , Genetic Variation , Molecular Sequence Data , Sequence Alignment , Structure-Activity Relationship
7.
Pac Symp Biocomput ; : 384-95, 2001.
Article in English | MEDLINE | ID: mdl-11262957

ABSTRACT

We present an algorithm for large-scale document clustering of biological text, obtained from Medline abstracts. The algorithm is based on statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization. The method is flexible and robust, controlled by a small number of parameter values. Experiments show that the resulting document clusters are meaningful as assessed by cluster-specific terms. Despite the statistical nature of the approach, with minimal semantic analysis, the terms provide a shallow description of the document corpus and support concept discovery.


Subjects
Abstracting and Indexing , Algorithms , MEDLINE , Molecular Biology , Animals , Artificial Intelligence , Cluster Analysis , Drosophila/embryology , Terminology as Topic
8.
Nucleic Acids Res ; 29(7): 1608-15, 2001 Apr 01.
Article in English | MEDLINE | ID: mdl-11266564

ABSTRACT

The global amino acid compositions as deduced from the complete genomic sequences of six thermophilic archaea, two thermophilic bacteria, 17 mesophilic bacteria and two eukaryotic species were analysed by hierarchical clustering and principal components analysis. Both methods showed an influence of several factors on amino acid composition. Although GC content has a dominant effect, thermophilic species can be identified by their global amino acid compositions alone. This study presents a careful statistical analysis of factors that affect amino acid composition and also yielded specific features of the average amino acid composition of thermophilic species. Moreover, we introduce the first example of a 'compositional tree' of species that takes into account not only homologous proteins, but also proteins unique to particular species. We expect this simple yet novel approach to be a useful additional tool for the study of phylogeny at the genome level.


Subjects
Amino Acids/genetics , Genome , Amino Acids/chemistry , Animals , Archaea/genetics , Bacteria/genetics , Caenorhabditis elegans/genetics , Databases, Factual , Genome, Archaeal , Genome, Bacterial , Genome, Fungal , Phylogeny , Saccharomyces cerevisiae/genetics , Species Specificity , Temperature
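
The abstract above clusters species by global amino acid composition. A minimal sketch of that idea follows: each proteome is reduced to a 20-dimensional frequency vector and the vectors are merged bottom-up. The abstract does not state the distance or linkage used, so Euclidean distance and average linkage are assumptions here, as are the function names and the toy sequences.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(proteins):
    """Return the 20 amino acid frequencies over all given sequences."""
    counts = Counter()
    for seq in proteins:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total for a in AMINO_ACIDS]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(profiles):
    """Average-linkage clustering; returns merges as (left, right, distance)."""
    clusters = {name: [name] for name in profiles}  # label -> member species
    merges = []
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Average distance over all cross-cluster species pairs.
                d = sum(euclidean(profiles[x], profiles[y])
                        for x in clusters[a] for y in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy proteomes (hypothetical sequences, not real data).
profiles = {
    "X": composition(["AAAA", "AAAG"]),
    "Y": composition(["AAAG", "AAAA"]),
    "Z": composition(["WWWW"]),
}
merge_order = agglomerate(profiles)
```

Because the vectors are built from every protein in a genome, species-specific proteins contribute to the result as well as homologous ones, which is the property the abstract highlights for its compositional tree.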
9.
Genome Biol ; 2(1): INTERACTIONS0001, 2001.
Article in English | MEDLINE | ID: mdl-11178275

ABSTRACT

To assess how automatic function assignment will contribute to genome annotation in the next five years, we have performed an analysis of 31 available genome sequences. An emerging pattern is that function can be predicted for almost two-thirds of the 73,500 genes that were analyzed. Despite progress in computational biology, there will always be a great need for large-scale experimental determination of protein function.


Subjects
Genome , Sequence Analysis, DNA , Animals , Genome, Human , Genomics/methods , Genomics/trends , Humans , Proteome , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/trends
10.
Bioinformatics ; 17(1): 95-7, 2001 Jan.
Article in English | MEDLINE | ID: mdl-11222266

ABSTRACT

The mechanisms controlling gene regulation appear to be fundamentally different in eukaryotes and prokaryotes (Struhl (1999) Cell, 98, 1-4). To investigate this diversity further, we have analysed the distribution of all known transcription-associated proteins (TAPs), as reflected by sequence database annotations. Our results for the primary phylogenetic domains (Archaea, Bacteria and Eukaryota) show that TAP families are mostly taxon-specific and very few transcriptional regulators are common across these domains.


Subjects
Computational Biology , Proteins/genetics , Databases, Factual , Phylogeny , Proteins/classification , Transcription Factors/classification , Transcription Factors/genetics , Transcription, Genetic
11.
Genome Biol ; 2(9): RESEARCH0034, 2001.
Article in English | MEDLINE | ID: mdl-11820254

ABSTRACT

BACKGROUND: It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes. RESULTS: Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins. CONCLUSIONS: These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.


Subjects
Artificial Gene Fusion , Evolution, Molecular , Genome , Proteins/genetics , Proteins/metabolism , Recombination, Genetic/genetics , Algorithms , Animals , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology/methods , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Fungal Proteins/genetics , Fungal Proteins/metabolism , Gene Expression Profiling , Multigene Family/genetics , Phylogeny , Protein Binding , Recombinant Fusion Proteins/genetics , Recombinant Proteins/genetics , Reproducibility of Results , Two-Hybrid System Techniques
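
The core test behind the gene fusion analysis in the abstract above can be sketched as follows: two component proteins are predicted to be functionally associated when both match the same composite protein in another genome, in different regions of it, and are not similar to each other. The abstract does not describe the fully automatic protocol in detail, so the input format, the overlap rule, the function name and the identifiers below are hypothetical.

```python
from itertools import combinations


def fusion_pairs(hits, similar_pairs=frozenset(), max_overlap=0):
    """Predict associated component pairs from similarity hits.

    hits: iterable of (component, composite, start, end), where start and end
          are the 1-based matched region on the composite protein.
    similar_pairs: frozensets of components that resemble each other; these
          are excluded because their shared hit may simply reflect paralogy.
    """
    by_composite = {}
    for component, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((component, start, end))

    predictions = set()
    for composite, regions in by_composite.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            pair = frozenset((q1, q2))
            if pair in similar_pairs:
                continue
            # Number of composite residues covered by both matches.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                predictions.add((tuple(sorted(pair)), composite))
    return sorted(predictions)


# Hypothetical identifiers: A1 and A2 are paralogs, AB is a fused protein.
hits = [("A1", "AB", 1, 250), ("B1", "AB", 260, 450), ("A2", "AB", 5, 240)]
paralogs = {frozenset(("A1", "A2"))}
print(fusion_pairs(hits, paralogs))
# prints [(('A1', 'B1'), 'AB'), (('A2', 'B1'), 'AB')]
```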
12.
RNA ; 7(12): 1693-701, 2001 Dec.
Article in English | MEDLINE | ID: mdl-11780626

ABSTRACT

Domains rich in alternating arginine and serine residues (RS domains) are frequently found in metazoan proteins involved in pre-mRNA splicing. The RS domains of splicing factors associate with each other and are important for the formation of protein-protein interactions required for both constitutive and regulated splicing. The prevalence of the RS domain in splicing factors suggests that it might serve as a useful signature for the identification of new proteins that function in pre-mRNA processing, although it remains to be determined whether RS domains also participate in other cellular functions. Using database search and sequence clustering methods, we have identified and categorized RS domain proteins encoded within the entire genomes of Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae. This genome-wide survey revealed a surprising complexity of RS domain proteins in metazoans with functions associated with chromatin structure, transcription by RNA polymerase II, cell cycle, and cell structure, as well as pre-mRNA processing. Also identified were RS domain proteins in S. cerevisiae with functions associated with cell structure, osmotic regulation, and cell cycle progression. The results thus demonstrate an effective strategy for the genomic mining of RS domain proteins. The identification of many new proteins using this strategy has provided a database of factors that are candidates for forming RS domain-mediated interactions associated with different steps in pre-mRNA processing, in addition to other cellular functions.


Subjects
Amino Acid Motifs/genetics , Computational Biology/methods , Molecular Biology/methods , Protein Structure, Tertiary/genetics , Animals , Arginine/genetics , Caenorhabditis elegans/genetics , Cell Cycle , Chromatin/metabolism , Drosophila melanogaster/genetics , Evolution, Molecular , Genome , Humans , Phosphoprotein Phosphatases , Protein Kinases , RNA Polymerase II/metabolism , RNA Processing, Post-Transcriptional , Research Design , Saccharomyces cerevisiae/genetics , Serine/genetics , Transcription, Genetic
13.
Bioinformatics ; 16(10): 915-22, 2000 Oct.
Article in English | MEDLINE | ID: mdl-11120681

ABSTRACT

MOTIVATION: Sensitive detection and masking of low-complexity regions in protein sequences. Filtered sequences can be used in sequence comparison without the risk of matching compositionally biased regions. The main advantage of the method over similar approaches is the selective masking of single residue types without affecting other, possibly important, regions. RESULTS: A novel algorithm for low-complexity region detection and selective masking. The algorithm is based on multiple-pass Smith-Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties. The output of the algorithm is both the masked query sequence for further analysis, e.g. database searches, as well as the regions of low complexity. The detection of low-complexity regions is highly specific for single residue types. It is shown that this approach is sufficient for masking database query sequences without generating false positives. The algorithm is benchmarked against widely available algorithms using the 210 genes of Plasmodium falciparum chromosome 2, a dataset known to contain a large number of low-complexity regions. AVAILABILITY: CAST (version 1.0) executable binaries are available to academic users free of charge under license. Web site entry point, server and additional material: http://www.ebi.ac.uk/research/cgg/services/cast/


Subjects
Algorithms , DNA, Protozoan/chemistry , Plasmodium falciparum/genetics , Sequence Analysis, DNA/methods , Animals , DNA, Protozoan/genetics , Databases, Factual , Genes, Protozoan , Open Reading Frames
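
The masking idea in the abstract above can be approximated in a few lines. With infinite gap penalties, a local alignment against a homopolymer cannot contain gaps, so the best alignment for one residue type reduces to the highest-scoring contiguous stretch of the query. The sketch below relies on that reading and is not the CAST program: the +1/-1 scoring stands in for whatever substitution scores the tool uses, and the threshold, function names and toy sequences are assumptions.

```python
def best_segment(seq, residue, match=1, mismatch=-1):
    """Return (score, start, end) of the best-scoring stretch (end exclusive)."""
    best = (0, 0, 0)
    score, start = 0, 0
    for i, c in enumerate(seq):
        score += match if c == residue else mismatch
        if score <= 0:
            score, start = 0, i + 1  # restart, as local alignment does
        elif score > best[0]:
            best = (score, start, i + 1)
    return best


def mask_low_complexity(seq, threshold=5, residues="ACDEFGHIKLMNPQRSTVWY"):
    """Mask biased regions one residue type at a time, in repeated passes."""
    seq = list(seq)
    regions = []
    for residue in residues:
        while True:
            score, start, end = best_segment(seq, residue)
            if score < threshold:
                break
            regions.append((residue, start, end))
            # Selective masking: only the biased residue type becomes X.
            for i in range(start, end):
                if seq[i] == residue:
                    seq[i] = "X"
    return "".join(seq), regions


masked, found = mask_low_complexity("MKTNNNNNNNNAYLE")
print(masked)  # prints MKTXXXXXXXXAYLE
print(found)   # prints [('N', 3, 11)]
```

Other residues inside a detected region are left untouched, which mirrors the selective masking of single residue types that the abstract presents as the main advantage of the method.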
14.
Nucleic Acids Res ; 28(22): 4573-6, 2000 Nov 15.
Article in English | MEDLINE | ID: mdl-11071948

ABSTRACT

The proliferation of genome sequence data has led to the development of a number of tools and strategies that facilitate computational analysis. These methods include the identification of motif patterns, membership of the query sequences in family databases, metabolic pathway involvement and gene proximity. We re-examined the completely sequenced genome of Thermotoga maritima by employing the combined use of the above methods. By analyzing all 1877 proteins encoded in this genome, we identified 193 cases of conflicting annotations (10%), of which 164 are new function predictions and 29 are amendments of previously proposed assignments. These results suggest that the combined use of existing computational tools can resolve inconclusive sequence similarities and significantly improve the prediction of protein function from genome sequence.


Subjects
Genome, Bacterial , Sequence Alignment/methods , Thermotoga maritima/genetics , Computational Biology , Genes, Bacterial/genetics , Open Reading Frames , Sequence Analysis
16.
FEBS Lett ; 480(1): 42-8, 2000 Aug 25.
Article in English | MEDLINE | ID: mdl-10967327

ABSTRACT

Computational genomics is a subfield of computational biology that deals with the analysis of entire genome sequences. Transcending the boundaries of classical sequence analysis, computational genomics exploits the inherent properties of entire genomes by modelling them as systems. We review recent developments in the field, discuss in some detail a number of novel approaches that take into account the genomic context and argue that progress will be made by novel knowledge representation and simulation technologies.


Subjects
Computational Biology/methods , Computational Biology/trends , Genes , Genome , Animals , Computer Simulation , Databases as Topic , Genes/genetics , Genes/physiology , Humans , Multigene Family/genetics , Recombinant Fusion Proteins/genetics , Sequence Alignment
17.
Pac Symp Biocomput ; : 541-52, 2000.
Article in English | MEDLINE | ID: mdl-10902201

ABSTRACT

This paper motivates the use of Information Extraction (IE) for gathering data on protein interactions, describes the customization of an existing IE system, SRI's Highlight, for this task and presents the results of an experiment on unseen Medline abstracts which show that customization to a new domain can be fast, reliable and cost-effective.


Subjects
Information Storage and Retrieval , MEDLINE , Proteins/metabolism , Abstracting and Indexing , Language , Linguistics
18.
Bioinformatics ; 16(5): 451-7, 2000 May.
Article in English | MEDLINE | ID: mdl-10871267

ABSTRACT

MOTIVATION: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. RESULTS: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. AVAILABILITY: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.


Subjects
Algorithms , Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Cluster Analysis , Databases, Factual , Fungal Proteins/chemistry , Fungal Proteins/genetics , Genome, Bacterial , Genome, Fungal , Protein Structure, Tertiary , Sequence Alignment/statistics & numerical data
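
Two steps from the abstract above, symmetrification of the binary similarity matrix and single-linkage clustering, can be sketched as follows. This is not the GeneRAGE code: the multi-domain detection step is omitted, the Smith-Waterman re-check of one-directional hits is replaced by a hypothetical `confirm` callback that rejects everything by default, and all names are assumptions.

```python
def symmetrify(hits, confirm=lambda a, b: False):
    """Keep a relationship only if it holds in both directions.

    hits: set of directed (query, subject) pairs from a similarity search.
    confirm: hypothetical callback standing in for a pairwise alignment that
             can rescue a hit seen in one direction only.
    """
    edges = set()
    for a, b in hits:
        if a == b:
            continue
        if (b, a) in hits or confirm(a, b):
            edges.add(frozenset((a, b)))
    return edges


def single_linkage(items, edges):
    """Group items into families: connected components via union-find."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    families = {}
    for x in items:
        families.setdefault(find(x), []).append(x)
    return sorted(sorted(f) for f in families.values())


# Hypothetical proteins; the p3 -> p4 hit is not reciprocated.
hits = {("p1", "p2"), ("p2", "p1"), ("p2", "p3"), ("p3", "p2"), ("p3", "p4")}
proteins = ["p1", "p2", "p3", "p4", "p5"]
print(single_linkage(proteins, symmetrify(hits)))
# prints [['p1', 'p2', 'p3'], ['p4'], ['p5']]
```

Single linkage joins families through any shared member, which is why the abstract pairs it with a separate multi-domain detection step: a protein carrying two domains would otherwise merge two unrelated families.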
20.
Genome Res ; 10(4): 568-76, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779499

ABSTRACT

The EcoCyc database characterizes the known network of Escherichia coli small-molecule metabolism. Here we present a computational analysis of the global properties of that network, which consists of 744 reactions that are catalyzed by 607 enzymes. The reactions are organized into 131 pathways. Of the metabolic enzymes, 100 are multifunctional, and 68 of the reactions are catalyzed by >1 enzyme. The network contains 791 chemical substrates. Other properties considered by the analysis include the distribution of enzyme subunit organization, and the distribution of modulators of enzyme activity and of enzyme cofactors. The dimensions chosen for this analysis can be employed for comparative functional analysis of complete genomes.


Subjects
Escherichia coli/metabolism , Catalysis , Computational Biology/methods , Databases, Factual , Enzyme Activation/genetics , Escherichia coli/enzymology , Escherichia coli/genetics , Genome, Bacterial , Multienzyme Complexes/genetics
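
The global counts reported in the abstract above (enzymes, reactions, multifunctional enzymes, reactions catalyzed by more than one enzyme) are the kind of figures that fall out of a simple enzyme-to-reaction table. The sketch below assumes such a table is available as pairs; it does not use the EcoCyc data or schema, and the function name and identifiers are hypothetical.

```python
from collections import defaultdict


def network_summary(catalysis):
    """Summarize a metabolic network given (enzyme, reaction) pairs."""
    reactions_of = defaultdict(set)  # enzyme -> reactions it catalyzes
    enzymes_of = defaultdict(set)    # reaction -> enzymes catalyzing it
    for enzyme, reaction in catalysis:
        reactions_of[enzyme].add(reaction)
        enzymes_of[reaction].add(enzyme)
    return {
        "enzymes": len(reactions_of),
        "reactions": len(enzymes_of),
        "multifunctional_enzymes": sum(
            1 for r in reactions_of.values() if len(r) > 1),
        "reactions_with_multiple_enzymes": sum(
            1 for e in enzymes_of.values() if len(e) > 1),
    }


# Hypothetical toy network: e1 is multifunctional, r2 has two enzymes.
toy = [("e1", "r1"), ("e1", "r2"), ("e2", "r2"), ("e3", "r3")]
summary = network_summary(toy)
```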