Results 1 - 20 of 38
1.
Nat Methods ; 13(5): 425-30, 2016 May.
Article in English | MEDLINE | ID: mdl-27043882

ABSTRACT

Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.


Subject(s)
Computational Biology/standards , Genomics/standards , Phylogeny , Proteomics/standards , Archaea/classification , Archaea/genetics , Bacteria/classification , Bacteria/genetics , Computational Biology/methods , Databases, Genetic , Eukaryota/classification , Eukaryota/genetics , Gene Ontology , Genomics/methods , Models, Genetic , Proteomics/methods , Sequence Analysis, Protein , Sequence Homology , Species Specificity
2.
Nucleic Acids Res ; 41(Web Server issue): W242-8, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23685612

ABSTRACT

The PhyloFacts 'Fast Approximate Tree Classification' (FAT-CAT) web server provides a novel approach to ortholog identification using subtree hidden Markov model-based placement of protein sequences to phylogenomic orthology groups in the PhyloFacts database. Results on a data set of microbial, plant and animal proteins demonstrate FAT-CAT's high precision at separating orthologs and paralogs and robustness to promiscuous domains. We also present results documenting the precision of ortholog identification based on subtree hidden Markov model scoring. The FAT-CAT phylogenetic placement is used to derive a functional annotation for the query, including confidence scores and drill-down capabilities. PhyloFacts' broad taxonomic and functional coverage, with >7.3 M proteins from across the Tree of Life, enables FAT-CAT to predict orthologs and assign function for most sequence inputs. Four pipeline parameter presets are provided to handle different sequence types, including partial sequences and proteins containing promiscuous domains; users can also modify individual parameters. PhyloFacts trees matching the query can be viewed interactively online using the PhyloScope Javascript tree viewer and are hyperlinked to various external databases. The FAT-CAT web server is available at http://phylogenomics.berkeley.edu/phylofacts/fatcat/.


Subject(s)
Phylogeny , Proteins/classification , Software , Animals , Classification/methods , Internet , Markov Chains , Molecular Sequence Annotation , Proteins/genetics , Proteins/physiology , Sequence Analysis, Protein
3.
Protein Sci ; 21(6): 769-85, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22528593

ABSTRACT

The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.


Subject(s)
Evolution, Molecular , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Animals , Humans , Models, Molecular , Molecular Sequence Data , Protein Conformation , Protein Folding , RNA, Messenger/genetics , Sequence Alignment
4.
Biochemistry ; 51(11): 2265-75, 2012 Mar 20.
Article in English | MEDLINE | ID: mdl-22324760

ABSTRACT

Pyrroloquinoline quinone (PQQ) is a small, redox active molecule that serves as a cofactor for several bacterial dehydrogenases, introducing pathways for carbon utilization that confer a growth advantage. Early studies had implicated a ribosomally translated peptide as the substrate for PQQ production. This study presents a sequence- and structure-based analysis of the components of the pqq operon. We find the necessary components for PQQ production are present in 126 prokaryotes, most of which are Gram-negative and a number of which are pathogens. A total of five gene products, PqqA, PqqB, PqqC, PqqD, and PqqE, are identified as being obligatory for PQQ production. Three of the gene products in the pqq operon, PqqB, PqqC, and PqqE, are members of large protein superfamilies. By combining evolutionary conservation patterns with information from three-dimensional structures, we are able to differentiate the gene products involved in PQQ biosynthesis from those with divergent functions. The observed persistence of a conserved gene order within analyzed operons strongly suggests a role for protein-protein interactions in the course of cofactor biosynthesis. These studies propose previously unidentified roles for several of the gene products, as well as identifying possible new targets for antibiotic design and application.


Subject(s)
Bacterial Proteins/genetics , Genes, Bacterial , Klebsiella pneumoniae/metabolism , PQQ Cofactor/biosynthesis , PQQ Cofactor/genetics , Amino Acid Sequence , Bacterial Proteins/metabolism , Models, Molecular , Molecular Sequence Data , Operon , Phylogeny , Protein Conformation
5.
Brief Bioinform ; 12(5): 413-22, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21712343

ABSTRACT

Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.


Subject(s)
Computational Biology/methods , Genome , Phylogeny , Animals , Databases, Factual , Evolution, Molecular , Humans , Proteins/chemistry
6.
Nucleic Acids Res ; 39(Database issue): D465-74, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21097780

ABSTRACT

ModBase (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by ModPipe, an automated modeling pipeline that relies primarily on Modeller for fold assignment, sequence-structure alignment, model building and model assessment (http://salilab.org/modeller/). ModBase currently contains 10,355,444 reliable models for domains in 2,421,920 unique protein sequences. ModBase allows users to update comparative models on demand, and request modeling of additional sequences through an interface to the ModWeb modeling server (http://salilab.org/modweb). ModBase models are available through the ModBase interface as well as the Protein Model Portal (http://www.proteinmodelportal.org/). Recently developed associated resources include the SALIGN server for multiple sequence and structure alignment (http://salilab.org/salign), the ModEval server for predicting the accuracy of protein structure models (http://salilab.org/modeval), the PCSS server for predicting which peptides bind to a given protein (http://salilab.org/pcss) and the FoXS server for calculating and fitting Small Angle X-ray Scattering profiles (http://salilab.org/foxs).


Subject(s)
Databases, Protein , Models, Molecular , Protein Structure, Tertiary , Bacterial Proteins/chemistry , Computer Graphics , Peptides/chemistry , Protein Interaction Mapping , Proteins/chemistry , Scattering, Small Angle , Sequence Alignment , Software , Structural Homology, Protein , User-Computer Interface , X-Ray Diffraction
7.
PLoS One ; 5(7): e11688, 2010 Jul 21.
Article in English | MEDLINE | ID: mdl-20657737

ABSTRACT

A significant fraction of a plant's nuclear genome encodes chloroplast-targeted proteins, many of which are devoted to the assembly and function of the photosynthetic apparatus. Using digital video imaging of chlorophyll fluorescence, we isolated proton gradient regulation 7 (pgr7) as an Arabidopsis thaliana mutant with low nonphotochemical quenching of chlorophyll fluorescence (NPQ). In pgr7, the xanthophyll cycle and the PSBS gene product, previously identified NPQ factors, were still functional, but the efficiency of photosynthetic electron transport was lower than in the wild type. The pgr7 mutant was also smaller in size and had lower chlorophyll content than the wild type in optimal growth conditions. Positional cloning located the pgr7 mutation in the At3g21200 (PGR7) gene, which was predicted to encode a chloroplast protein of unknown function. Chloroplast targeting of PGR7 was confirmed by transient expression of a GFP fusion protein and by stable expression and subcellular localization of an epitope-tagged version of PGR7. Bioinformatic analyses revealed that the PGR7 protein has two domains that are conserved in plants, algae, and bacteria, and the N-terminal domain is predicted to bind a cofactor such as FMN. Thus, we identified PGR7 as a novel, conserved nuclear gene that is necessary for efficient photosynthetic electron transport in chloroplasts of Arabidopsis.


Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Electron Transport/physiology , Green Fluorescent Proteins/metabolism , Photosynthesis/physiology , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Chlorophyll/metabolism , Computational Biology , Electron Transport/genetics , Green Fluorescent Proteins/genetics , Immunoblotting , Phenotype , Photosynthesis/genetics , Phylogeny , Plants, Genetically Modified/genetics , Plants, Genetically Modified/metabolism
8.
Nucleic Acids Res ; 38(Web Server issue): W29-34, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20430824

ABSTRACT

We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.


Subject(s)
Phylogeny , Sequence Alignment/methods , Sequence Analysis, Protein , Software , Algorithms , Internet , Markov Chains , Protein Structure, Tertiary
9.
PLoS Comput Biol ; 6(1): e1000621, 2010 Jan 29.
Article in English | MEDLINE | ID: mdl-20126522

Subject(s)
Genomics , Phylogeny
10.
Bioinformatics ; 26(5): 617-24, 2010 Mar 01.
Article in English | MEDLINE | ID: mdl-20080507

ABSTRACT

MOTIVATION: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state-of-the-art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites. RESULTS: In cross-validation experiments on two benchmark datasets from the Catalytic Site Atlas and CATRES resources containing a total of 437 manually curated enzymes spanning 487 SCOP families, Discern increases catalytic site recall between 12% and 20% over methods that combine information from both sequence and structure, and by ≥50% over methods that make use of sequence conservation signal only. Controlled experiments show that Discern's improvement in catalytic residue prediction is derived from the combination of three ingredients: the use of the INTREPID phylogenomic method to extract conservation information; the use of 3D structure data, including features computed for residues that are proximal in the structure; and a statistical regularization procedure to prevent overfitting.


Subject(s)
Catalytic Domain/genetics , Evolution, Molecular , Protein Conformation , Proteins/chemistry , Proteomics/methods , Binding Sites , Catalysis , Databases, Protein , Models, Molecular , Protein Folding , Sequence Analysis, Protein
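
The abstract of this record defines precision as the fraction of predicted catalytic residues that are catalytic, and recall as the fraction of catalytic residues identified. A minimal Python sketch of those two definitions follows. The function name `precision_recall` and the residue positions are hypothetical, chosen only for illustration; nothing here is taken from the Discern implementation.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of residue positions."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: fraction of predicted residues that are truly catalytic.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of truly catalytic residues that were predicted.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical positions: 4 predicted, 3 truly catalytic, 2 shared.
p, r = precision_recall({57, 102, 195, 214}, {57, 102, 189})
print(f"precision={p:.2f} recall={r:.2f}")
```

Predicting more residues can only keep recall the same or raise it, while precision usually falls, which is why the abstract reports each precision value at a corresponding recall.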
11.
BMC Bioinformatics ; 10: 197, 2009 Jun 27.
Article in English | MEDLINE | ID: mdl-19558703

ABSTRACT

BACKGROUND: Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed. RESULTS: We propose ResBoost, a new computational method to learn characteristics of catalytic residues. The method effectively selects and combines rules of thumb into a simple, easily interpretable logical expression that can be used for prediction. We formally define the rules of thumb that are often used to narrow the list of candidate residues, including residue evolutionary conservation, 3D clustering, solvent accessibility, and hydrophilicity. ResBoost builds on two methods from machine learning, the AdaBoost algorithm and Alternating Decision Trees, and provides precise control over the inherent trade-off between sensitivity and specificity. We evaluated ResBoost using cross-validation on a dataset of 100 enzymes from the hand-curated Catalytic Site Atlas (CSA). CONCLUSION: ResBoost achieved 85% sensitivity for a 9.8% false positive rate and 73% sensitivity for a 5.7% false positive rate. ResBoost reduces the number of false positives by up to 56% compared to the use of evolutionary conservation scoring alone. We also illustrate the ability of ResBoost to identify recently validated catalytic residues not listed in the CSA.


Subject(s)
Computational Biology/methods , Enzymes/chemistry , Software , Binding Sites , Catalysis , Databases, Protein
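
The abstract of this record reports results as sensitivity at a given false positive rate. As a reminder of how those two quantities relate to per-residue counts, here is a minimal Python sketch. The function name `sensitivity_and_fpr` and the counts in the example are hypothetical; they are not the counts behind the ResBoost results.

```python
def sensitivity_and_fpr(tp, fn, fp, tn):
    """Return (sensitivity, false positive rate) from confusion-matrix counts.

    tp: catalytic residues predicted as catalytic
    fn: catalytic residues missed
    fp: non-catalytic residues predicted as catalytic
    tn: non-catalytic residues correctly rejected
    """
    positives = tp + fn
    negatives = fp + tn
    sensitivity = tp / positives if positives else 0.0
    false_positive_rate = fp / negatives if negatives else 0.0
    return sensitivity, false_positive_rate


# Hypothetical counts for one enzyme: 10 catalytic and 100 other residues.
sens, fpr = sensitivity_and_fpr(tp=8, fn=2, fp=5, tn=95)
print(f"sensitivity={sens:.2f} false_positive_rate={fpr:.2f}")
```

Specificity is one minus the false positive rate, so the sensitivity/specificity trade-off mentioned in the abstract can be read off the same four counts.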
12.
Nucleic Acids Res ; 37(Web Server issue): W84-9, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19435885

ABSTRACT

Ortholog detection is essential in functional annotation of genomes, with applications to phylogenetic tree construction, prediction of protein-protein interaction and other bioinformatics tasks. We present here the PHOG web server employing a novel algorithm to identify orthologs based on phylogenetic analysis. Results on a benchmark dataset from the TreeFam-A manually curated orthology database show that PHOG provides a combination of high recall and precision competitive with both InParanoid and OrthoMCL, and allows users to target different taxonomic distances and precision levels through the use of tree-distance thresholds. For instance, OrthoMCL-DB achieved 76% recall and 66% precision on this dataset; at a slightly higher precision (68%) PHOG achieves 10% higher recall (86%). InParanoid achieved 87% recall at 24% precision on this dataset, while a PHOG variant designed for high recall achieves 88% recall at 61% precision, increasing precision by 37% over InParanoid. PHOG is based on pre-computed trees in the PhyloFacts resource, and contains over 366 K orthology groups with a minimum of three species. Predicted orthologs are linked to GO annotations, pathway information and biological literature. The PHOG web server is available at http://phylofacts.berkeley.edu/orthologs/.


Subject(s)
Phylogeny , Software , Algorithms , Animals , Humans , Internet , Mice , Reproducibility of Results , Sequence Analysis, Protein , User-Computer Interface
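
The comparisons in the abstract of this record (for example, 61% versus 24% precision described as a 37% increase, and 86% versus 76% recall described as 10% higher) match absolute differences in percentage points rather than relative changes. The short Python sketch below makes that arithmetic explicit using only figures quoted in the abstract; the helper names are hypothetical.

```python
def point_difference(new_pct, old_pct):
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(new_pct, old_pct):
    """Relative change from old_pct to new_pct, expressed in percent."""
    return 100.0 * (new_pct - old_pct) / old_pct


# Figures quoted in the abstract: PHOG versus InParanoid precision,
# and PHOG versus OrthoMCL-DB recall.
precision_gain = point_difference(61, 24)
recall_gain = point_difference(86, 76)
relative_precision_gain = relative_change(61, 24)
print(precision_gain, recall_gain, round(relative_precision_gain, 1))
```

Read as a relative change, the same precision difference would be far larger than 37%, so the percentage-point reading is the one consistent with the reported numbers.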
13.
Nucleic Acids Res ; 37(Web Server issue): W390-5, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19443452

ABSTRACT

We present the INTREPID web server for predicting functionally important residues in proteins. INTREPID has been shown to boost the recall and precision of catalytic residue prediction over other sequence-based methods and can be used to identify other types of functional residues. The web server takes an input protein sequence, gathers homologs, constructs a multiple sequence alignment and phylogenetic tree and finally runs the INTREPID method to assign a score to each position. Residues predicted to be functionally important are displayed on homologous 3D structures (where available), highlighting spatial patterns of conservation at various significance thresholds. The INTREPID web server is available at http://phylogenomics.berkeley.edu/intrepid.


Subject(s)
Proteins/chemistry , Software , Amino Acids/chemistry , Catalytic Domain , Internet , Models, Molecular , Phylogeny , Protein Conformation , Proteins/classification , Proteins/genetics , Sequence Analysis, Protein , Sequence Homology, Amino Acid , User-Computer Interface
14.
Bioinformatics ; 24(21): 2445-52, 2008 Nov 01.
Article in English | MEDLINE | ID: mdl-18776193

ABSTRACT

MOTIVATION: Identification of functionally important residues in proteins plays a significant role in biological discovery. Here, we present INTREPID--an information-theoretic approach for functional site identification that exploits the information in large diverse multiple sequence alignments (MSAs). INTREPID uses a traversal of the phylogeny in combination with a positional conservation score, based on Jensen-Shannon divergence, to rank positions in an MSA. While knowledge of protein 3D structure can significantly improve the accuracy of functional site identification, since structural information is not available for a majority of proteins, INTREPID relies solely on sequence information. We evaluated INTREPID on two tasks: predicting catalytic residues and predicting specificity determinants. RESULTS: In catalytic residue prediction, INTREPID provides significant improvements over Evolutionary Trace, ConSurf as well as over a baseline global conservation method on a set of 100 manually curated enzymes from the Catalytic Site Atlas. In particular, INTREPID is able to better predict catalytic positions that are not globally conserved and hence, attains improved sensitivity at high values of specificity. We also investigated the performance of INTREPID as a function of the evolutionary divergence of the protein family. We found that INTREPID is better able to exploit the diversity in such families and that accuracy improves when homologs with very low sequence identity are included in an alignment. In specificity determinant prediction, when subtype information is known, INTREPID-SPEC, a variant of INTREPID, attains accuracies that are competitive with other approaches for this task. AVAILABILITY: INTREPID is available for 16919 families in the PhyloFacts resource (http://phylogenomics.berkeley.edu/phylofacts).


Subject(s)
Algorithms , Proteins/chemistry , Binding Sites , Databases, Protein , Protein Conformation , Proteins/genetics , Sequence Alignment , Sequence Analysis, Protein , Software
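
The abstract of this record mentions a positional conservation score based on Jensen-Shannon divergence. The Python sketch below shows only that generic ingredient: the divergence between the amino acid distribution of one alignment column and a background distribution. The uniform background, the gap handling, and all function names are assumptions made for illustration; the phylogeny traversal that INTREPID combines with this score is not reproduced here, and this is not the authors' implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background over the 20 standard amino acids.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_distribution(column):
    """Relative amino acid frequencies in one MSA column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues)
    return {aa: n / total for aa, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two dict distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical columns: one fully conserved, one highly variable.
conserved = jensen_shannon_divergence(column_distribution("HHHHHH"), BACKGROUND)
variable = jensen_shannon_divergence(column_distribution("ACDEFG"), BACKGROUND)
print(f"conserved={conserved:.3f} variable={variable:.3f}")
```

With base-2 logarithms the divergence lies between 0 and 1, and a conserved column scores higher against the background than a variable one, which is the property a conservation ranking relies on.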
15.
Nucleic Acids Res ; 36(Database issue): D943-6, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17933772

ABSTRACT

The Generation Challenge Programme (GCP; www.generationcp.org) has developed an online resource documenting stress-responsive genes comparatively across plant species. This public resource is a compendium of protein families, phylogenetic trees, multiple sequence alignments (MSA) and associated experimental evidence. The central objective of this resource is to elucidate orthologous and paralogous relationships between plant genes that may be involved in response to environmental stress, mainly abiotic stresses such as water deficit ('drought'). The web-based graphical user interface (GUI) of the resource includes query and visualization tools that allow diverse searches and browsing of the underlying project database. The web interface can be accessed at http://dayhoff.generationcp.org.


Subject(s)
Crops, Agricultural/genetics , Databases, Genetic , Genes, Plant , Crops, Agricultural/metabolism , Dehydration , Environment , Gene Expression Profiling , Internet , Phylogeny , Plant Proteins/chemistry , Plant Proteins/classification , Sequence Alignment , User-Computer Interface
16.
PLoS Comput Biol ; 3(8): e160, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17708678

ABSTRACT

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.


Subject(s)
Algorithms , Artificial Intelligence , Pattern Recognition, Automated/methods , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Markov Chains , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity
17.
Nucleic Acids Res ; 35(Web Server issue): W27-32, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17488835

ABSTRACT

Phylogenomic analysis addresses the limitations of function prediction based on annotation transfer, and has been shown to enable the highest accuracy in prediction of protein molecular function. The Berkeley Phylogenomics Group provides a series of web servers for phylogenomic analysis: classification of sequences to pre-computed families and subfamilies using the PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering of proteins sharing the same domain architecture, MUSCLE multiple sequence alignment, SATCHMO simultaneous alignment and tree construction and SCI-PHY subfamily identification. The PhyloBuilder web server provides an integrated phylogenomic pipeline starting with a user-supplied protein sequence, proceeding to homolog identification, multiple alignment, phylogenetic tree construction, subfamily identification and structure prediction. The Berkeley Phylogenomics Group resources are available at http://phylogenomics.berkeley.edu.


Subject(s)
Computational Biology/methods , Phylogeny , Algorithms , Animals , Computers , Databases, Genetic , Databases, Protein , Humans , Internet , Models, Genetic , Protein Conformation , Sequence Alignment , Sequence Analysis, Protein , Software , User-Computer Interface
18.
BMC Evol Biol ; 7 Suppl 1: S12, 2007 Feb 08.
Article in English | MEDLINE | ID: mdl-17288570

ABSTRACT

BACKGROUND: Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure. By contrast, most common homolog detection methods are optimized for retrieving local homologs, and do not address this requirement. RESULTS: We present FlowerPower, a novel clustering algorithm designed for the identification of global homologs as a precursor to structural phylogenomic analysis. Similar to methods such as PSIBLAST, FlowerPower employs an iterative approach to clustering sequences. However, rather than using a single HMM or profile to expand the cluster, FlowerPower identifies subfamilies using the SCI-PHY algorithm and then selects and aligns new homologs using subfamily hidden Markov models. FlowerPower is shown to outperform BLAST, PSI-BLAST and the UCSC SAM-Target 2K methods at discrimination between proteins in the same domain architecture class and those having different overall domain structures. CONCLUSION: Structural phylogenomic analysis enables biologists to avoid the systematic errors associated with annotation transfer; clustering sequences based on sharing the same domain architecture is a critical first step in this process. FlowerPower is shown to consistently identify homologous sequences having the same domain architecture as the query. AVAILABILITY: FlowerPower is available as a webserver at http://phylogenomics.berkeley.edu/flowerpower/.


Subject(s)
Algorithms , Phylogeny , Protein Structure, Tertiary , Proteins/physiology , Sequence Analysis, Protein/methods , Animals , Cluster Analysis , Databases, Genetic , Humans , Proteins/classification , Research Design , Sequence Alignment
19.
Genome Biol ; 7(9): R83, 2006.
Article in English | MEDLINE | ID: mdl-16973001

ABSTRACT

The Berkeley Phylogenomics Group presents PhyloFacts, a structural phylogenomic encyclopedia containing almost 10,000 'books' for protein families and domains, with pre-calculated structural, functional and evolutionary analyses. PhyloFacts enables biologists to avoid the systematic errors associated with function prediction by homology through the integration of a variety of experimental data and bioinformatics methods in an evolutionary framework. Users can submit sequences for classification to families and functional subfamilies. PhyloFacts is available as a worldwide web resource from http://phylogenomics.berkeley.edu/phylofacts.


Subject(s)
Databases, Protein , Proteins , Animals , Evolution, Molecular , Humans , Phylogeny , Protein Structure, Tertiary , Proteins/chemistry , Proteins/classification , Proteins/genetics , Structure-Activity Relationship
20.
OMICS ; 10(2): 231-7, 2006.
Article in English | MEDLINE | ID: mdl-16901231

ABSTRACT

In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key in the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.


Subject(s)
Phylogeny , Reference Standards , Genomics/standards
...