Search | VHL (BVS) Search Portal

1.

Deep learning for the PSIPRED Protein Analysis Workbench.

Buchan, Daniel W A; Moffat, Lewis; Lau, Andy; Kandathil, Shaun M; Jones, David T.

Nucleic Acids Res ; 52(W1): W287-W293, 2024 Jul 05.

Article in English | MEDLINE | ID: mdl-38747351

ABSTRACT

The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine-learning-based analyses for characterizing protein structure and function. In this paper we provide an update on the recent additions and developments to the web server, with a focus on new deep-learning-based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.

Subjects

Deep Learning; Proteins; Software; Proteins/chemistry; Proteins/genetics; Internet; Protein Conformation; Computational Biology/methods; Sequence Analysis, Protein/methods

2.

Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation.

Sillitoe, Ian; Andreeva, Antonina; Blundell, Tom L; Buchan, Daniel W A; Finn, Robert D; Gough, Julian; Jones, David; Kelley, Lawrence A; Paysan-Lafosse, Typhaine; Lam, Su Datt; Murzin, Alexey G; Pandurangan, Arun Prasad; Salazar, Gustavo A; Skwark, Marcin J; Sternberg, Michael J E; Velankar, Sameer; Orengo, Christine.

Nucleic Acids Res ; 48(D1): D314-D319, 2020 Jan 08.

Article in English | MEDLINE | ID: mdl-31733063

ABSTRACT

Genome3D (https://www.genome3d.eu) is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. Since the last NAR update in 2015, the method of data submission has been overhauled, with annotations now being 'pushed' to the database via an API. As a result, contributing groups are now able to manage their own structural annotations, making the resource more flexible and maintainable. The new submission protocol brings a number of additional benefits, including instant validation of data and no requirement to synchronise releases between resources. It also makes it possible to implement the submission of these structural annotations as an automated part of existing internal workflows. In turn, these improvements facilitate Genome3D being opened up to new prediction algorithms and groups. For the latest release of Genome3D (v2.1), the underlying dataset of sequences used as prediction targets has been updated using the latest reference proteomes available in UniProtKB. A number of new reference proteomes of particular interest to the wider scientific community have also been added: cow, pig, wheat and Mycobacterium tuberculosis. These additions, along with improvements to the underlying predictions from contributing resources, have ensured that the number of annotations in Genome3D has nearly doubled since the last NAR update article. The new API has also been used to facilitate the dissemination of Genome3D data into InterPro, thereby widening the visibility of both the annotation data and annotation algorithms.

Subjects

Proteins/chemistry; Databases, Protein; Proteins/classification; Proteins/genetics; User-Computer Interface

3.

The PSIPRED Protein Analysis Workbench: 20 years on.

Buchan, Daniel W A; Jones, David T.

Nucleic Acids Res ; 47(W1): W402-W407, 2019 Jul 02.

Article in English | MEDLINE | ID: mdl-31251384

ABSTRACT

The PSIPRED Workbench is a web server that has offered a range of predictive methods to the bioscience community for 20 years. Here, we present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The main focus of our recent website upgrade work has been the acceleration of analyses in the face of increasing protein sequence database size. We additionally discuss new software, the new hardware infrastructure, our web services and website. Lastly, we survey updates to some of the key predictive algorithms available through our website.

Subjects

Gene Ontology/trends; Molecular Sequence Annotation/methods; Proteins/chemistry; Software/history; Amino Acid Sequence; Binding Sites; Gene Ontology/history; History, 21st Century; Internet; Models, Molecular; Molecular Sequence Annotation/history; Protein Binding; Protein Conformation, alpha-Helical; Protein Conformation, beta-Strand; Protein Interaction Domains and Motifs; Proteins/history; Sequence Alignment; Sequence Homology, Amino Acid

4.

Learning a functional grammar of protein domains using natural language word embedding techniques.

Buchan, Daniel W A; Jones, David T.

Proteins ; 88(4): 616-624, 2020 Apr.

Article in English | MEDLINE | ID: mdl-31703152

ABSTRACT

In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered as "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
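The "proteins as sentences" idea above can be illustrated with a minimal sketch of how skip-gram training pairs might be derived from domain architectures before being passed to any Word2vec implementation. The function name `skipgram_pairs`, the window size and the two example architectures are assumptions made for illustration only; the Pfam-style accessions are placeholder tokens, not data from the paper.

```python
from collections import Counter


def skipgram_pairs(sentence, window=2):
    """Return (center, context) token pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical multi-domain proteins: each is an ordered list of domain
# identifiers (the "words" of the protein "sentence").
proteins = [
    ["PF00069", "PF07714", "PF00017"],
    ["PF00017", "PF00018"],
]

# Count how often each (center, context) pair occurs across all proteins.
pair_counts = Counter()
for domains in proteins:
    pair_counts.update(skipgram_pairs(domains, window=1))

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

An embedding model trained on such pairs places domains that share neighbours close together in vector space, which is the property the abstract relies on when suggesting GO terms for domains of unknown function.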

Subjects

Molecular Sequence Annotation/methods; Natural Language Processing; Proteins/chemistry; Semantics; Databases, Protein; Datasets as Topic; Gene Ontology; Humans; Protein Domains; Proteins/physiology

5.

Improved protein contact predictions with the MetaPSICOV2 server in CASP12.

Buchan, Daniel W A; Jones, David T.

Proteins ; 86 Suppl 1: 78-83, 2018 Mar.

Article in English | MEDLINE | ID: mdl-28901583

ABSTRACT

In this paper, we present the results for the MetaPSICOV2 contact prediction server in the CASP12 community experiment (http://predictioncenter.org). Over the 35 assessed Free Modelling target domains the MetaPSICOV2 server achieved a mean precision of 43.27%, a substantial increase relative to the server's performance in the CASP11 experiment. In the following paper, we discuss improvements to the MetaPSICOV2 server, covering both changes to the neural network and attempts to integrate contact predictions on a domain basis into the prediction pipeline. We also discuss some limitations in the CASP12 assessment which may have overestimated the performance of our method.

Subjects

Computational Biology/methods; Internet; Machine Learning; Models, Molecular; Neural Networks, Computer; Protein Conformation; Proteins/chemistry; Algorithms; Crystallography, X-Ray; Humans; Protein Interaction Domains and Motifs; Software

6.

EigenTHREADER: analogous protein fold recognition by efficient contact map threading.

Buchan, Daniel W A; Jones, David T.

Bioinformatics ; 33(17): 2684-2690, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28419258

ABSTRACT

MOTIVATION: Protein fold recognition is often trivial when appropriate, evolutionarily related structural templates can be identified, and may even be viewed as a solved problem. However, in cases where no homologous structural templates can be detected, fold recognition is a notoriously difficult problem (Moult et al., 2014). Here we present EigenTHREADER, a novel fold recognition method capable of identifying folds where no homologous structures can be identified. EigenTHREADER takes a query amino acid sequence, generates a map of inter-residue contacts, and then searches a library of contact maps of known structures. To allow the contact maps to be compared, we use eigenvector decomposition to resolve the principal eigenvectors; these can then be aligned using standard dynamic programming algorithms. The approach is similar to the Al-Eigen approach of Di Lena et al. (2010), but with improvements made to both speed and accuracy. With this search strategy, EigenTHREADER does not depend directly on sequence homology between the target protein and entries in the fold library to generate models. This in turn enables EigenTHREADER to correctly identify analogous folds where little or no sequence homology information is available. RESULTS: EigenTHREADER outperforms well-established fold recognition methods such as pGenTHREADER and HHSearch in terms of True Positive Rate in the difficult task of analogous fold recognition. This should allow template-based modelling to be extended to many new protein families that were previously intractable to homology-based fold recognition methods. AVAILABILITY AND IMPLEMENTATION: All code used to generate these results and the computational protocol can be downloaded from https://github.com/DanBuchan/eigen_scripts . EigenTHREADER, the benchmark code and the data this paper is based on can be downloaded from: http://bioinfadmin.cs.ucl.ac.uk/downloads/eigenTHREADER/ . CONTACT: d.t.jones@ucl.ac.uk.
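As a rough illustration of the decomposition step described above, the sketch below recovers only the leading eigenvector of a small symmetric contact map by power iteration. It is not the EigenTHREADER code: the function name `principal_eigenpair` and the toy map are assumptions, and in practice a routine such as `numpy.linalg.eigh` would normally be used to obtain several principal eigenvectors at once.

```python
import math


def principal_eigenpair(matrix, iterations=200):
    """Leading eigenvalue and eigenvector of a symmetric 0/1 contact map.

    Iterates on (matrix + I) rather than on matrix itself: the shift keeps
    the same eigenvectors but avoids the oscillation that plain power
    iteration shows when eigenvalues come in +/- pairs.
    """
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(n)) + vec[i]
               for i, row in enumerate(matrix)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # The Rayleigh quotient gives the eigenvalue of the unshifted matrix.
    product = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec


# Hypothetical 4-residue contact map (1.0 = residues i and j in contact).
toy_map = [
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
]

value, vector = principal_eigenpair(toy_map)
print(value, vector)
```

The eigenvector assigns one number per residue, so two contact maps of different proteins can be compared by aligning these per-residue profiles with dynamic programming, as the abstract outlines.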

Subjects

Computational Biology/methods; Models, Molecular; Protein Folding; Sequence Analysis, Protein/methods; Software; Algorithms

7.

Genome3D: exploiting structure to help users understand their sequences.

Lewis, Tony E; Sillitoe, Ian; Andreeva, Antonina; Blundell, Tom L; Buchan, Daniel W A; Chothia, Cyrus; Cozzetto, Domenico; Dana, José M; Filippis, Ioannis; Gough, Julian; Jones, David T; Kelley, Lawrence A; Kleywegt, Gerard J; Minneci, Federico; Mistry, Jaina; Murzin, Alexey G; Ochoa-Montaño, Bernardo; Oates, Matt E; Punta, Marco; Rackham, Owen J L; Stahlhacke, Jonathan; Sternberg, Michael J E; Velankar, Sameer; Orengo, Christine.

Nucleic Acids Res ; 43(Database issue): D382-6, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25348407

ABSTRACT

Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.

Subjects

Databases, Protein; Molecular Sequence Annotation; Protein Structure, Tertiary; Algorithms; Genomics; Internet; Models, Molecular; Protein Structure, Tertiary/genetics; Sequence Analysis, Protein

8.

A large-scale evaluation of computational protein function prediction.

Radivojac, Predrag; Clark, Wyatt T; Oron, Tal Ronnen; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kaßner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian.

Nat Methods ; 10(3): 221-7, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

Subjects

Computational Biology/methods; Molecular Biology/methods; Molecular Sequence Annotation; Proteins/physiology; Algorithms; Animals; Databases, Protein; Exoribonucleases/classification; Exoribonucleases/genetics; Exoribonucleases/physiology; Forecasting; Humans; Proteins/chemistry; Proteins/classification; Proteins/genetics; Species Specificity

9.

Scalable web services for the PSIPRED Protein Analysis Workbench.

Buchan, Daniel W A; Minneci, Federico; Nugent, Tim C O; Bryson, Kevin; Jones, David T.

Nucleic Acids Res ; 41(Web Server issue): W349-57, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23748958

ABSTRACT

Here, we present the new UCL Bioinformatics Group's PSIPRED Protein Analysis Workbench. The Workbench unites all of our previously available analysis methods into a single web-based framework. The new web portal provides a greatly streamlined user interface with a number of new features to allow users to better explore their results. We offer a number of additional services to enable computationally scalable execution of our prediction methods; these include SOAP and XML-RPC web server access and new HADOOP packages. All software and services are available via the UCL Bioinformatics Group website at http://bioinf.cs.ucl.ac.uk/.

Subjects

Protein Conformation; Software; Animals; Internet; Membrane Proteins/chemistry; Mice; Proteins/chemistry; Sequence Analysis, Protein; Structural Homology, Protein

10.

Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains.

Lewis, Tony E; Sillitoe, Ian; Andreeva, Antonina; Blundell, Tom L; Buchan, Daniel W A; Chothia, Cyrus; Cuff, Alison; Dana, Jose M; Filippis, Ioannis; Gough, Julian; Hunter, Sarah; Jones, David T; Kelley, Lawrence A; Kleywegt, Gerard J; Minneci, Federico; Mitchell, Alex; Murzin, Alexey G; Ochoa-Montaño, Bernardo; Rackham, Owen J L; Smith, James; Sternberg, Michael J E; Velankar, Sameer; Yeats, Corin; Orengo, Christine.

Nucleic Acids Res ; 41(Database issue): D499-507, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23203986

ABSTRACT

Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence-structure-function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).

Subjects

Databases, Protein; Protein Structure, Tertiary; Genomics; Humans; Internet; Molecular Sequence Annotation; Proteins/chemistry; Proteins/classification; Proteins/genetics; Software

11.

Protein function prediction by massive integration of evolutionary analyses and multiple data sources.

Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T.

BMC Bioinformatics ; 14 Suppl 3: S1, 2013.

Article in English | MEDLINE | ID: mdl-23514099

ABSTRACT

BACKGROUND: Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of protein function. The CAFA experiment provided a unique opportunity to undertake comprehensive 'blind testing' of many diverse approaches for automated function prediction. We report on the methodology we used for this challenge and on the lessons we learnt. METHODS: Our method integrates into a single framework a wide variety of biological information sources, encompassing sequence, gene expression and protein-protein interaction data, as well as annotations in UniProt entries. The methodology transfers functional categories based on the results from complementary homology-based and feature-based analyses. We generated the final molecular function and biological process assignments by combining the initial predictions in a probabilistic manner, which takes into account the Gene Ontology hierarchical structure. RESULTS: We propose a novel scoring function called COmbined Graph-Information Content similarity (COGIC) score for the comparison of predicted functional categories and benchmark data. We demonstrate that our integrative approach provides increased scope and accuracy over both the component methods and the naïve predictors. In line with previous studies, we find that molecular function predictions are more accurate than biological process assignments. CONCLUSIONS: Overall, the results indicate that there is considerable room for improvement in the field. It still remains for the community to invest a great deal of effort to make automated function prediction a useful and routine component in the toolbox of life scientists. As already witnessed in other areas, community-wide blind testing experiments will be pivotal in establishing standards for the evaluation of prediction accuracy, in fostering advancements and new ideas, and ultimately in recording progress.

Subjects

Proteins/physiology; Computational Biology/methods; Databases, Protein; Evolution, Molecular; Gene Expression; Molecular Sequence Annotation; Protein Interaction Mapping; Proteins/chemistry; Proteins/genetics; Sequence Analysis

12.

PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

Jones, David T; Buchan, Daniel W A; Cozzetto, Domenico; Pontil, Massimiliano.

Bioinformatics ; 28(2): 184-90, 2012 Jan 15.

Article in English | MEDLINE | ID: mdl-22101153

ABSTRACT

MOTIVATION: The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have major impacts in both structure and function prediction to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is often masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work which had previously demonstrated corrections for phylogenetic and entropic correlation noise and allows accurate discrimination of direct from indirectly coupled mutation correlations in the MSA. RESULTS: PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment. AVAILABILITY: The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.
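The L/5 long-range precision quoted in the results can be made concrete with a small sketch. It follows the definition given in the abstract (top-L/5 predictions, sequence separation >23); the function name `top_l5_precision`, the input layout and the toy predictions are assumptions for illustration and are not taken from the PSICOV source code.

```python
def top_l5_precision(predictions, true_contacts, length, min_gap=23):
    """Precision of the top-L/5 long-range contact predictions.

    predictions   -- iterable of (i, j, score) residue-pair predictions
    true_contacts -- set of (i, j) pairs with i < j observed in the structure
    length        -- protein length L
    min_gap       -- pairs must satisfy |i - j| > min_gap to count as long-range
    """
    long_range = [
        (min(i, j), max(i, j), score)
        for i, j, score in predictions
        if abs(i - j) > min_gap
    ]
    # Rank by predicted score, highest first, and keep the top L/5.
    long_range.sort(key=lambda pair: pair[2], reverse=True)
    top = long_range[: max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (i, j) in true_contacts)
    # Assumption: divide by the number of pairs actually selected.
    return hits / len(top)


# Hypothetical predictions for a protein of length 30 (so L/5 = 6).
toy_predictions = [
    (10, 12, 0.95),  # short-range, ignored by the metric
    (0, 24, 0.90), (0, 25, 0.80), (1, 26, 0.70), (2, 27, 0.60),
    (3, 28, 0.50), (4, 29, 0.40), (5, 29, 0.30),
]
toy_truth = {(0, 24), (1, 26), (3, 28), (5, 29), (10, 12)}

print(top_l5_precision(toy_predictions, toy_truth, length=30))
```

Under this definition, the abstract's statement that 118 of 150 targets reach a value of at least 0.5 means that, for those targets, at least half of the top-ranked long-range pairs are true contacts.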

Subjects

Algorithms; Proteins/chemistry; Sequence Alignment/methods; Bayes Theorem; Mutation; Phylogeny; Proteins/genetics

13.

The Genome3D Consortium for Structural Annotations of Selected Model Organisms.

Waman, Vaishali P; Blundell, Tom L; Buchan, Daniel W A; Gough, Julian; Jones, David; Kelley, Lawrence; Murzin, Alexey; Pandurangan, Arun Prasad; Sillitoe, Ian; Sternberg, Michael; Torres, Pedro; Orengo, Christine.

Methods Mol Biol ; 2165: 27-67, 2020.

Article in English | MEDLINE | ID: mdl-32621218

ABSTRACT

The Genome3D consortium is a collaborative project involving protein structure prediction and annotation resources developed by six world-leading structural bioinformatics groups based in the United Kingdom (namely Blundell, Murzin, Gough, Sternberg, Orengo, and Jones). The main objective of Genome3D is to serve as a common portal providing both predicted models and annotations of proteins in model organisms, using several resources developed by these labs, such as CATH-Gene3D, DOMSERF, pDomTHREADER, PHYRE, SUPERFAMILY, FUGUE/TOCATTA, and VIVACE. These resources primarily use SCOP- and/or CATH-based protein domain assignments. Another objective of Genome3D is to compare structural classifications of protein domains in the CATH and SCOP databases and to provide a consensus mapping of CATH and SCOP protein superfamilies. CATH/SCOP mapping analyses led to the identification of a total of 1429 consensus superfamilies. Currently, Genome3D provides structural annotations for ten model organisms, including Homo sapiens, Arabidopsis thaliana, Mus musculus, Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Plasmodium falciparum, Staphylococcus aureus, and Schizosaccharomyces pombe. Thus, Genome3D serves as a common gateway to each structure prediction/annotation resource and allows users to perform comparative assessment of the predictions. It thus helps researchers broaden their perspective on structure/function predictions for their query protein of interest in selected model organisms.

Subjects

Genomics/organization & administration; Knowledge Bases; Molecular Sequence Annotation/methods; Proteome/chemistry; Animals; Arabidopsis; Genome; Genomics/methods; Humans; Information Dissemination; Sequence Alignment/methods; United Kingdom; Yeasts

14.

Predictions of Backbone Dynamics in Intrinsically Disordered Proteins Using De Novo Fragment-Based Protein Structure Predictions.

Kosciolek, Tomasz; Buchan, Daniel W A; Jones, David T.

Sci Rep ; 7(1): 6999, 2017 Aug 01.

Article in English | MEDLINE | ID: mdl-28765603

ABSTRACT

Intrinsically disordered proteins (IDPs) are a prevalent phenomenon, with over 30% of human proteins estimated to have long disordered regions. Computational methods are widely used to study IDPs; however, nearly all treat disorder in a binary fashion, not accounting for the structural heterogeneity present in disordered regions. Here, we present a new de novo method, FRAGFOLD-IDP, which addresses this problem. Using 200 protein structural ensembles derived from NMR, we show that FRAGFOLD-IDP achieves superior results compared to methods which can predict related data (NMR order parameter, or crystallographic B-factor). FRAGFOLD-IDP produces very good predictions for 33.5% of cases and helps to get a better insight into the dynamics of the disordered ensembles. The results also show it is not necessary to predict the correct fold of the protein to reliably predict per-residue fluctuations. This implies that disorder is a local property that does not depend on the fold. Our results are orthogonal to DynaMine, the only other method significantly better than the naïve prediction. We therefore combine these two using a neural network. FRAGFOLD-IDP enables better insight into backbone dynamics in IDPs and opens exciting possibilities for the design of disordered ensembles, disorder-to-order transitions, or design for protein dynamics.

Subjects

Computational Biology/methods; Intrinsically Disordered Proteins/chemistry; Molecular Biology/methods; Crystallography, X-Ray; Magnetic Resonance Spectroscopy; Models, Molecular; Neural Networks, Computer

15.

Gene3D: structural assignments for the biologist and bioinformaticist alike.

Buchan, Daniel W A; Rison, Stuart C G; Bray, James E; Lee, David; Pearl, Frances; Thornton, Janet M; Orengo, Christine A.

Nucleic Acids Res ; 31(1): 469-73, 2003 Jan 01.

Article in English | MEDLINE | ID: mdl-12520054

ABSTRACT

The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/) provides structural assignments for genes within complete genomes. These are available via the internet from either the World Wide Web or FTP. Assignments are made using PSI-BLAST and subsequently processed using the DRange protocol. The DRange protocol is an empirically benchmarked method for assessing the validity of structural assignments made using sequence searching methods where appropriate assignment statistics are collected and made available. Gene3D links assignments to their appropriate entries in relevant structural and classification resources (PDBsum, CATH database and the Dictionary of Homologous Superfamilies). Release 2.0 of Gene3D includes 62 genomes, 2 eukaryotes, 10 archaea and 40 bacteria. Currently, structural assignments can be made for between 30 and 40 percent of any given genome. In any genome, around half of those genes assigned a structural domain are assigned a single domain and the other half of the genes are assigned multiple structural domains. Gene3D is linked to the CATH database and is updated with each new update of CATH.

Subjects

Databases, Genetic; Genome; Protein Structure, Tertiary; Proteins/chemistry; Animals; Computational Biology; Genome, Archaeal; Genome, Bacterial; Imaging, Three-Dimensional; Internet; Proteins/physiology; Structural Homology, Protein

16.

Evolution of protein superfamilies and bacterial genome size.

Ranea, Juan A G; Buchan, Daniel W A; Thornton, Janet M; Orengo, Christine A.

J Mol Biol ; 336(4): 871-87, 2004 Feb 27.

Article in English | MEDLINE | ID: mdl-15095866

ABSTRACT

We present the structural annotation of 56 different bacterial species based on the assignment of genes to 816 evolutionary superfamilies in the CATH domain structure database. These assignments have enabled us to analyse the recurrence of specific superfamilies within and across the genomes. We have selected the superfamilies that have a very broad representation and therefore appear to be universally distributed in a significant number of bacterial lineages. Occurrence profiles of these universally distributed superfamilies are compared with genome size in order to estimate the correlation between superfamily duplication and the increase in proteome size. This distinguishes between those size-dependent superfamilies where frequency of occurrence is highly correlated with increase in genome size, and size-independent superfamilies where no correlation is observed. Consideration of the size correlation and the ratio between the mean and the standard deviations for all the superfamily profiles allows more detailed subdivisions and classification of superfamilies. For example, within the size-independent superfamilies, we distinguished a group that are distributed evenly amongst all the genomes. Within the size-dependent superfamilies we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COG database was performed for all superfamilies in each of these groups, and this revealed significant differences amongst the three sets of superfamilies. Evenly distributed, size-independent domains are shown to be involved primarily in protein translation and biosynthesis. For the size-dependent superfamilies, linearly distributed superfamilies are involved mainly in metabolism, and non-linearly distributed superfamily domains are involved principally in gene regulation.
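The comparison of occurrence profiles with genome size described above can be sketched as a correlation test. The abstract does not state which correlation measure or cut-off was used, so the Pearson coefficient, the 0.8 threshold and the toy numbers below are all assumptions chosen purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0
    return cov / (spread_x * spread_y)


def classify(profile, genome_sizes, threshold=0.8):
    """Label a superfamily by how strongly its copy number tracks genome size.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    r = pearson(genome_sizes, profile)
    return "size-dependent" if r >= threshold else "size-independent"


# Hypothetical data: gene counts of four genomes and the number of domains
# from two superfamilies found in each genome.
genome_sizes = [1000, 2000, 3000, 4000]
regulator_like = [2, 4, 6, 8]      # grows with genome size
translation_like = [5, 4, 6, 5]    # roughly constant

print(classify(regulator_like, genome_sizes))
print(classify(translation_like, genome_sizes))
```

A further split of the size-dependent group into linear and non-linear behaviour, as the abstract reports, would need an additional model comparison that this sketch does not attempt.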

Subjects

Evolution, Molecular; Genome, Bacterial; Proteins/classification; Proteins/genetics; Databases, Protein; Open Reading Frames; Protein Conformation; Proteins/chemistry; Statistics as Topic

17.

The CATH extended protein-family database: providing structural annotations for genome sequences.

Pearl, Frances M G; Lee, David; Bray, James E; Buchan, Daniel W A; Shepherd, Adrian J; Orengo, Christine A.

Protein Sci ; 11(2): 233-44, 2002 Feb.

Article in English | MEDLINE | ID: mdl-11790833

ABSTRACT

An automatic sequence search and analysis protocol (DomainFinder) based on PSI-BLAST and IMPALA, and using conservative thresholds, has been developed for reliably integrating gene sequences from GenBank into their respective structural families within the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath_new). DomainFinder assigns a new gene sequence to a CATH homologous superfamily provided that PSI-BLAST identifies a clear relationship to at least one other Protein Data Bank sequence within that superfamily. This has resulted in an expansion of the CATH protein family database (CATH-PFDB v1.6) from 19,563 domain structures to 176,597 domain sequences. A further 50,000 putative homologous relationships can be identified using less stringent cut-offs and these relationships are maintained within neighbour tables in the CATH Oracle database, pending further evidence of their suggested evolutionary relationship. Analysis of the CATH-PFDB has shown that only 15% of the sequence families are close enough to a known structure for reliable homology modeling. IMPALA/PSI-BLAST profiles have been generated for each of the sequence families in the expanded CATH-PFDB and a web server has been provided so that new sequences may be scanned against the profile library and be assigned to a structure and homologous superfamily.

Subjects

Genome; Proteins/chemistry; Algorithms; Biological Evolution; Databases, Factual; Databases, Protein; Protein Conformation; Protein Folding; Protein Structure, Tertiary; Proteins/genetics; Structure-Activity Relationship

18.

Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database.

Buchan, Daniel W A; Shepherd, Adrian J; Lee, David; Pearl, Frances M G; Rison, Stuart C G; Thornton, Janet M; Orengo, Christine A.

Genome Res ; 12(3): 503-14, 2002 Mar.

Article in English | MEDLINE | ID: mdl-11875040

ABSTRACT

We present a novel web-based resource, Gene3D, of precalculated structural assignments to gene sequences and whole genomes. This resource assigns structural domains from the CATH database to whole genes and links these to their curated functional and structural annotations within the CATH domain structure database, the functional Dictionary of Homologous Superfamilies (DHS) and PDBsum. Currently Gene3D provides annotation for 36 complete genomes (two eukaryotes, six archaea, and 28 bacteria). On average, between 30% and 40% of the genes of a given genome can be structurally annotated. Matches to structural domains are found using the profile-based method (PSI-BLAST), and a novel protocol, DRange, is used to resolve conflicts in matches involving different homologous superfamilies.

Subjects

Databases, Genetic; Genes/genetics; Genome; Software; Animals; Archaeal Proteins/genetics; Bacterial Proteins/genetics; Databases, Genetic/statistics & numerical data; Databases, Protein; Genes, Archaeal/genetics; Genes, Bacterial/genetics; Genome, Archaeal; Genome, Bacterial; Internet; Protein Structure, Tertiary; Proteins/genetics; Sequence Homology, Nucleic Acid; Software/statistics & numerical data

19.

The CATH protein family database: a resource for structural and functional annotation of genomes.

Orengo, Christine A; Bray, James E; Buchan, Daniel W A; Harrison, Andrew; Lee, David; Pearl, Frances M G; Sillitoe, Ian; Todd, Annabel E; Thornton, Janet M.

Proteomics ; 2(1): 11-21, 2002 Jan.

Article in English | MEDLINE | ID: mdl-11788987

ABSTRACT

Over the last decade, there have been huge increases in the numbers of protein sequences and structures determined. In parallel, many methods have been developed for recognising similarities between these proteins, arising from their common evolutionary background, and for clustering such relatives into protein families. Here we review some of the protein family resources available to the biologist and describe how these can be used to provide structural and functional annotations for newly determined sequences. In particular we describe recent developments to the CATH domain database of protein structural families which have facilitated genome annotation and which have also revealed important caveats that must be considered when transferring functional data between homologous proteins.

Subjects

Databases, Protein; Genome; Protein Conformation
