Results 1 - 11 of 11
1.
Methods Mol Biol ; 1525: 137-164, 2017.
Article in English | MEDLINE | ID: mdl-27896721

ABSTRACT

The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships. In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.


Subjects
Protein Domains/physiology , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Cluster Analysis , Databases, Protein , Protein Domains/genetics , Protein Structure, Tertiary , Proteins/genetics
2.
Methods Mol Biol ; 453: 123-46, 2008.
Article in English | MEDLINE | ID: mdl-18712300

ABSTRACT

The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function, and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based on their evolutionary relationships. This chapter discusses the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and the chapter shows how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.


Subjects
Computational Biology/methods , Protein Structure, Tertiary , Proteins/classification , Databases, Protein , Genomics
3.
Methods Mol Biol ; 426: 3-25, 2008.
Article in English | MEDLINE | ID: mdl-18542854

ABSTRACT

The success of the whole genome sequencing projects brought considerable credence to the belief that high-throughput approaches, rather than traditional hypothesis-driven research, would be essential to structurally and functionally annotate the rapid growth in available sequence data within a reasonable time frame. Such observations supported the emerging field of structural genomics, which is now faced with the task of providing a library of protein structures that represent the biological diversity of the protein universe. To run efficiently, structural genomics projects aim to define a set of targets that maximize the potential of each structure discovery whether it represents a novel structure, novel function, or missing evolutionary link. However, not all protein sequences make suitable structural genomics targets: It takes considerably more effort to determine the structure of a protein than the sequence of its gene because of the increased complexity of the methods involved and also because the behavior of targeted proteins can be extremely variable at the different stages in the structural genomics "pipeline." Therefore, structural genomics target selection must identify and prioritize the most suitable candidate proteins for structure determination, avoiding "problematic" proteins while also ensuring the ultimate goals of the project are followed.


Subjects
Computational Biology/trends , Databases, Protein , Genomics/methods , Structural Homology, Protein , Animals , Humans
4.
BMC Bioinformatics ; 8: 86, 2007 Mar 09.
Article in English | MEDLINE | ID: mdl-17349043

ABSTRACT

BACKGROUND: Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is focussed towards structurally characterising protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families. RESULTS: In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterized families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterized domain families whilst also pursuing additional targets from large, structurally characterised protein superfamilies. CONCLUSION: This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution.


Subjects
Databases, Protein , Genome/genetics , Genomics , Animals , Genomics/methods , Humans , Multigene Family , Sequence Analysis, Protein/methods , Structural Homology, Protein
5.
Philos Trans R Soc Lond B Biol Sci ; 361(1467): 425-40, 2006 Mar 29.
Article in English | MEDLINE | ID: mdl-16524831

ABSTRACT

New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.


Subjects
Evolution, Molecular , Proteins/chemistry , Proteins/metabolism , Algorithms , Computational Biology , Databases, Factual , Protein Conformation
6.
Nucleic Acids Res ; 34(3): 1066-80, 2006.
Article in English | MEDLINE | ID: mdl-16481312

ABSTRACT

We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.


Subjects
Computational Biology , Genome , Genomics , Proteins/chemistry , Proteins/classification , Algorithms , Animals , Eukaryotic Cells/metabolism , Evolution, Molecular , Humans , Multigene Family , Protein Structure, Tertiary , Proteins/metabolism , Proteomics
7.
Nucleic Acids Res ; 33(Web Server issue): W36-8, 2005 Jul 01.
Article in English | MEDLINE | ID: mdl-15980489

ABSTRACT

A number of state-of-the-art protein structure prediction servers have been developed by researchers working in the Bioinformatics Unit at University College London. The popular PSIPRED server allows users to perform secondary structure prediction, transmembrane topology prediction and protein fold recognition. More recent servers include DISOPRED for the prediction of protein dynamic disorder and DomPred for domain boundary prediction. These servers are available from our software home page at http://bioinf.cs.ucl.ac.uk/software.html.


Subjects
Protein Structure, Secondary , Protein Structure, Tertiary , Software , Computational Biology , Humans , Internet , London , Membrane Proteins/chemistry , Models, Molecular , Polypyrimidine Tract-Binding Protein/chemistry , Protein Folding
8.
J Mol Biol ; 348(5): 1235-60, 2005 May 20.
Article in English | MEDLINE | ID: mdl-15854658

ABSTRACT

The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds, respectively. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes. From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.


Subjects
Computational Biology/trends , Databases, Protein , Genomics/methods , Protein Conformation , Animals , Genome , Humans , Sequence Analysis, Protein , Structural Homology, Protein
9.
Proteins ; 59(3): 603-15, 2005 May 15.
Article in English | MEDLINE | ID: mdl-15768405

ABSTRACT

Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (<10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.
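
The complete linkage step mentioned in this abstract can be illustrated with a minimal sketch. It is not the PFscape implementation: the function names, the toy identity measure over pre-aligned sequences, and the 0.7 cutoff are all assumptions made for illustration, and the preceding Markov clustering stage is omitted.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def complete_linkage(seqs: dict[str, str], min_identity: float) -> list[set[str]]:
    """Greedy agglomerative clustering with complete linkage (hypothetical helper).

    Two clusters are merged only if their *least* similar cross-cluster pair
    still meets min_identity; the most similar eligible pair of clusters is
    merged first.
    """
    clusters = [{name} for name in seqs]
    while True:
        best = None  # (score, i, j) of the best eligible merge found so far
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster similarity is the worst pairwise identity.
            score = min(
                identity(seqs[p], seqs[q])
                for p in clusters[i]
                for q in clusters[j]
            )
            if score >= min_identity and (best is None or score > best[0]):
                best = (score, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because j > i


# Toy, pre-aligned sequences (illustrative data only).
toy = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
families = complete_linkage(toy, min_identity=0.7)
```

With complete linkage every member of a family is at least `min_identity` identical to every other member, which is a stricter condition than single linkage, where one similar pair is enough to join two clusters.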


Subjects
Databases, Nucleic Acid , Genome , Proteins/chemistry , Amino Acid Sequence , Cluster Analysis , Enzymes/chemistry , Enzymes/genetics , Enzymes/metabolism , Multigene Family , Protein Biosynthesis , Proteins/classification , Proteins/genetics
10.
Bioinformatics ; 20(14): 2288-95, 2004 Sep 22.
Article in English | MEDLINE | ID: mdl-15201178

ABSTRACT

MOTIVATION: Target selection strategies for structural genomics projects must be able to prioritize gene regions on the basis of significant sequence similarity with proteins that have already been structurally determined. With the rapid development of protein comparison software, a robust prioritization scheme should be independent of the choice of algorithm and be able to incorporate different sequence similarity thresholds. RESULTS: A robust target selection strategy has been developed that can assign a priority level to all genes in any genome. Structural assignments to genome sequences are calculated at two thresholds and six levels (1-6) describe the prioritization of all whole genes and partial gene regions. This simple two-threshold approach can be implemented with any fold recognition or homology detection algorithms. The results for 10 genomes are presented using the SSEARCH and PSI-BLAST programs. AVAILABILITY: Programs are available on request from the authors.
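
The idea of computing structural assignments at two thresholds can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the `Hit` record, the use of E-values as the similarity score, and the cutoffs 1e-10 and 1e-3 are all hypothetical, and the sketch deliberately stops short of the six priority levels because the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A match between a gene region and a protein of known structure (hypothetical record)."""
    start: int      # 1-based, inclusive position on the gene product
    end: int        # 1-based, inclusive
    evalue: float   # score reported by whichever comparison method is used


def coverage_at(hits: list[Hit], length: int, max_evalue: float) -> float:
    """Fraction of residues covered by hits that pass the given E-value cutoff."""
    covered = [False] * length
    for h in hits:
        if h.evalue <= max_evalue:
            for pos in range(max(h.start, 1), min(h.end, length) + 1):
                covered[pos - 1] = True
    return sum(covered) / length


def summarise(hits: list[Hit], length: int,
              strict: float = 1e-10, permissive: float = 1e-3) -> dict[str, float]:
    """Structural coverage of one gene at a strict and a permissive threshold.

    Both cutoffs are assumed values chosen for illustration only.
    """
    return {
        "strict": coverage_at(hits, length, strict),
        "permissive": coverage_at(hits, length, permissive),
    }


# One 100-residue gene: a confident hit on the first half, a weaker one on the rest.
example = summarise([Hit(1, 50, 1e-20), Hit(51, 100, 1e-5)], length=100)
```

Because only the score and the covered region of each hit are used, the same bookkeeping works with output from any comparison program; a prioritization scheme could then rank whole genes and partial regions by comparing the two coverage values.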


Subjects
Algorithms , Chromosome Mapping/methods , Gene Targeting/methods , Genes/genetics , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Protein Structure, Tertiary , Sequence Homology, Amino Acid , Structure-Activity Relationship
11.
Protein Sci ; 11(12): 2814-24, 2002 Dec.
Article in English | MEDLINE | ID: mdl-12441380

ABSTRACT

The elucidation of the domain content of a given protein sequence in the absence of determined structure or significant sequence homology to known domains is an important problem in structural biology. Here we address how successfully the delineation of continuous domains can be accomplished in the absence of sequence homology using simple baseline methods, an existing prediction algorithm (Domain Guess by Size), and a newly developed method (DomSSEA). The study was undertaken with a view to measuring the usefulness of these prediction methods in terms of their application to fully automatic domain assignment. Thus, the sensitivity of each domain assignment method was measured by calculating the number of correctly assigned top scoring predictions. We have implemented a new continuous domain identification method using the alignment of predicted secondary structures of target sequences against observed secondary structures of chains with known domain boundaries as assigned by Class Architecture Topology Homology (CATH). Taking top predictions only, the success rate of the method in correctly assigning domain number to the representative chain set is 73.3%. The top prediction for domain number and location of domain boundaries was correct for 24% of the multidomain set (±20 residues). These results have been put into context in relation to the results obtained from the other prediction methods assessed.
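
The two success criteria quoted in this abstract (correct domain number, and correct boundary locations within ±20 residues) can be made concrete with a small sketch. The function names and the pairing of boundaries in sequence order are assumptions; the abstract does not describe the exact scoring procedure.

```python
def domain_number_correct(predicted: list[int], observed: list[int]) -> bool:
    """True if the predicted number of domains matches the observed number.

    For continuous domains, n boundaries imply n + 1 domains, so comparing
    boundary counts is equivalent to comparing domain counts.
    """
    return len(predicted) == len(observed)


def boundaries_correct(predicted: list[int], observed: list[int],
                       tolerance: int = 20) -> bool:
    """True if the domain number is right and every boundary is within the tolerance.

    Boundaries are residue positions; they are paired in sequence order
    (an assumption of this sketch).
    """
    if not domain_number_correct(predicted, observed):
        return False
    return all(
        abs(p - o) <= tolerance
        for p, o in zip(sorted(predicted), sorted(observed))
    )


def success_rate(results: list[bool]) -> float:
    """Percentage of chains whose top prediction was scored as correct."""
    return 100.0 * sum(results) / len(results)


# Illustrative two-domain chain: observed boundary at residue 125.
hit = boundaries_correct([110], [125])   # 15 residues off, inside the window
miss = boundaries_correct([110], [140])  # 30 residues off, outside the window
```

The second criterion is strictly harder than the first, which is consistent with the lower percentage reported for boundary placement than for domain number.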


Subjects
Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Protein Structure, Secondary , Protein Structure, Tertiary , Sensitivity and Specificity , Sequence Alignment