ABSTRACT
Genome3D (https://www.genome3d.eu) is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. Since the last NAR update in 2015, the method of data submission has been overhauled, with annotations now being 'pushed' to the database via an API. As a result, contributing groups are now able to manage their own structural annotations, making the resource more flexible and maintainable. The new submission protocol brings a number of additional benefits, including instant validation of data and no requirement to synchronise releases between resources. It also makes it possible to implement the submission of these structural annotations as an automated part of existing internal workflows. In turn, these improvements facilitate Genome3D being opened up to new prediction algorithms and groups. For the latest release of Genome3D (v2.1), the underlying dataset of sequences used as prediction targets has been updated using the latest reference proteomes available in UniProtKB. A number of new reference proteomes of particular interest to the wider scientific community have also been added: cow, pig, wheat and Mycobacterium tuberculosis. These additions, along with improvements to the underlying predictions from contributing resources, have ensured that the number of annotations in Genome3D has nearly doubled since the last NAR update article. The new API has also been used to facilitate the dissemination of Genome3D data into InterPro, thereby widening the visibility of both the annotation data and annotation algorithms.
Subjects
Proteins/chemistry , Databases, Protein , Proteins/classification , Proteins/genetics , User-Computer Interface
ABSTRACT
PhyreRisk is an open-access web application for interactively bridging genomic, proteomic and structural data, facilitating the mapping of human variants onto protein structures. A major advance over other tools for sequence-structure variant mapping is that PhyreRisk provides information on 20,214 human canonical proteins and an additional 22,271 alternative protein sequences (isoforms). Specifically, PhyreRisk provides structural coverage (partial or complete) for 70% (14,035 of 20,214 canonical proteins) of the human proteome, by storing 18,874 experimental structures and 84,818 pre-built models of canonical proteins and their isoforms generated using our in-house Phyre2. PhyreRisk reports 55,732 experimentally multi-validated protein interactions from IntAct and 24,260 experimental structures of protein complexes. Another major feature of PhyreRisk is that, rather than presenting a limited set of precomputed variant-structure mappings of known genetic variants, it allows the user to explore novel variants using, as input, genomic coordinate formats (Ensembl, VCF, reference SNP ID and HGVS notations) on human genome builds GRCh37 and GRCh38. PhyreRisk also supports mapping variants using amino acid coordinates and searching for genes or proteins of interest. PhyreRisk is designed to empower researchers to translate genetic data into protein structural information, thereby providing a more comprehensive appreciation of the functional impact of variants. PhyreRisk is freely available at http://phyrerisk.bc.ic.ac.uk.
Subjects
Computational Biology/methods , Genetic Variation , Proteins/chemistry , Genomics , Humans , Protein Conformation , Proteins/genetics , Proteins/metabolism , Proteomics , Software
ABSTRACT
A major challenge in current systems biology research arises when different types of data must be accessed from separate sources and visualized using separate tools. The high cognitive load required to navigate such a workflow is detrimental to hypothesis generation. Accordingly, there is a need for a robust research platform that incorporates all data and provides integrated search, analysis, and visualization features through a single portal. Here, we present ePlant (http://bar.utoronto.ca/eplant), a visual analytic tool for exploring multiple levels of Arabidopsis thaliana data through a zoomable user interface. ePlant connects to several publicly available web services to download genome, proteome, interactome, transcriptome, and 3D molecular structure data for one or more genes or gene products of interest. Data are displayed with a set of visualization tools that are presented using a conceptual hierarchy from big to small, and many of the tools combine information from more than one data type. We describe the development of ePlant in this article and present several examples illustrating its integrative features for hypothesis generation. We also describe the process of deploying ePlant as an "app" on Araport. Because it builds on readily available web services, the code for ePlant is freely available for research on any other biological species.
Subjects
Botany , Software , Statistics as Topic , Systems Biology , Base Sequence , Chromosomes, Plant/genetics , Gene Expression Regulation, Plant , Subcellular Fractions/metabolism , User-Computer Interface
ABSTRACT
Autotransporters (ATs) belong to a family of modular proteins secreted by the Type V, subtype a, secretion system (T5aSS) and are considered an important source of virulence factors in lipopolysaccharidic diderm bacteria (archetypical Gram-negative bacteria). While exported by the Sec pathway, the ATs are further secreted across the outer membrane via their own C-terminal translocator, which forms a β-barrel through which the rest of the protein, namely the passenger, can pass. In several ATs, an autochaperone domain (AC) present at the C-terminal region of the passenger and upstream of the translocator was demonstrated to be strictly required for proper secretion and folding. However, because it has been functionally characterised and identified in only a handful of ATs, doubt has recently been cast on the commonality and conservation of this structural element in the T5aSS. To circumvent the issue of sequence divergence, and taking advantage of the resolved three-dimensional structure of some ACs, identification of this domain was performed following structural alignment among all AT passengers experimentally resolved by crystallography, before searching in a dataset of 1523 ATs. While demonstrating that the AC is indeed a conserved structure found in numerous ATs, phylogenetic analysis further revealed a distribution into deeply rooted branches, from which 20 main clusters emerge. Sequence analysis revealed that an AC could be identified in the large majority of SAATs (self-associating ATs) but not in any LEATs (lipase/esterase ATs) nor in some PATs (protease autotransporters) and PHATs (phosphatase/hydrolase ATs). Structural analysis indicated that an AC was present in passengers exhibiting a single-stranded right-handed parallel β-helix, whatever the type of β-solenoid, but not in those with an α-helical globular fold.
From this investigation, the AC of type 1 appears to be a prevalent and conserved structural element exclusively associated with β-helical AT passengers, and this finding should promote further studies of protein secretion and folding via the T5aSS, especially with regard to α-helical AT passengers.
ABSTRACT
The identification of structurally similar proteins can provide a range of biological insights, and accordingly, the alignment of a query protein to a database of experimentally determined protein structures is a technique commonly used in the fields of structural and evolutionary biology. The PhyreStorm Web server has been designed to provide comprehensive, up-to-date and rapid structural comparisons against the Protein Data Bank (PDB) combined with a rich and intuitive user interface. It is intended that this facility will give biologists inexpert in bioinformatics access to a powerful tool for exploring protein structure relationships beyond what can be achieved by sequence analysis alone. By partitioning the PDB into similar structures, PhyreStorm is able to quickly discard the majority of structures that cannot possibly align well to a query protein, reducing the number of alignments required by an order of magnitude. PhyreStorm is capable of finding 93±2% of all highly similar (TM-score > 0.7) structures in the PDB for each query structure, usually in less than 60 s. PhyreStorm is available at http://www.sbg.bio.ic.ac.uk/phyrestorm/.
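The similarity threshold quoted above is expressed as a TM-score. As a rough illustration only, the following Python sketch evaluates the standard TM-score formula for one fixed superposition; it is not PhyreStorm's code. The function names and the 0.5 Å floor for short chains are assumptions made for this example, and a real TM-score additionally maximises the sum over all possible superpositions.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in Angstroms) used by the TM-score."""
    if target_length <= 21:
        return 0.5  # assumed floor for very short chains
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length: int) -> float:
    """TM-score for ONE fixed superposition.

    distances: distances (Angstroms) between aligned residue pairs.
    target_length: number of residues in the structure used for normalisation.
    Unaligned residues simply contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical example: 100-residue target, 80 aligned pairs, each 2 Angstroms apart.
example = tm_score([2.0] * 80, 100)
print(f"d0 = {tm_d0(100):.3f} A, TM-score = {example:.3f}")
```

Because every aligned pair contributes at most 1 and the sum is divided by the target length, the score lies between 0 and 1, which is why a fixed cut-off such as 0.7 can be applied across proteins of different sizes.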
Subjects
Computational Biology/methods , Databases, Protein , Protein Conformation , Proteins/chemistry , Internet
ABSTRACT
Phyre2 is a suite of tools available on the web to predict and analyze protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a paper in Nature Protocols. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a user's protein sequence. Users are guided through results by a simple interface at a level of detail they determine. This protocol will guide users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit large numbers of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 min and 2 h after submission.
Subjects
Models, Molecular , Protein Conformation , Software , Computational Biology , Internet
ABSTRACT
Protein domains are generally thought to correspond to units of evolution. New research raises questions about how such domains are defined with bioinformatics tools and sheds light on how evolution has enabled partial domains to be viable.
Subjects
Bacterial Proteins/genetics , Gene Deletion , Genes, Bacterial , Luciferases/genetics , Molecular Sequence Annotation , Oxidoreductases/genetics , Protein Structure, Tertiary , Proteins/chemistry , Sequence Alignment , Animals , Humans
ABSTRACT
Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.
Subjects
Databases, Protein , Molecular Sequence Annotation , Protein Structure, Tertiary , Algorithms , Genomics , Internet , Models, Molecular , Protein Structure, Tertiary/genetics , Sequence Analysis, Protein
ABSTRACT
Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html.
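The balanced accuracy quoted above is the mean of sensitivity and specificity, which keeps the figure meaningful when disease-associated and neutral variants are unevenly represented. A minimal Python sketch follows; the function name and the confusion-matrix counts are hypothetical, chosen only so that the result matches 82%, and are not taken from the VariBench evaluation.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (true-positive rate) and specificity (true-negative rate)."""
    sensitivity = tp / (tp + fn)  # fraction of disease-associated SAVs recovered
    specificity = tn / (tn + fp)  # fraction of neutral SAVs correctly rejected
    return (sensitivity + specificity) / 2.0


# Hypothetical counts: 100 disease-associated and 100 neutral variants.
score = balanced_accuracy(tp=90, fn=10, tn=74, fp=26)
print(f"balanced accuracy = {score:.2f}")
```

Unlike plain accuracy, this measure cannot be inflated by always predicting the majority class: a predictor that labels every variant as neutral scores 0.5.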
Subjects
Amino Acid Substitution , Disease Susceptibility , Proteins/chemistry , Software , Child , Child Abuse , Computational Biology/methods , Humans , Models, Molecular , Mutation, Missense , Phenotype , Protein Conformation , Proteins/genetics , Proteins/metabolism
ABSTRACT
Coarse-grained (CG) methods for sampling protein conformational space have the potential to increase computational efficiency by reducing the degrees of freedom. The gain in computational efficiency of CG methods often comes at the expense of non-protein-like local conformational features. This could cause problems when transitioning to full-atom models in a hierarchical framework. Here, a CG potential energy function was validated by applying it to the problem of loop prediction. A novel method to sample the conformational space of backbone atoms was benchmarked using a standard test set consisting of 351 distinct loops. This method used a sequence-independent CG potential energy function representing the protein using α-carbon positions only and sampling conformations with a Monte Carlo simulated annealing based protocol. Backbone atoms were added using a method previously described and then gradient minimised in the Rosetta force field. Despite the CG potential energy function being sequence-independent, the method performed similarly to methods that explicitly use either fragments of known protein backbones with similar sequences or residue-specific φ/ψ-maps to restrict the search space. The method was also able to predict with sub-Angstrom accuracy two out of seven loops from recently solved crystal structures of proteins with low sequence and structure similarity to previously deposited structures in the PDB. The ability to sample realistic loop conformations directly from a potential energy function enables the incorporation of additional geometric restraints and the use of more advanced sampling methods in a way that is not easily possible with fragment replacement methods, and also enables multi-scale simulations for protein design and protein structure prediction. These restraints could be derived from experimental data or could be design restraints in the case of computational protein design.
C++ source code is available for download from http://www.sbg.bio.ic.ac.uk/phyre2/PD2/.
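For readers unfamiliar with Monte Carlo simulated annealing, the sampling idea mentioned above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the PD2 implementation: the one-dimensional toy energy, the proposal move, the geometric cooling schedule and every function name are hypothetical stand-ins for the coarse-grained potential and backbone moves used in the paper.

```python
import math
import random


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept downhill moves, uphill with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(energy, propose, start, t_start=2.0, t_end=0.01, steps=5000, seed=0):
    """Minimise `energy` by Metropolis sampling while the temperature is lowered geometrically."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for i in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temperature = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


# Toy stand-ins: a quadratic "energy" with its minimum at x = 3 and a small random move.
def toy_energy(x: float) -> float:
    return (x - 3.0) ** 2


def toy_move(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = simulated_annealing(toy_energy, toy_move, start=0.0)
print(f"best x = {best_x:.3f}, energy = {best_e:.5f}")
```

Because the search is driven only by an energy function, extra restraint terms can be added to `energy` without changing the sampler, which is the flexibility the abstract contrasts with fragment replacement.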
Subjects
Models, Chemical , Proteins/chemistry , Crystallography, X-Ray , Monte Carlo Method , Protein Conformation
ABSTRACT
Coarse-grained protein structure models offer increased efficiency in structural modeling, but these must be coupled with fast and accurate methods to revert to a full-atom structure. Here, we present a novel algorithm to reconstruct mainchain models from Cα traces. This has been parameterized by fitting Gaussian mixture models (GMMs) to short backbone fragments centered on idealized peptide bonds. The method we have developed is statistically significantly more accurate than several competing methods, both in terms of RMSD values and dihedral angle differences. The method produced Ramachandran dihedral angle distributions that are closer to those observed in real proteins and better Phaser molecular replacement log-likelihood gains. Amino acid residue sidechain reconstruction accuracy using SCWRL4 was found to be statistically significantly correlated with backbone reconstruction accuracy. Finally, the PD2 method was found to produce significantly lower energy full-atom models using Rosetta, which has implications for multiscale protein modeling using coarse-grained models. A webserver and C++ source code are freely available for noncommercial use from: http://www.sbg.bio.ic.ac.uk/phyre2/PD2_ca2main/.
Subjects
Algorithms , Carbon/chemistry , Molecular Dynamics Simulation , Proteins/chemistry , Software , Protein Conformation
ABSTRACT
Hundreds of putative enzymes from Mycobacterium tuberculosis as well as other mycobacteria remain categorized as "conserved hypothetical proteins" or "hypothetical proteins", offering little or no information on their functional role in pathogenic and non-pathogenic processes. In this study we have predicted the fold and 3-D structure of more than 99% of all proteins encoded in the genome of M. tuberculosis H37Rv. Fold recognition, database searching and 3-D modelling were performed using Protein Homology/analogy Recognition Engine V 2.0 (Phyre2). These results are used to tentatively assign potential functions to unannotated enzymes and proteins. In summary, fold recognition and structural homology might be used as a complementary tool in genome annotation efforts; furthermore, they can deliver primary sequence-independent information regarding structure, ligands and even substrate specificity for enzymes that display low primary sequence identity with potential homologues in other species.
Subjects
Bacterial Proteins/physiology , Mycobacterium tuberculosis/genetics , Bacterial Proteins/genetics , Computational Biology/methods , Genome, Bacterial , Humans , Models, Molecular , Mycobacterium tuberculosis/enzymology , Protein Folding , Proteome/physiology
ABSTRACT
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence-structure-function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).
Subjects
Databases, Protein , Protein Structure, Tertiary , Genomics , Humans , Internet , Molecular Sequence Annotation , Proteins/chemistry , Proteins/classification , Proteins/genetics , Software
ABSTRACT
The acid-labile subunit (ALS) is the main regulator of IGF1 and IGF2 bioavailability. ALS deficiency caused by mutations in the ALS (IGFALS) gene often results in mild short stature in adulthood. Little is known about the ALS structure-function relationship. A structural model built in 1999 suggested a doughnut shape, which has never been observed in the leucine-rich repeat (LRR) superfamily, to which ALS belongs. In this study, we built a new ALS structural model, analysed its glycosylation and charge distribution and studied mechanisms by which missense mutations affect protein structure. We used three structure prediction servers and integrated their results with information derived from ALS experimental studies. The ALS model was built at high confidence using Toll-like receptor protein templates and resembled a horseshoe with an extensively negatively charged concave surface. Enrichment in prolines and disulphide bonds was found at the ALS N- and C-termini. Moreover, seven N-glycosylation sites were identified and mapped. ALS mutations were predicted to affect protein structure by causing loss of hydrophobic interactions (p.Leu134Gln), alteration of the amino acid backbone (p.Leu241Pro, p.Leu172Phe and p.Leu244Phe), loss of disulphide bridges (p.Cys60Ser and p.Cys540Arg), change in structural constraints (p.Pro73Leu), creation of novel glycosylation sites (p.Asp440Asn) or alteration of LRRs (p.Asn276Ser). In conclusion, our ALS structural model was identified as a highly confident prediction by three independent methods and disagrees with the previously published ALS model. The new model allowed us to analyse the ALS core and its caps and to interpret the potential structural effects of ALS mutations.
Subjects
Carrier Proteins/chemistry , Carrier Proteins/metabolism , Dwarfism/metabolism , Glycoproteins/chemistry , Glycoproteins/metabolism , Carrier Proteins/genetics , Dwarfism/genetics , Glycoproteins/genetics , Glycosylation , Humans , Mutation/genetics , Mutation, Missense/genetics , Protein Structure, Secondary
ABSTRACT
ATRX is a member of the Snf2 family of chromatin-remodelling proteins and is mutated in an X-linked mental retardation syndrome associated with alpha-thalassaemia (ATR-X syndrome). We have carried out an analysis of 21 disease-causing mutations within the Snf2 domain of ATRX by quantifying the expression of the ATRX protein and placing all missense mutations in their structural context by homology modelling. While demonstrating the importance of protein dosage to the development of ATR-X syndrome, we also identified three mutations which primarily affect function rather than protein structure. We show that all three of these mutant proteins are defective in translocating along DNA while one mutant, uniquely for a human disease-causing mutation, partially uncouples adenosine triphosphate (ATP) hydrolysis from DNA binding. Our results highlight important mechanistic aspects in the development of ATR-X syndrome and identify crucial functional residues within the Snf2 domain of ATRX. These findings are important for furthering our understanding of how ATP hydrolysis is harnessed as useful work in chromatin remodelling proteins and the wider family of nucleic acid translocating motors.
Subjects
DNA Helicases/genetics , DNA Helicases/metabolism , Mutation/genetics , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Ubiquitin-Protein Ligases/genetics , Amino Acid Sequence , Animals , Cell Line , DNA Helicases/chemistry , Enzyme Activation/physiology , Humans , Insects , Mental Retardation, X-Linked/enzymology , Mental Retardation, X-Linked/genetics , Models, Molecular , Molecular Sequence Data , Nuclear Proteins/chemistry , Protein Conformation , Protein Stability , Sequence Alignment , Translocation, Genetic/genetics , Ubiquitin-Protein Ligases/chemistry , X-linked Nuclear Protein , alpha-Thalassemia/enzymology , alpha-Thalassemia/genetics
ABSTRACT
MOTIVATION: Databases of sequenced genomes are widely used to characterize the structure, function and evolutionary relationships of proteins. The ability to discern such relationships is widely expected to grow as sequencing projects provide novel information, bridging gaps in our map of the protein universe. RESULTS: We have plotted our progress in protein sequencing over the last two decades and found that the rate of novel sequence discovery is in a sustained period of decline. Consequently, PSI-BLAST, the most widely used method to detect remote evolutionary relationships, which relies upon the accumulation of novel sequence data, is now showing a plateau in performance. We interpret this trend as signalling our approach to a representative map of the protein universe and discuss its implications.
Subjects
Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Databases, Protein , Genome , Sequence Alignment , Sequence Homology, Amino Acid
ABSTRACT
3DLigandSite is a web server for the prediction of ligand-binding sites. It is based upon successful manual methods used in the eighth round of the Critical Assessment of techniques for protein Structure Prediction (CASP8). 3DLigandSite utilizes protein-structure prediction to provide structural models for proteins that have not been solved. Ligands bound to structures similar to the query are superimposed onto the model and used to predict the binding site. In benchmarking against the CASP8 targets, 3DLigandSite obtains a Matthews correlation coefficient (MCC) of 0.64, and coverage and accuracy of 71 and 60%, respectively, similar to our manual performance in CASP8. In further benchmarking using a large set of protein structures, 3DLigandSite obtains an MCC of 0.68. The web server enables users to submit either a query sequence or structure. Predictions are visually displayed via an interactive Jmol applet. 3DLigandSite is available for use at http://www.sbg.bio.ic.ac.uk/3dligandsite.
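The correlation coefficient reported above is computed from the four cells of a binary confusion matrix, here residues predicted versus observed to be in a binding site. The Python sketch below shows the standard formula; the function name, the zero-denominator convention and the residue counts are assumptions for illustration and are not data from the CASP8 benchmark.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient of a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (a common convention, assumed here).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical protein: 200 residues, 15 of them true binding-site residues.
mcc_value = matthews_cc(tp=11, tn=178, fp=7, fn=4)
print(f"MCC = {mcc_value:.2f}")
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect), and it stays informative when binding-site residues are a small minority of the protein.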
Subjects
Software , Structural Homology, Protein , Algorithms , Binding Sites , Internet , Ligands , Models, Molecular , Reproducibility of Results , Sequence Analysis, Protein , User-Computer Interface
ABSTRACT
Macromolecular crowding has a profound effect upon biochemical processes in the cell. We have computationally studied the effect of crowding upon protein folding for 12 small domains in a simulated cell using a coarse-grained protein model, which is based upon Langevin dynamics, designed to unify the often disjoint goals of protein folding simulation and structure prediction. The model can make predictions of native conformation with accuracy comparable with that of the best current template-free models. It is fast enough to enable a more extensive analysis of crowding than previously attempted, studying several proteins at many crowding levels and further random repetitions designed to more closely approximate the ensemble of conformations. We found that when crowding approaches 40% excluded volume, the maximum level found in the cell, proteins fold to fewer native-like states. Notably, when crowding is increased beyond this level, there is a sudden failure of protein folding: proteins fix upon a structure more quickly and become trapped in extended conformations. These results suggest that the ability of small protein domains to fold without the help of chaperones may be an important factor in limiting the degree of macromolecular crowding in the cell. Here, we discuss the possible implications regarding the relationship between protein expression level, protein size, chaperone activity and aggregation.
Subjects
Cells/metabolism , Computer Simulation , Protein Folding , Cell Size , Models, Biological , Molecular Chaperones , Molecular Weight
ABSTRACT
Structural genomics initiatives are rapidly generating vast numbers of protein structures. Comparative modelling is also capable of producing accurate structural models for many protein sequences. However, for many of the known structures, functions are not yet determined, and in many modelling tasks, an accurate structural model does not necessarily tell us about function. Thus, there is a pressing need for high-throughput methods for determining function from structure. The spatial arrangement of key amino acids in a folded protein, on the surface or buried in clefts, is often the determinant of its biological function. A central aim of molecular biology is to understand the relationship between such substructures or surfaces and biological function, leading both to function prediction and to function design. We present a new general method for discovering the features of binding pockets that confer specificity for particular ligands. Using a recently developed machine-learning technique which couples the rule-discovery approach of inductive logic programming with the statistical learning power of support vector machines, we are able to discriminate, with high precision (90%) and recall (86%), between pockets that bind FAD and those that bind NAD on a large benchmark set, given only the geometry and composition of the backbone of the binding pocket and without the use of docking. In addition, we learn rules governing this specificity which can feed into protein functional design protocols. An analysis of the rules found suggests that key features of the binding pocket may be tied to conformational freedom in the ligand. The representation is sufficiently general to be applicable to any discriminatory binding problem. All programs and data sets are freely available to non-commercial users at http://www.sbg.bio.ic.ac.uk/svilp_ligand/.
Subjects
Artificial Intelligence , Protein Engineering/methods , Proteins/chemistry , Amino Acid Motifs , Databases, Protein , Flavin-Adenine Dinucleotide/chemistry , Flavin-Adenine Dinucleotide/metabolism , Ligands , Models, Molecular , NAD/chemistry , NAD/metabolism , Protein Binding , Protein Conformation , Proteins/metabolism , Reproducibility of Results , Structure-Activity Relationship , Substrate Specificity
ABSTRACT
Determining the structure and function of a novel protein is a cornerstone of many aspects of modern biology. Over the past decades, a number of computational tools for structure prediction have been developed. It is critical that the biological community is aware of such tools and is able to interpret their results in an informed way. This protocol provides a guide to interpreting the output of structure prediction servers in general and one such tool in particular, the protein homology/analogy recognition engine (Phyre). New profile-profile matching algorithms have improved structure prediction considerably in recent years. Although the performance of Phyre is typical of many structure prediction systems using such algorithms, all these systems can reliably detect up to twice as many remote homologies as standard sequence-profile searching. Phyre is widely used by the biological community, with >150 submissions per day, and provides a simple interface to results. Phyre takes 30 min to predict the structure of a 250-residue protein.