ABSTRACT
Many proteins fold into highly regular and repetitive three-dimensional structures. The analysis of structural patterns and repeated elements is fundamental to understanding protein function and evolution. We present recent improvements to the CE-Symm tool for systematically detecting and analyzing the internal symmetry and structural repeats in proteins. In addition to the accurate detection of internal symmetry, the tool is now capable of i) reporting the type of symmetry, ii) identifying the smallest repeating unit, iii) describing the arrangement of repeats with transformation operations and symmetry axes, and iv) comparing the similarity of all the internal repeats at the residue level. CE-Symm 2.0 helps the user investigate proteins with a robust and intuitive sequence-to-structure analysis, with many applications in protein classification, functional annotation and evolutionary studies. We describe the algorithmic extensions of the method and demonstrate its applications to the study of interesting cases of protein evolution.
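As an illustration of item iii), a symmetry operation between two repeats can be summarized by the axis and angle of its rotation. The sketch below is not CE-Symm code; it applies the standard axis-angle extraction to a 3x3 rotation matrix, and the function names and the `tolerance` parameter are hypothetical choices made for this example.

```python
import math

def rotation_axis_angle(R):
    """Return (axis, angle) for a proper 3x3 rotation matrix given as nested lists."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle = math.acos(cos_theta)
    # The antisymmetric part of R is the axis scaled by 2*sin(angle).
    ax = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    norm = math.sqrt(sum(c * c for c in ax))
    if norm >= 1e-9:
        return tuple(c / norm for c in ax), angle
    if angle < 1e-9:
        raise ValueError("identity rotation has no defined axis")
    # Angle is 180 degrees: R + I equals 2 * axis * axis^T, so any
    # non-zero column of R + I is parallel to the axis (sign is arbitrary).
    M = [[R[i][j] + (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
    j = max(range(3), key=lambda k: M[k][k])
    col = (M[0][j], M[1][j], M[2][j])
    norm = math.sqrt(sum(c * c for c in col))
    return tuple(c / norm for c in col), angle

def symmetry_order(angle, tolerance=0.1):
    """Guess the cyclic order n with angle close to 2*pi/n (radians), or None."""
    if angle < 1e-9:
        return None
    n = round(2 * math.pi / angle)
    if n >= 2 and abs(angle - 2 * math.pi / n) < tolerance:
        return n
    return None

# Example: a 120-degree rotation about the z axis, as in a three-fold repeat.
s = math.sqrt(3) / 2
threefold = [[-0.5, -s, 0.0], [s, -0.5, 0.0], [0.0, 0.0, 1.0]]
axis, angle = rotation_axis_angle(threefold)
order = symmetry_order(angle)
```

The fixed tolerance in radians becomes too permissive for high orders, where successive values of 2*pi/n lie close together; a real tool would also need to handle screw components and imperfect superpositions.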
Subjects
Algorithms , Computational Biology/methods , Proteins/chemistry , Software , Amino Acid Sequence , Protein Databases , Molecular Models , Protein Sequence Analysis
ABSTRACT
BioJava is an open-source project that provides a Java library for processing biological data. The project aims to simplify bioinformatic analyses by implementing parsers, data structures, and algorithms for common tasks in genomics, structural biology, ontologies, phylogenetics, and more. Since 2012, we have released two major versions of the library (4 and 5) that include many new features to tackle challenges with increasingly complex macromolecular structure data. BioJava requires Java 8 or higher and is freely available under the LGPL 2.1 license. The project is hosted on GitHub at https://github.com/biojava/biojava. More information and documentation can be found online on the BioJava website (http://www.biojava.org) and tutorial (https://github.com/biojava/biojava-tutorial). All inquiries should be directed to the GitHub page or the BioJava mailing list (http://lists.open-bio.org/mailman/listinfo/biojava-l).
Subjects
Computational Biology/methods , Access to Information , Algorithms , Gene Library , Genome/genetics , Genomics , Information Storage and Retrieval , Internet , Software
ABSTRACT
A correct assessment of the quaternary structure of proteins is a fundamental prerequisite to understanding their function, physico-chemical properties and mode of interaction with other proteins. Currently about 90% of structures in the Protein Data Bank are crystal structures, in which the correct quaternary structure is embedded in the crystal lattice among a number of crystal contacts. Computational methods are required to 1) classify all protein-protein contacts in crystal lattices as biologically relevant or crystal contacts and 2) provide an assessment of how the biologically relevant interfaces combine into a biological assembly. In our previous work we addressed the first problem with our EPPIC (Evolutionary Protein Protein Interface Classifier) method. Here, we present our solution to the second problem with a new method that combines the interface classification results with symmetry and topology considerations. The new algorithm enumerates all possible valid assemblies within the crystal using a graph representation of the lattice and predicts the most probable biological unit based on the pairwise interface scoring. Our method achieves 85% precision (ranging from 76% to 90% for different oligomeric types) on a new dataset of 1,481 biological assemblies with consensus of PDB annotations. Although almost the same precision is achieved by PISA, currently the most popular quaternary structure assignment method, we show that, due to the fundamentally different approach to the problem, the two methods are complementary and could be combined to improve biological assembly assignments. The software for the automatic assessment of protein assemblies (EPPIC version 3) has been made available through a web server at http://www.eppic-web.org.
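The step from pairwise interface calls to an assembly can be pictured as a graph problem. The following is a much simplified sketch, not the EPPIC algorithm: it treats chains as nodes, keeps only interfaces whose score passes a cutoff, and reports connected components as candidate assemblies. The score scale, the `threshold` value and the function name are assumptions for illustration, and the sketch ignores lattice periodicity, symmetry checks and the enumeration of alternative assemblies described in the abstract.

```python
from collections import defaultdict

def assemblies_from_interfaces(chains, interfaces, threshold=0.5):
    """Group chains into candidate assemblies.

    chains: iterable of chain identifiers.
    interfaces: iterable of (chain_a, chain_b, score) tuples, where a higher
        score means "more likely biological" (hypothetical scale).
    """
    adjacency = defaultdict(set)
    for a, b, score in interfaces:
        if score >= threshold:  # keep only interfaces called biological
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, assemblies = set(), []
    for chain in chains:
        if chain in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [chain], []
        seen.add(chain)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        assemblies.append(sorted(component))
    return assemblies

# Example: A-B and C-D pass the cutoff, B-C is treated as a crystal contact.
example = assemblies_from_interfaces(
    ["A", "B", "C", "D"],
    [("A", "B", 0.9), ("B", "C", 0.2), ("C", "D", 0.8)],
)
# example == [["A", "B"], ["C", "D"]]
```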
Subjects
Protein Quaternary Structure , Proteins/chemistry , Algorithms , Computational Biology , X-Ray Crystallography/statistics & numerical data , Protein Databases/statistics & numerical data , Molecular Models , Protein Interaction Domains and Motifs , Software
ABSTRACT
We present the results of the first independent assessment of protein assemblies in CASP. A total of 1624 oligomeric models were submitted by 108 predictor groups for the 30 oligomeric targets in the CASP12 edition. We evaluated the accuracy of oligomeric predictions by comparison to their reference structures at the interface patch and residue contact levels. We find that interface patches are more reliably predicted than the specific residue contacts. Whereas none of the 15 hard oligomeric targets have successful predictions for the residue contacts at the interface, six have models with resemblance in the interface patch. Successful predictions of interface patch and contacts exist for all targets suitable for homology modeling, with at least one group improving over the best available template for each target. However, the participation in protein assembly prediction is low and uneven. Three human groups are closely ranked at the top by overall performance, but a server outperforms all other predictors for targets suitable for homology modeling. The state of the art of protein assembly prediction methods is in development and has apparent room for improvement, especially for assemblies without templates.
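The difference between the two evaluation levels can be made concrete with a small example. The sketch below does not reproduce the CASP12 assessment scores; it uses the Jaccard index as a stand-in similarity measure, and the residue labels and function names are invented for illustration. It shows how a model can recover the interface patch perfectly while getting every specific residue contact wrong.

```python
def jaccard(a, b):
    """Jaccard index of two collections, defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def interface_patch(contacts):
    """Residues that take part in at least one inter-chain contact."""
    return {residue for pair in contacts for residue in pair}

# Contacts are (residue on chain A, residue on chain B) pairs.
reference = {("A10", "B20"), ("A11", "B21"), ("A12", "B22")}
model = {("A10", "B21"), ("A11", "B22"), ("A12", "B20")}

patch_score = jaccard(interface_patch(reference), interface_patch(model))
contact_score = jaccard(reference, model)
# patch_score is 1.0 (same residues at the interface),
# contact_score is 0.0 (no residue pair is shared).
```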
Subjects
Computational Biology/methods , Protein Databases , Molecular Models , Molecular Dynamics Simulation , Protein Conformation , Proteins/chemistry , Algorithms , Humans , Protein Folding , Protein Sequence Analysis
ABSTRACT
Our goal is to answer the question: compared with experimental structures, how useful are predicted models for functional annotation? We assessed the functional utility of predicted models by comparing the performances of a suite of methods for functional characterization on the predictions and the experimental structures. We identified 28 sites in 25 protein targets to perform functional assessment. These 28 sites included nine sites with known ligand binding (holo-sites), nine sites that are expected or suggested by experimental authors for small molecule binding (apo-sites), and ten sites containing important motifs, loops, or key residues with important disease-associated mutations. We evaluated the utility of the predictions by comparing their microenvironments to the experimental structures. Overall structural quality correlates with functional utility. However, the best-ranked predictions (global) may not have the best functional quality (local). Our assessment provides an ability to discriminate between predictions with high structural quality. When assessing ligand-binding sites, most prediction methods have higher performance on apo-sites than holo-sites. Some servers show consistently high performance for certain types of functional sites. Finally, many functional sites are associated with protein-protein interaction. We also analyzed biologically relevant features from the protein assemblies of two targets where the active site spanned the protein-protein interface. For the assembly targets, we find that the features in the models are mainly determined by the choice of template.
Subjects
Biological Products/metabolism , Computational Biology/methods , Molecular Models , Statistical Models , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Binding Sites , Catalytic Domain , Humans , Ligands , Protein Binding
ABSTRACT
Modern structural biology still draws the vast majority of information from crystallography, a technique where the objects being investigated are embedded in a crystal lattice. Given the complexity and variety of those objects, it becomes fundamental to computationally assess which of the interfaces in the lattice are biologically relevant and which are simply crystal contacts. Since the mid-1990s, several approaches have been applied to obtain high-accuracy classification of crystal contacts and biological protein-protein interfaces. This review provides an overview of the concepts and main approaches to protein interface classification: thermodynamic estimation of interface stability, evolutionary approaches based on conservation of interface residues, and co-occurrence of the interface across different crystal forms. Among the three categories, evolutionary approaches offer the strongest promise for improvement, thanks to the incessant growth in sequence knowledge. Importantly, protein interface classification algorithms can also be used on multimeric structures obtained using other high-resolution techniques or for protein assembly design or validation purposes. A key issue linked to protein interface classification is the identification of the biological assembly of a crystal structure and the analysis of its symmetry. Here, we highlight the most important concepts and problems to be overcome in assembly prediction. Over the next few years, tools and concepts of interface classification will probably become more frequently used and integrated in several areas of structural biology and structural bioinformatics. Among the main challenges for the future are better addressing of weak interfaces and the application of interface classification concepts to prediction problems like protein-protein docking.
Subjects
Algorithms , Computational Biology/methods , Proteins/chemistry , X-Ray Crystallography , Humans , Molecular Models , Protein Binding , Protein Conformation
ABSTRACT
MOTIVATION: Circular permutation is an important type of protein rearrangement. Natural circular permutations have implications for protein function, stability and evolution. Artificial circular permutations have also been used for protein studies. However, such relationships are difficult to detect for many sequence and structure comparison algorithms and require special consideration. RESULTS: We developed a new algorithm, called Combinatorial Extension for Circular Permutations (CE-CP), which allows the structural comparison of circularly permuted proteins. CE-CP was designed to be user friendly and is integrated into the RCSB Protein Data Bank. It was tested on two collections of circularly permuted proteins. Pairwise alignments can be visualized either in a desktop application or on the web using Jmol and exported to other programs in a variety of formats. AVAILABILITY AND IMPLEMENTATION: The CE-CP algorithm can be accessed through the RCSB website at http://www.rcsb.org/pdb/workbench/workbench.do. Source code is available under the LGPL 2.1 as part of BioJava 3 (http://biojava.org; http://github.com/biojava/biojava). CONTACT: sbliven@ucsd.edu or info@rcsb.org.
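Why circular permutations defeat ordinary sequential alignment, and one common way around it, can be shown with a toy example on strings. The abstract does not describe how CE-CP works internally, so the sketch below is only an analogy: it duplicates one sequence so that a permuted partner appears as a contiguous stretch. The function name is hypothetical, and real structure comparison would use inexact 3D alignment rather than exact string matching.

```python
def find_permutation_point(seq_a, seq_b):
    """Return the index in seq_b at which seq_a starts when seq_b is read
    circularly, or None if seq_a is not a circular permutation of seq_b."""
    if len(seq_a) != len(seq_b):
        return None
    # Doubling seq_b makes every circular rotation of it a plain substring.
    index = (seq_b + seq_b).find(seq_a)
    return index if index != -1 else None

# "DEFGABC" is "ABCDEFG" with the first three residues moved to the end.
cut = find_permutation_point("DEFGABC", "ABCDEFG")  # 3
```

A sequential aligner comparing the two strings directly would match at most one of the two segments; after duplication both segments line up in a single pass.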
Subjects
Algorithms , Computational Biology/methods , Protein Databases , Dynamins/chemistry , Protein Structural Homology , Humans , Programming Languages , Protein Tertiary Structure , Protein Sequence Analysis/methods
ABSTRACT
BACKGROUND: Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. RESULTS: An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/#downloads.
CONCLUSIONS: Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
Subjects
Computational Biology/methods , Protein Databases , Proteins/chemistry , Amino Acid Sequence , Conserved Sequence , Molecular Models , Protein Binding , Protein Secondary Structure , Proteins/metabolism
ABSTRACT
UNLABELLED: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality. RESULTS: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model. AVAILABILITY: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.
Subjects
Proteins/chemistry , Sequence Analysis , Software , Amino Acids/chemistry , Computational Biology , Genomics , Protein Conformation , Post-Translational Protein Processing , Sequence Alignment , DNA Sequence Analysis , Protein Sequence Analysis
ABSTRACT
Low-pass spectral analysis (LPSA) is a recently developed dynamics retrieval algorithm showing excellent retrieval properties when applied to model data affected by extreme incompleteness and stochastic weighting. In this work, we apply LPSA to an experimental time-resolved serial femtosecond crystallography (TR-SFX) dataset from the membrane protein bacteriorhodopsin (bR) and analyze its parametric sensitivity. While most dynamical modes are contaminated by nonphysical high-frequency features, we identify two dominant modes, which are little affected by spurious frequencies. The dynamics retrieved using these modes shows an isomerization signal compatible with previous findings. We employ synthetic data with increasing timing uncertainty, increasing incompleteness level, pixel-dependent incompleteness, and photon counting errors to investigate the root cause of the high-frequency contamination of our TR-SFX modes. By testing a range of methods, we show that timing errors comparable to the dynamical periods to be retrieved produce a smearing of dynamical features, hampering dynamics retrieval, but with no introduction of spurious components in the solution, when convergence criteria are met. Using model data, we are able to attribute the high-frequency contamination of low-order dynamical modes to the high levels of noise present in the data. Finally, we propose a method to handle missing observations that produces a substantial dynamics retrieval improvement from synthetic data with a significant static component. Reprocessing of the bR TR-SFX data using the improved method yields dynamical movies with strong isomerization signals compatible with previous findings.
ABSTRACT
SUMMARY: With the continuous growth of the RCSB Protein Data Bank (PDB), providing an up-to-date systematic structure comparison of all protein structures poses an ever-growing challenge. Here, we present a comparison tool for calculating both 1D protein sequence and 3D protein structure alignments. This tool supports various applications at the RCSB PDB website. First, a structure alignment web service calculates pairwise alignments. Second, a stand-alone application runs alignments locally and visualizes the results. Third, pre-calculated 3D structure comparisons for the whole PDB are provided and updated on a weekly basis. These three applications allow users to discover novel relationships between proteins available either at the RCSB PDB or provided by the user. AVAILABILITY AND IMPLEMENTATION: A web user interface is available at http://www.rcsb.org/pdb/workbench/workbench.do. The source code is available under the LGPL license from http://www.biojava.org. A source bundle, prepared for local execution, is available from http://source.rcsb.org. CONTACT: andreas@sdsc.edu; pbourne@ucsd.edu.
Subjects
Protein Databases , Software , Protein Structural Homology , Algorithms , Amino Acid Sequence , Internet , Proteins/chemistry , User-Computer Interface
ABSTRACT
The Tandem Repeat Annotation Library (TRAL) focuses on analyzing tandem repeat units in genomic sequences. TRAL can integrate and harmonize tandem repeat annotations from a large number of external tools, and provides a statistical model for evaluating and filtering the detected repeats. TRAL version 2.0 includes new features such as a module for identifying repeats from circular profile hidden Markov models, a new repeat alignment method based on the progressive Poisson Indel Process, an improved installation procedure and a Docker container. TRAL is an open-source Python 3 library and is available, together with documentation and tutorials, via vital-it.ch/software/tral.
ABSTRACT
Covalently bound protein kinase inhibitors have been frequently designed to target noncatalytic cysteines at the ATP binding site. Thus, it is important to know if a given cysteine can form a covalent bond. Here we combine a function-site interaction fingerprint method and DFT calculations to determine the potential of cysteines to form a covalent interaction with an inhibitor. By harnessing the human structural kinome, a comprehensive structure-based binding site cysteine data set was assembled. The orientation of the cysteine thiol group indicates which cysteines can potentially form covalent bonds. These cysteines, which are readily accessible to covalent inhibitors, are located within five regions: P-loop, roof of pocket, front pocket, catalytic-2 of the catalytic loop, and DFG-3 close to the DFG peptide. In an independent test set these cysteines covered 95% of covalent kinase inhibitors. This study provides new insights into cysteine reactivity and preference, which are important for the prospective development of covalent kinase inhibitors.
Subjects
Cysteine/metabolism , Protein Kinase Inhibitors/chemistry , Protein Kinase Inhibitors/pharmacology , Protein Kinases/metabolism , Binding Sites , Cysteine/analysis , Humans , Protein Conformation/drug effects , Protein Kinases/chemistry , Structure-Activity Relationship
ABSTRACT
Vascular endothelial growth factors (VEGFs) regulate blood and lymph vessel development upon activation of three receptor tyrosine kinases: VEGFR-1, -2, and -3. Partial structures of VEGFR/VEGF complexes based on single-particle electron microscopy, small-angle X-ray scattering, and X-ray crystallography revealed the location of VEGF binding and domain arrangement of individual receptor subdomains. Here, we describe the structure of the full-length VEGFR-1 extracellular domain in complex with VEGF-A at 4 Å resolution. We combined X-ray crystallography, single-particle electron microscopy, and molecular modeling for structure determination and validation. The structure reveals the molecular details of ligand-induced receptor dimerization, in particular of homotypic receptor interactions in immunoglobulin homology domains 4, 5, and 7. Functional analyses of ligand binding and receptor activation confirm the relevance of these homotypic contacts and identify them as potential therapeutic sites to allosterically inhibit VEGFR-1 activity.
Subjects
Vascular Endothelial Growth Factor A/chemistry , Vascular Endothelial Growth Factor Receptor-1/chemistry , Amino Acid Sequence , Binding Sites , Molecular Cloning , X-Ray Crystallography , Gene Expression , Humans , Ligands , Electron Microscopy , Molecular Models , Protein Binding , alpha-Helical Protein Conformation , beta-Strand Protein Conformation , Protein Interaction Domains and Motifs , Protein Multimerization , Recombinant Proteins/chemistry , Recombinant Proteins/genetics , Recombinant Proteins/metabolism , Sequence Alignment , Amino Acid Sequence Homology , Thermodynamics , Vascular Endothelial Growth Factor A/genetics , Vascular Endothelial Growth Factor A/metabolism , Vascular Endothelial Growth Factor Receptor-1/genetics , Vascular Endothelial Growth Factor Receptor-1/metabolism
ABSTRACT
BACKGROUND: The success of genome-scale models (GEMs) can be attributed to the high-quality, bottom-up reconstructions of metabolic, protein synthesis, and transcriptional regulatory networks on an organism-specific basis. Such reconstructions are biochemically, genetically, and genomically structured knowledge bases that can be converted into a mathematical format to enable a myriad of computational biological studies. In recent years, genome-scale reconstructions have been extended to include protein structural information, which has opened up new vistas in systems biology research and empowered applications in structural systems biology and systems pharmacology. RESULTS: Here, we present the generation, application, and dissemination of genome-scale models with protein structures (GEM-PRO) for Escherichia coli and Thermotoga maritima. We show the utility of integrating molecular scale analyses with systems biology approaches by discussing several comparative analyses on the temperature dependence of growth, the distribution of protein fold families, substrate specificity, and characteristic features of whole cell proteomes. Finally, to aid in the grand challenge of big data to knowledge, we provide several explicit tutorials of how protein-related information can be linked to genome-scale models in a public GitHub repository (https://github.com/SBRG/GEMPro/tree/master/GEMPro_recon/). CONCLUSIONS: Translating genome-scale, protein-related information to structured data in the format of a GEM provides a direct mapping of gene to gene-product to protein structure to biochemical reaction to network states to phenotypic function.
Integration of molecular-level details of individual proteins, such as their physical, chemical, and structural properties, further expands the description of biochemical network-level properties, and can ultimately influence how to model and predict whole cell phenotypes as well as perform comparative systems biology approaches to study differences between organisms. GEM-PRO offers insight into the physical embodiment of an organism's genotype, and its use in this comparative framework enables exploration of adaptive strategies for these organisms, opening the door to many new lines of research. With these provided tools, tutorials, and background, the reader will be in a position to run GEM-PRO for their own purposes.
Subjects
Escherichia coli/genetics , Escherichia coli/metabolism , Proteomics , Systems Biology/methods , Thermotoga maritima/genetics , Thermotoga maritima/metabolism , Escherichia coli/growth & development , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/genetics , Escherichia coli Proteins/metabolism , Biological Models , Molecular Models , Protein Conformation , Amino Acid Sequence Homology , Temperature , Thermotoga maritima/growth & development
ABSTRACT
Symmetry is an important feature of protein tertiary and quaternary structures that has been associated with protein folding, function, evolution, and stability. Its emergence and ensuing prevalence have been attributed to gene duplications, fusion events, and subsequent evolutionary drift in sequence. This process maintains structural similarity and is further supported by this study. To further investigate the question of how internal symmetry evolved, how symmetry and function are related, and the overall frequency of internal symmetry, we developed an algorithm, CE-Symm, to detect pseudo-symmetry within the tertiary structure of protein chains. Using a large manually curated benchmark of 1007 protein domains, we show that CE-Symm performs significantly better than previous approaches. We use CE-Symm to build a census of symmetry among domain superfamilies in SCOP and note that 18% of all superfamilies are pseudo-symmetric. Our results indicate that more domains are pseudo-symmetric than previously estimated. We establish a number of recurring types of symmetry-function relationships and describe several characteristic cases in detail. With the use of the Enzyme Commission classification, symmetry was found to be enriched in some enzyme classes but depleted in others. CE-Symm thus provides a methodology for a more complete and detailed study of the role of symmetry in tertiary protein structure [availability: CE-Symm can be run from the Web at http://source.rcsb.org/jfatcatserver/symmetry.jsp. Source code and software binaries are also available under the GNU Lesser General Public License (version 2.1) at https://github.com/rcsb/symmetry. An interactive census of domains identified as symmetric by CE-Symm is available from http://source.rcsb.org/jfatcatserver/scopResults.jsp].