RESUMEN
The Protein Ensemble Database (PED) (URL: https://proteinensemble.org) is the primary resource for depositing structural ensembles of intrinsically disordered proteins. This updated version of PED reflects advancements in the field, denoting a continual expansion with a total of 461 entries and 538 ensembles, including those generated without explicit experimental data through novel machine learning (ML) techniques. With this significant increment in the number of ensembles, a few yet-unprecedented new entries entered the database, including those also determined or refined by electron paramagnetic resonance or circular dichroism data. In addition, PED was enriched with several new features, including a novel deposition service, improved user interface, new database cross-referencing options and integration with the 3D-Beacons network-all representing efforts to improve the FAIRness of the database. Foreseeably, PED will keep growing in size and expanding with new types of ensembles generated by accurate and fast ML-based generative models and coarse-grained simulations. Therefore, among future efforts, priority will be given to further develop the database to be compatible with ensembles modeled at a coarse-grained level.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas , Proteínas Intrínsecamente Desordenadas/química , Simulación de Dinámica Molecular , Resonancia Magnética Nuclear Biomolecular , Conformación ProteicaRESUMEN
Residue interaction networks (RINs) are a valuable approach for representing contacts in protein structures. RINs have been widely used in various research areas, including the analysis of mutation effects, domain-domain communication, catalytic activity, and molecular dynamics simulations. The RING server is a powerful tool to calculate non-covalent molecular interactions based on geometrical parameters, providing high-quality and reliable results. Here, we introduce RING 4.0, which includes significant enhancements for identifying both covalent and non-covalent bonds in protein structures. It now encompasses seven different interaction types, with the addition of π-hydrogen, halogen bonds and metal ion coordination sites. The definitions of all available bond types have also been refined and RING can now process the complete PDB chemical component dictionary (over 35000 different molecules) which provides atom names and covalent connectivity information for all known ligands. Optimization of the software has improved execution time by an order of magnitude. The RING web server has been redesigned to provide a more engaging and interactive user experience, incorporating new visualization tools. Users can now visualize all types of interactions simultaneously in the structure viewer and network component. The web server, including extensive help and tutorials, is available from URL: https://ring.biocomputingup.it/.
Asunto(s)
Programas Informáticos , Proteínas/química , Proteínas/metabolismo , Internet , Ligandos , Conformación ProteicaRESUMEN
Intrinsic disorder (ID) in proteins is well-established in structural biology, with increasing evidence for its involvement in essential biological processes. As measuring dynamic ID behavior experimentally on a large scale remains difficult, scores of published ID predictors have tried to fill this gap. Unfortunately, their heterogeneity makes it difficult to compare performance, confounding biologists wanting to make an informed choice. To address this issue, the Critical Assessment of protein Intrinsic Disorder (CAID) benchmarks predictors for ID and binding regions as a community blind-test in a standardized computing environment. Here we present the CAID Prediction Portal, a web server executing all CAID methods on user-defined sequences. The server generates standardized output and facilitates comparison between methods, producing a consensus prediction highlighting high-confidence ID regions. The website contains extensive documentation explaining the meaning of different CAID statistics and providing a brief description of all methods. Predictor output is visualized in an interactive feature viewer and made available for download in a single table, with the option to recover previous sessions via a private dashboard. The CAID Prediction Portal is a valuable resource for researchers interested in studying ID in proteins. The server is available at the URL: https://caid.idpcentral.org.
Asunto(s)
Biología Molecular , Proteínas , Benchmarking , Consenso , Proteínas/química , Programas Informáticos , Proteínas Intrínsecamente DesordenadasRESUMEN
The MobiDB database (URL: https://mobidb.org/) is a knowledge base of intrinsically disordered proteins. MobiDB aggregates disorder annotations derived from the literature and from experimental evidence along with predictions for all known protein sequences. MobiDB generates new knowledge and captures the functional significance of disordered regions by processing and combining complementary sources of information. Since its first release 10 years ago, the MobiDB database has evolved in order to improve the quality and coverage of protein disorder annotations and its accessibility. MobiDB has now reached its maturity in terms of data standardization and visualization. Here, we present a new release which focuses on the optimization of user experience and database content. The major advances compared to the previous version are the integration of AlphaFoldDB predictions and the re-implementation of the homology transfer pipeline, which expands manually curated annotations by two orders of magnitude. Finally, the entry page has been restyled in order to provide an overview of the available annotations along with two separate views that highlight structural disorder evidence and functions associated with different binding modes.
Asunto(s)
Proteínas Intrínsecamente Desordenadas , Proteínas Intrínsecamente Desordenadas/química , Bases de Datos de Proteínas , Anotación de Secuencia Molecular , Secuencia de Aminoácidos , Bases del Conocimiento , Conformación ProteicaRESUMEN
RING-PyMOL is a plugin for PyMOL providing a set of analysis tools for structural ensembles and molecular dynamic simulations. RING-PyMOL combines residue interaction networks, as provided by the RING software, with structural clustering to enhance the analysis and visualization of the conformational complexity. It combines precise calculation of non-covalent interactions with the power of PyMOL to manipulate and visualize protein structures. The plugin identifies and highlights correlating contacts and interaction patterns that can explain structural allostery, active sites, and structural heterogeneity connected with molecular function. It is easy to use and extremely fast, processing and rendering hundreds of models and long trajectories in seconds. RING-PyMOL generates a number of interactive plots and output files for use with external tools. The underlying RING software has been improved extensively. It is 10 times faster, can process mmCIF files and it identifies typed interactions also for nucleic acids. AVAILABILITY AND IMPLEMENTATION: https://github.com/BioComputingUP/ring-pymol.
Asunto(s)
Simulación de Dinámica Molecular , Programas Informáticos , Proteínas/química , Análisis por Conglomerados , Dominio CatalíticoRESUMEN
Fuzzy interactions are specific, variable contacts between proteins and other biomolecules (proteins, DNA, RNA, small molecules) formed in accord to the cellular context. Fuzzy interactions have recently been demonstrated to regulate biomolecular condensates generated by liquid-liquid phase separation. The FuzDB v4.0 database (https://fuzdb.org) assembles experimentally identified examples of fuzzy interactions, where disordered regions mediate functionally important, context-dependent contacts between the partners in stoichiometric and higher-order assemblies. The new version of FuzDB establishes cross-links with databases on structure (PDB, BMRB, PED), function (ELM, UniProt) and biomolecular condensates (PhaSepDB, PhaSePro, LLPSDB). FuzDB v4.0 is a source to decipher molecular basis of complex cellular interaction behaviors, including those in protein droplets.
Asunto(s)
ADN/metabolismo , Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/metabolismo , ARN/metabolismo , Programas Informáticos , Secuencia de Aminoácidos , Sitios de Unión , ADN/química , ADN/genética , Humanos , Proteínas Intrínsecamente Desordenadas/química , Proteínas Intrínsecamente Desordenadas/genética , Modelos Moleculares , Anotación de Secuencia Molecular , Transición de Fase , Unión Proteica , Conformación Proteica en Hélice alfa , Conformación Proteica en Lámina beta , Dominios y Motivos de Interacción de Proteínas , ARN/química , ARN/genética , Relación Estructura-ActividadRESUMEN
Residue interaction networks (RINs) are used to represent residue contacts in protein structures. Thanks to the advances in network theory, RINs have been proved effective as an alternative to coordinate data in the analysis of complex systems. The RING server calculates high quality and reliable non-covalent molecular interactions based on geometrical parameters. Here, we present the new RING 3.0 version extending the previous functionality in several ways. The underlying software library has been re-engineered to improve speed by an order of magnitude. RING now also supports the mmCIF format and provides typed interactions for the entire PDB chemical component dictionary, including nucleic acids. Moreover, RING now employs probabilistic graphs, where multiple conformations (e.g. NMR or molecular dynamics ensembles) are mapped as weighted edges, opening up new ways to analyze structural data. The web interface has been expanded to include a simultaneous view of the RIN alongside a structure viewer, with both synchronized and clickable. Contact evolution across models (or time) is displayed as a heatmap and can help in the discovery of correlating interaction patterns. The web server, together with an extensive help and tutorial, is available from URL: https://ring.biocomputingup.it/.
Asunto(s)
Proteínas , Programas Informáticos , Internet , Simulación de Dinámica Molecular , Conformación Proteica , Proteínas/química , ProbabilidadRESUMEN
The Database of Intrinsically Disordered Proteins (DisProt, URL: https://disprot.org) is the major repository of manually curated annotations of intrinsically disordered proteins and regions from the literature. We report here recent updates of DisProt version 9, including a restyled web interface, refactored Intrinsically Disordered Proteins Ontology (IDPO), improvements in the curation process and significant content growth of around 30%. Higher quality and consistency of annotations is provided by a newly implemented reviewing process and training of curators. The increased curation capacity is fostered by the integration of DisProt with APICURON, a dedicated resource for the proper attribution and recognition of biocuration efforts. Better interoperability is provided through the adoption of the Minimum Information About Disorder (MIADE) standard, an active collaboration with the Gene Ontology (GO) and Evidence and Conclusion Ontology (ECO) consortia and the support of the ELIXIR infrastructure.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/metabolismo , Anotación de Secuencia Molecular , Programas Informáticos , Secuencia de Aminoácidos , ADN/genética , ADN/metabolismo , Conjuntos de Datos como Asunto , Ontología de Genes , Humanos , Internet , Proteínas Intrínsecamente Desordenadas/química , Proteínas Intrínsecamente Desordenadas/genética , Unión Proteica , ARN/genética , ARN/metabolismoRESUMEN
Tandem repeats (TRs) in protein sequences are consecutive, highly similar sequence motifs. Some types of TRs fold into structural units that pack together in ensembles, forming either an (open) elongated domain or a (closed) propeller, where the last unit of the ensemble packs against the first one. Here, we examine TR proteins (TRPs) to see how their sequence, structure, and evolutionary properties favor them for a function as mediators of protein interactions. Our observations suggest that TRPs bind other proteins using large, structured surfaces like globular domains; in particular, open-structured TR ensembles are favored by flexible termini and the possibility to tightly coil against their targets. While, intuitively, open ensembles of TRs seem prone to evolve due to their potential to accommodate insertions and deletions of units, these evolutionary events are unexpectedly rare, suggesting that they are advantageous for the emergence of the ancestral sequence but are early fixed. We hypothesize that their flexibility makes it easier for further proteins to adapt to interact with them, which would explain their large number of protein interactions. We provide insight into the properties of open TR ensembles, which make them scaffolds for alternative protein complexes to organize genes, RNA and proteins.
Asunto(s)
Proteínas , Secuencias Repetidas en Tándem , Proteínas/química , Secuencia de AminoácidosRESUMEN
Structured tandem repeats proteins (STRPs) are a specific kind of tandem repeat proteins characterized by a modular and repetitive three-dimensional structure arrangement. The majority of STRPs adopt solenoid structures, but with the increasing availability of experimental structures and high-quality predicted structural models, more STRP folds can be characterized. Here, we describe "Box repeats", an overlooked STRP fold present in the DNA sliding clamp processivity factors, which has eluded classification although structural data has been available since the late 1990s. Each Box repeat is a ßâºßßß module of about 60 residues, which forms a class V "beads-on-a-string" type STRP. The number of repeats present in processivity factors is organism dependent. Monomers of PCNA proteins in both Archaea and Eukarya have 4 repeats, while the monomers of bacterial beta-sliding clamps have 6 repeats. This new repeat fold has been added to the RepeatsDB database, which now provides structural annotation for 66 Box repeat proteins belonging to different organisms, including viruses.
Asunto(s)
Proteínas , Secuencias Repetidas en Tándem , Proteínas/química , Secuencias Repetidas en Tándem/genética , ADN/genéticaRESUMEN
Tandem Repeat Proteins (TRPs) are a class of proteins with repetitive amino acid sequences that have been studied extensively for over two decades. Different features at the level of sequence, structure, function and evolution have been attributed to them by various authors. And yet many of its salient features appear only when looking at specific subclasses of protein tandem repeats. Here, we attempt to rationalize the existing knowledge on Tandem Repeat Proteins (TRPs) by pointing out several dichotomies. The emerging picture is more nuanced than generally assumed and allows us to draw some boundaries of what is not a "proper" TRP. We conclude with an operational definition of a specific subset, which we have denominated STRPs (Structural Tandem Repeat Proteins), which separates a subclass of tandem repeats with distinctive features from several other less well-defined types of repeats. We believe that this definition will help researchers in the field to better characterize the biological meaning of this large yet largely understudied group of proteins.
Asunto(s)
Proteínas , Secuencias Repetidas en Tándem , Proteínas/genética , Proteínas/química , Secuencias Repetidas en Tándem/genética , Secuencia de AminoácidosRESUMEN
SUMMARY: A collection of conformers that exist in a dynamical equilibrium defines the native state of a protein. The structural differences between them describe their conformational diversity, a defining characteristic of the protein with an essential role in multiple cellular processes. Since most proteins carry out their functions by assembling into complexes, we have developed CoDNaS-Q, the first online resource to explore conformational diversity in homooligomeric proteins. It features a curated collection of redundant protein structures with known quaternary structure. CoDNaS-Q integrates relevant annotations that allow researchers to identify and explore the extent and possible reasons of conformational diversity in homooligomeric protein complexes. AVAILABILITY AND IMPLEMENTATION: CoDNaS-Q is freely accessible at http://ufq.unq.edu.ar/codnasq/ or https://codnas-q.bioinformatica.org/home. The data can be retrieved from the website. The source code of the database can be downloaded from https://github.com/SfrRonaldo/codnas-q.
Asunto(s)
Proteínas , Programas Informáticos , Proteínas/química , Conformación Proteica , Bases de Datos FactualesRESUMEN
The MobiDB database (URL: https://mobidb.org/) provides predictions and annotations for intrinsically disordered proteins. Here, we report recent developments implemented in MobiDB version 4, regarding the database format, with novel types of annotations and an improved update process. The new website includes a re-designed user interface, a more effective search engine and advanced API for programmatic access. The new database schema gives more flexibility for the users, as well as simplifying the maintenance and updates. In addition, the new entry page provides more visualisation tools including customizable feature viewer and graphs of the residue contact maps. MobiDB v4 annotates the binding modes of disordered proteins, whether they undergo disorder-to-order transitions or remain disordered in the bound state. In addition, disordered regions undergoing liquid-liquid phase separation or post-translational modifications are defined. The integrated information is presented in a simplified interface, which enables faster searches and allows large customized datasets to be downloaded in TSV, Fasta or JSON formats. An alternative advanced interface allows users to drill deeper into features of interest. A new statistics page provides information at database and proteome levels. The new MobiDB version presents state-of-the-art knowledge on disordered proteins and improves data accessibility for both computational and experimental users.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Algoritmos , Internet , Anotación de Secuencia Molecular , Procesamiento Proteico-Postraduccional , Programas InformáticosRESUMEN
The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increasing the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new levels (Clan > Family) requiring sequence similarity and describing repeat motifs in collaboration with Pfam. Data growth has been addressed with improved mechanisms for browsing the classification hierarchy. A new UniProt-centric view unifies the increasingly frequent annotation of structures from identical or similar sequences. This update of RepeatsDB aligns with our commitment to develop a resource that extracts, organizes and distributes specialized information on tandem repeat protein structures.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/química , Secuencias Repetitivas de Aminoácido , Secuencias Repetidas en Tándem , Ontología de Genes , Células HEK293 , Células HeLa , Humanos , Reproducibilidad de los Resultados , Estadística como Asunto , Interfaz Usuario-ComputadorRESUMEN
The Protein Ensemble Database (PED) (https://proteinensemble.org), which holds structural ensembles of intrinsically disordered proteins (IDPs), has been significantly updated and upgraded since its last release in 2016. The new version, PED 4.0, has been completely redesigned and reimplemented with cutting-edge technology and now holds about six times more data (162 versus 24 entries and 242 versus 60 structural ensembles) and a broader representation of state of the art ensemble generation methods than the previous version. The database has a completely renewed graphical interface with an interactive feature viewer for region-based annotations, and provides a series of descriptors of the qualitative and quantitative properties of the ensembles. High quality of the data is guaranteed by a new submission process, which combines both automatic and manual evaluation steps. A team of biocurators integrate structured metadata describing the ensemble generation methodology, experimental constraints and conditions. A new search engine allows the user to build advanced queries and search all entry fields including cross-references to IDP-related resources such as DisProt, MobiDB, BMRB and SASBDB. We expect that the renewed PED will be useful for researchers interested in the atomic-level understanding of IDP function, and promote the rational, structure-based design of IDP-targeting drugs.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Humanos , Motor de Búsqueda , Proteína p53 Supresora de Tumor/químicaRESUMEN
Major scientific challenges that are beyond the capability of individuals need to be addressed by multi-disciplinary and multi-institutional consortia. Examples of these endeavours include the Human Genome Project, and more recently, the Structural Genomics (SG) initiative. The SG initiative pursues the expansion of structural coverage to include at least one structural representative for each protein family to derive the remaining structures using homology modelling. However, biological function is inherently connected with protein dynamics that can be studied by knowing different structures of the same protein. This ensemble of structures provides snapshots of protein conformational diversity under native conditions. Thus, sequence redundancy in the Protein Data Bank (PDB) (i.e. crystallization of the same protein under different conditions) is therefore an essential input contributing to experimentally based studies of protein dynamics and providing insights into protein function. In this work, we show that sequence redundancy, a key concept for exploring protein dynamics, is highly biased and fundamentally incomplete in the PDB. Additionally, our results show that dynamical behaviour of proteins cannot be inferred using homologous proteins. Minor to moderate changes in sequence can produce great differences in dynamical behaviour. Nonetheless, the structural and dynamical incompleteness of the PDB is apparently unrelated concepts in SG. While the first could be reversed by promoting the extension of the structural coverage, we would like to emphasize that further focused efforts will be needed to amend the incompleteness of the PDB in terms of dynamical information content, essential to fully understand protein function.
Asunto(s)
Bases de Datos de Proteínas/estadística & datos numéricos , Biología Computacional/métodos , Biología Computacional/estadística & datos numéricos , Cristalografía por Rayos X , Genómica/estadística & datos numéricos , Humanos , Simulación de Dinámica Molecular , Conformación Proteica , Proteínas/química , Proteínas/genética , Proteómica/estadística & datos numéricos , Homología de Secuencia de Aminoácido , Homología Estructural de ProteínaRESUMEN
Post-translational modification (PTM) sites have become popular for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. As a guide for effective experimental design, predictors require both high specificity and sensitivity. However, the self-reported performance may often not be indicative of prediction quality and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating the refined models do not sufficiently generalize to detect new sites. The number of false positives is high and precision low, in particular for non-collagen proteins whose motifs are not conserved. As hydroxylation site predictors do not generalize for new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling.
Asunto(s)
Procesamiento Proteico-Postraduccional , Proteínas/metabolismo , Células HeLa , Humanos , Hidroxilación , Transducción de SeñalRESUMEN
Intrinsically disordered protein regions are commonly defined from missing electron density in X-ray structures. Experimental evidence for long disorder regions (LDRs) of at least 30 residues was so far limited to manually curated proteins. Here, we describe a comprehensive and large-scale analysis of experimental LDRs for 3133 unique proteins, demonstrating an increasing coverage of intrinsic disorder in the Protein Data Bank (PDB) in the last decade. The results suggest that long missing residue regions are a good quality source to annotate intrinsically disordered regions and perform functional analysis in large data sets. The consensus approach used to define LDRs allows to evaluate context dependent disorder and provide a common definition at the protein level.
Asunto(s)
Biología Computacional/métodos , Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Modelos Moleculares , Animales , HumanosRESUMEN
The Critical Assessment of Genome Interpretation-5 intellectual disability challenge asked to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e. ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) have been made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype has been predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Also, causative or potentially pathogenic variants were predicted correctly by at least one group. However, the prediction of the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction has been supported by rare or common variants in genes different from the causative one.
Asunto(s)
Trastorno del Espectro Autista/genética , Biología Computacional/métodos , Discapacidad Intelectual/genética , Análisis de Secuencia de ADN/métodos , Femenino , Predisposición Genética a la Enfermedad , Humanos , Masculino , Fenotipo , Sitios de Carácter CuantitativoRESUMEN
The CAGI-5 pericentriolar material 1 (PCM1) challenge aimed to predict the effect of 38 transgenic human missense mutations in the PCM1 protein implicated in schizophrenia. Participants were provided with 16 benign variants (negative controls), 10 hypomorphic, and 12 loss of function variants. Six groups participated and were asked to predict the probability of effect and standard deviation associated to each mutation. Here, we present the challenge assessment. Prediction performance was evaluated using different measures to conclude in a final ranking which highlights the strengths and weaknesses of each group. The results show a great variety of predictions where some methods performed significantly better than others. Benign variants played an important role as negative controls, highlighting predictors biased to identify disease phenotypes. The best predictor, Bromberg lab, used a neural-network-based method able to discriminate between neutral and non-neutral single nucleotide polymorphisms. The CAGI-5 PCM1 challenge allowed us to evaluate the state of the art techniques for interpreting the effect of novel variants for a difficult target protein.