ABSTRACT
MOTIVATION: In recent years, high-throughput sequencing technologies have made available the genome sequences of a huge variety of organisms. However, the functional annotation of the encoded proteins often still relies on low-throughput and costly experimental studies. Bioinformatics approaches offer a promising alternative to accelerate this process. In this work, we focus on the binding of zinc(II) ions, which is needed for 5%-10% of any organism's proteins to achieve their physiologically relevant form. RESULTS: To implement a predictor of zinc(II)-binding sites in the 3D structures of proteins, we used a neural network, followed by a filter of the network output against the local structure of all known sites. The latter was implemented as a function comparing the distance matrices of the Cα and Cβ atoms of the sites. We called the resulting tool Master of Metals (MOM). The structural models for the entire proteome of an organism generated by AlphaFold can be used as input to our tool in order to achieve annotation at the whole organism level within a few hours. To demonstrate this, we applied MOM to the yeast proteome, obtaining a precision of about 76%, based on data for homologous proteins. AVAILABILITY AND IMPLEMENTATION: Master of Metals has been implemented in Python and is available at https://github.com/cerm-cirmmp/Master-of-metals.
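The structural filter described above compares distance matrices of site atoms. The following is a minimal sketch of that idea, not the MOM implementation: the function names, the root-mean-square score, the 1.0 Å threshold, and the assumption that the atoms of the two sites are listed in corresponding order are all hypothetical choices made for illustration.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def matrix_rmsd(m1, m2):
    """Root-mean-square difference over the upper triangles of two
    equally sized distance matrices (hypothetical similarity score)."""
    n = len(m1)
    if n != len(m2):
        raise ValueError("sites must contain the same number of atoms")
    diffs = [(m1[i][j] - m2[i][j]) ** 2
             for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else 0.0


def passes_filter(candidate, known_sites, threshold=1.0):
    """True if the candidate resembles at least one known site.

    candidate and each known site are lists of (x, y, z) coordinates of
    the C-alpha and C-beta atoms, assumed to be in corresponding order.
    The threshold (in angstroms) is an assumed value, not taken from MOM.
    """
    cand = distance_matrix(candidate)
    for site in known_sites:
        if len(site) != len(candidate):
            continue  # only compare sites with the same number of atoms
        if matrix_rmsd(cand, distance_matrix(site)) <= threshold:
            return True
    return False


# Toy coordinates: the known site is the candidate shifted rigidly,
# so the two distance matrices are identical up to rounding.
candidate = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0)]
known = [(10.0, 10.0, 10.0), (13.8, 10.0, 10.0), (10.0, 13.8, 10.0)]
unrelated = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 6.0, 0.0)]
```

Because distance matrices are invariant to rotation and translation, a comparison of this kind needs no prior structural superposition of the two sites.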
Subject(s)
Software, Zinc, Binding Sites, Proteome
ABSTRACT
A major issue in the application of deep learning is the definition of a proper architecture for the learning machine at hand, in such a way that the model is neither excessively large (which results in overfitting the training data) nor too small (which limits the learning and modeling capabilities of the automatic learner). Facing this issue boosted the development of algorithms for automatically growing and pruning the architectures as part of the learning process. The paper introduces a novel approach to growing the architecture of deep neural networks, called downward-growing neural network (DGNN). The approach can be applied to arbitrary feed-forward deep neural networks. Groups of neurons that negatively affect the performance of the network are selected and grown with the aim of improving the learning and generalization capabilities of the resulting machine. The growing process is realized via replacement of these groups of neurons with sub-networks that are trained relying on ad hoc target propagation techniques. In so doing, the growth process takes place simultaneously in both the depth and width of the DGNN architecture. We assess empirically the effectiveness of the DGNN on several UCI datasets, where the DGNN significantly improves the average accuracy over a range of established deep neural network approaches and over two popular growing algorithms, namely, the AdaNet and the cascade correlation neural network.
ABSTRACT
Thirty-eight percent of protein structures in the Protein Data Bank contain at least one metal ion. However, not all these metal sites are biologically relevant. Cations present as impurities during sample preparation or in the crystallization buffer can cause the formation of protein-metal complexes that do not exist in vivo. We implemented a deep learning approach to build a classifier able to distinguish between physiological and adventitious zinc-binding sites in the 3D structures of metalloproteins. We trained the classifier using manually annotated sites extracted from the MetalPDB database. Using a 10-fold cross-validation procedure, the classifier achieved an accuracy of about 90%. The same neural classifier could predict the physiological relevance of non-heme mononuclear iron sites with an accuracy of nearly 80%, suggesting that the rules learned on zinc sites have general relevance. By quantifying the relative importance of the features describing the input zinc sites from the network perspective and by analyzing the characteristics of the MetalPDB datasets, we inferred some common principles. Physiological sites present a low solvent accessibility of the amino acids forming coordination bonds with the metal ion (the metal ligands), a relatively large number of residues in the metal environment (≥20), and a distinct pattern of conservation of Cys and His residues in the site. Adventitious sites, on the other hand, tend to have a low number of donor atoms from the polypeptide chain (often one or two). These observations support the evaluation of the physiological relevance of novel metal-binding sites in protein structures.
Subject(s)
Metalloproteins, Binding Sites, Protein Databases, Metalloproteins/metabolism, Metals/chemistry, Computer Neural Networks, Zinc/metabolism
ABSTRACT
Nuclear magnetic resonance (NMR) is an effective, commonly used experimental approach to screen small organic molecules against a protein target. A very popular method consists of monitoring the changes of the NMR chemical shifts of the protein nuclei upon addition of the small molecule to the free protein. Multidimensional NMR experiments allow the interacting residues to be mapped along the protein sequence. A significant amount of human effort goes into manually tracking the chemical shift variations, especially when many signals exhibit chemical shift changes and when many ligands are tested. Some computational approaches to automate the procedure are available, but none of them is offered as a web server. Furthermore, some methods require the adoption of a fairly specific experimental setup, such as recording a series of spectra at increasing small molecule:protein ratios. In this work, we developed a tool that requires a minimal amount of experimental data from the user, implemented it as an open-source program, and made it available as a web application. Our tool compares two spectra, one of the free protein and one of the small molecule:protein mixture, based on the corresponding peak lists. The performance of the tool in terms of correct identification of the protein-binding regions has been evaluated on different protein targets, using experimental data from interaction studies already available in the literature. For a total of 16 systems, our tool achieved between 79% and 100% correct assignments, properly identifying the protein regions involved in the interaction.
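The comparison of two peak lists can be illustrated with a minimal sketch. It is not the algorithm of the tool described above, whose details the abstract does not give. It assumes 1H/15N peaks, an assigned free-protein peak list, an unassigned mixture peak list, nearest-neighbour matching, a 15N weighting factor of 0.14, and a cutoff of mean plus one standard deviation; all names and values are hypothetical.

```python
import math

N_SCALE = 0.14  # assumed weight of 15N shift differences relative to 1H


def scaled_distance(p, q, n_scale=N_SCALE):
    """Combined 1H/15N distance between two peaks given as (h_ppm, n_ppm)."""
    return math.hypot(p[0] - q[0], n_scale * (p[1] - q[1]))


def perturbations(free_peaks, mixture_peaks, n_scale=N_SCALE):
    """Map each residue label to the distance to its nearest mixture peak.

    free_peaks: dict of residue label -> (h_ppm, n_ppm), assigned peaks
    mixture_peaks: list of (h_ppm, n_ppm), unassigned peaks
    Nearest-neighbour matching is a simplifying assumption.
    """
    return {
        res: min(scaled_distance(peak, m, n_scale) for m in mixture_peaks)
        for res, peak in free_peaks.items()
    }


def perturbed_residues(csp, n_sigma=1.0):
    """Residues whose perturbation exceeds mean + n_sigma * std. deviation."""
    values = list(csp.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    cutoff = mean + n_sigma * sd
    return sorted(res for res, v in csp.items() if v > cutoff)


# Toy peak lists: only the peak of L12 moves upon addition of the ligand.
free = {
    "A10": (8.00, 120.0),
    "G11": (8.50, 110.0),
    "L12": (7.50, 125.0),
    "K13": (9.00, 118.0),
}
mixture = [(8.00, 120.0), (8.50, 110.0), (7.62, 125.5), (9.00, 118.0)]

csp = perturbations(free, mixture)
hits = perturbed_residues(csp)
```

Nearest-neighbour matching can mis-pair peaks in crowded spectral regions, so a real tool would need a more careful assignment step than this sketch provides.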
Subject(s)
Algorithms, Proteins, Amino Acid Sequence, Humans, Ligands, Magnetic Resonance Spectroscopy/methods, Biomolecular Nuclear Magnetic Resonance/methods, Proteins/chemistry
ABSTRACT
Metalloproteins are ubiquitous in all kingdoms of life. Their role and function are tightly related to the local structure of the metal-binding site. In this regard, the MetalPDB database is an invaluable tool since it stores the 3D structure of metal-binding sites and of their corresponding apo forms. In this work, we exploited MetalPDB to compute extensive statistics over >3000 clusters of mononuclear sites about the rearrangements occurring upon change in metalation state. For each cluster, we matched the holo and apo sites so that it was possible to average the distances between all possible pairs of Cα and donor atoms and thus quantitatively assess structural variations by computing the Δ values (mean apo distance - mean holo distance). For most of the structures, the backbone is rigid with little to no rearrangement, while donor atoms experience significant changes in their relative positions when the metal is removed. Sodium and potassium sites are an exception to this general observation. This is most likely caused by their preference for coordination by the main-chain oxygen atoms, making the rearrangement of donor atoms superimposable on that of the backbone. Magnesium and calcium show a different behavior, despite their chemical similarity: calcium sites undergo a larger reorganization upon metalation although both metals have a similar percentage of backbone oxygen atoms as donors. We ascribe this observation to the structural and energetic factors regulating the selectivity for calcium over magnesium.
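The Δ value defined above (mean apo distance minus mean holo distance) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: it assumes that each site is a list of (x, y, z) coordinates for one atom type (Cα or donor atoms), that distances are first averaged over all atom pairs within a site and then over the sites of a cluster, and that all function names are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(coords):
    """Average Euclidean distance over all pairs of atoms in one site."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("a site needs at least two atoms")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def cluster_mean(sites):
    """Average of the per-site mean pairwise distances in a cluster."""
    return sum(mean_pairwise_distance(s) for s in sites) / len(sites)


def delta(apo_sites, holo_sites):
    """Mean apo distance minus mean holo distance for one cluster.

    A positive value means the atoms are, on average, farther apart
    in the apo (metal-free) form than in the holo (metal-bound) form.
    """
    return cluster_mean(apo_sites) - cluster_mean(holo_sites)


# Toy cluster with one holo and one apo site of three donor atoms each.
holo = [[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
apo = [[(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]]

d = delta(apo, holo)
```

Computing Δ separately for Cα atoms and for donor atoms would allow backbone and side-chain rearrangements to be compared, in the spirit of the comparison reported above.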