ABSTRACT
BACKGROUND: Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots. RESULTS: Our findings showcase the capacity of unsupervised learning with protein language models in capturing critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence and structure-based features to predict PPI-hotspots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting. CONCLUSIONS: This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging realm of predicting PPI-hotspots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining why specific features are important for determining certain residue properties remains.
Subject(s)
Machine Learning, Proteins, Humans, Proteins/chemistry, Amino Acid Sequence
ABSTRACT
Proteins form complex biological machineries whose functions in the cell are highly regulated at both the cellular and molecular levels. Cellular regulation of protein functions involves differential gene expression, post-translational modifications, and signaling cascades. Molecular regulation, on the other hand, involves tuning an optimal local protein environment for the functional site. Precisely how a protein achieves such an optimal environment around a given functional site is not well understood. Herein, by surveying the literature, we first summarize the various reported strategies used by certain proteins to ensure their correct functioning. We then formulate three key physicochemical factors for regulating a protein's functional site, namely, (i) its immediate interactions, (ii) its solvent accessibility, and (iii) its conformational flexibility. We illustrate how these factors are applied to regulate the functions of free/metal-bound Cys and Zn sites in proteins.
Subject(s)
Proteins/metabolism, Humans, Protein Conformation, Proteins/chemistry
ABSTRACT
In Zn-proteins, structural Zn-sites are mostly Cys-rich lined by two or more Cys residues, whereas catalytic Zn-sites usually contain His or Asp/Glu residues and a water molecule. Here, we reveal many examples outside this trend with Zn2+ bound to ligands commonly found in both structural and catalytic Zn-sites, namely, Zn-CC(C/H)x (x = D, E, or H2O) sites. We show that these atypical Zn-sites are found in all known life forms (i.e., eukaryotes, bacteria, archaea, and viruses) and can serve structural roles in some proteins but catalytic roles in others. By calculating the physical properties of these atypical Zn-binding sites, we elucidate why Zn-CC(C/H)x sites of the same composition can serve structural and catalytic roles in proteins. Furthermore, we found new sequence/structural motifs characteristic of catalytic Zn-CCHw sites and provide guidelines to predict the structural/catalytic role of atypical Zn-CC(C/H)x sites of unknown function. We discuss how our results could help to design inhibitors targeting catalytic Zn-CC(C/H) H2O sites.
Subject(s)
Molecular Models, Proteins/chemistry, Proteins/metabolism, Zinc/metabolism, Binding Sites, Ligands, Protein Conformation
ABSTRACT
Increasing numbers of protein structures are solved each year, but many of these structures belong to proteins whose sequences are homologous to sequences in the Protein Data Bank. Nevertheless, the structures of homologous proteins belonging to the same family contain useful information because functionally important residues are expected to preserve physico-chemical, structural and energetic features. This information forms the basis of our method, which detects RNA-binding residues of a given RNA-binding protein as those residues that preserve physico-chemical, structural and energetic features in its homologs. Tests on 81 RNA-bound and 35 RNA-free protein structures showed that our method yields a higher fraction of true RNA-binding residues (higher precision) than two structure-based and two sequence-based machine-learning methods. Because the method requires no training data set and has no parameters, its precision does not degrade when applied to 'novel' protein sequences unlike methods that are parameterized for a given training data set. It was used to predict the 'unknown' RNA-binding residues in the C-terminal RNA-binding domain of human CPEB3. The two predicted residues, F430 and F474, were experimentally verified to bind RNA, in particular F430, whose mutation to alanine or asparagine nearly abolished RNA binding. The method has been implemented in a webserver called DR_bind1, which is freely available with no login requirement at http://drbind.limlab.ibms.sinica.edu.tw.
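The abstract above equates precision with the fraction of predicted residues that are true RNA-binding residues. As a minimal sketch of that metric only (not of the DR_bind1 method itself), assuming predictions and annotations are given as sets of residue labels; the function name and the residue labels below are hypothetical placeholders, not data from the study:

```python
def precision(predicted, true_binding):
    """Fraction of predicted residues that are truly binding: TP / (TP + FP)."""
    predicted = set(predicted)
    if not predicted:
        raise ValueError("precision is undefined when nothing is predicted")
    true_positives = len(predicted & set(true_binding))
    return true_positives / len(predicted)


# Hypothetical residue labels, for illustration only.
example_predicted = {"A10", "A12", "A15", "A20"}
example_true = {"A10", "A12", "A30"}
example_precision = precision(example_predicted, example_true)
```

Here two of the four predicted residues are in the annotated set, so the precision is 0.5. Precision ignores missed binding residues (false negatives), which is why a method can score high on this measure while predicting only a few residues.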
Subject(s)
Amino Acids/chemistry, RNA-Binding Proteins/chemistry, Binding Sites, DNA-Binding Proteins/chemistry, Molecular Evolution, Humans, Protein Binding, Protein Conformation, RNA/chemistry, RNA/metabolism, RNA-Binding Proteins/metabolism, Software, Static Electricity
ABSTRACT
Dihedral angles are good descriptors of the numerous conformations visited by large, flexible systems, but their analysis requires directional statistics. A single package including the various multivariate statistical methods for angular data that accounts for the distinct topology of such data does not exist. Here, we present a lightweight standalone, operating-system independent package called Clustangles to fill this gap. Clustangles will be useful in analyzing the ever-increasing number of structures in the Protein Data Bank and clustering the copious conformations from increasingly long molecular dynamics simulations.
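To illustrate why dihedral angles call for directional statistics, the following minimal sketch computes a circular mean and a wrapped angular difference. It is a generic textbook illustration and makes no assumption about how Clustangles is implemented; the function names are hypothetical.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction of angles in degrees, computed on the unit circle."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum))


def angular_difference_deg(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


# Two dihedral angles on either side of the +/-180 degree boundary.
angles = [179.0, -179.0]
naive_mean = sum(angles) / len(angles)      # arithmetic mean: misleading
circ_mean = circular_mean_deg(angles)       # mean direction on the circle
wrapped_diff = angular_difference_deg(179.0, -179.0)
```

The arithmetic mean of 179 and -179 degrees is 0, the opposite side of the circle from both angles, whereas the circular mean lies at the ±180 degree boundary. Likewise, the two angles are only 2 degrees apart once the difference is wrapped, not 358.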
Subject(s)
Proteins/chemistry, Algorithms, Cluster Analysis, Protein Databases, Molecular Dynamics Simulation, Principal Component Analysis, Protein Conformation, Software
ABSTRACT
The GeoPCA package is the first tool developed for multivariate analysis of dihedral angles based on principal component geodesics. Principal component geodesic analysis provides a natural generalization of principal component analysis for data distributed in non-Euclidean space, as in the case of angular data. GeoPCA presents projection of angular data on a sphere composed of the first two principal component geodesics, allowing clustering based on dihedral angles as opposed to Cartesian coordinates. It also provides a measure of the similarity between input structures based on only dihedral angles, in analogy to the root-mean-square deviation of atoms based on Cartesian coordinates. The principal component geodesic approach is shown herein to reproduce clusters of nucleotides observed in an η-θ plot. GeoPCA can be accessed via http://pca.limlab.ibms.sinica.edu.tw.
Subject(s)
Nucleic Acid Conformation, Principal Component Analysis, Software, Cluster Analysis, Multivariate Analysis, Ribonucleotides/chemistry
ABSTRACT
Experimental detection of residues critical for protein-protein interactions (PPI) is a time-consuming, costly, and labor-intensive process. Hence, high-throughput PPI-hot spot prediction methods have been developed, but they have been validated using relatively small datasets, which may compromise their predictive reliability. Here, we introduce PPI-hotspotID, a novel method for identifying PPI-hot spots using the free protein structure, and validate it on the largest collection of experimentally confirmed PPI-hot spots to date. We explored the possibility of detecting PPI-hot spots using (i) FTMap in the PPI mode, which identifies hot spots on protein-protein interfaces from the free protein structure, and (ii) the interface residues predicted by AlphaFold-Multimer. PPI-hotspotID yielded better performance than FTMap and SPOTONE, a webserver for predicting PPI-hot spots given the protein sequence. When combined with the AlphaFold-Multimer-predicted interface residues, PPI-hotspotID yielded better performance than either method alone. Furthermore, we experimentally verified several PPI-hotspotID-predicted PPI-hot spots of eukaryotic elongation factor 2. Notably, PPI-hotspotID can reveal PPI-hot spots not obvious from complex structures, including those in indirect contact with binding partners. PPI-hotspotID serves as a valuable tool for understanding PPI mechanisms and aiding drug design. It is available as a web server (https://ppihotspotid.limlab.dnsalias.org/) and open-source code (https://github.com/wrigjz/ppihotspotid/).
Subject(s)
Protein Interaction Mapping, Protein Interaction Mapping/methods, Protein Conformation, Computational Biology/methods, Proteins/chemistry, Proteins/metabolism, Protein Binding, Software
ABSTRACT
Developing programmable bacterial cell-cell adhesion is of significant interest due to its versatile applications. Current methods that rely on presenting cell adhesion molecules (CAMs) on bacterial surfaces are limited by the lack of a generalizable strategy to identify such molecules targeting bacterial membrane proteins in their natural states. Here, we introduce a whole-cell screening platform designed to discover CAMs targeting bacterial membrane proteins within a synthetic bacteria-displayed nanobody library. Leveraging the potency of the bacterial type IV secretion system, a contact-dependent DNA delivery nanomachine, we have established a positive feedback mechanism to selectively enrich for bacteria displaying nanobodies that target antigen-expressing cells. Our platform successfully identified functional CAMs capable of recognizing three distinct outer membrane proteins (TraN, OmpA, OmpC), demonstrating its efficacy in CAM discovery. This approach holds promise for engineering bacterial cell-cell adhesion, such as directing the antibacterial activity of programmed inhibitor cells toward target bacteria in mixed populations.
Subject(s)
Bacterial Adhesion, Cell Adhesion Molecules, Single-Domain Antibodies, Cell Adhesion Molecules/metabolism, Cell Adhesion Molecules/genetics, Single-Domain Antibodies/metabolism, Bacterial Outer Membrane Proteins/metabolism, Bacterial Outer Membrane Proteins/genetics, Escherichia coli/metabolism, Bacteria/metabolism
ABSTRACT
Alterations in viral fitness cannot be inferred from mutagenesis studies of an isolated viral protein alone. To date, no systematic analysis has been performed to identify mutations that improve virus fitness and reduce drug efficacy. We present a generic strategy to evaluate which viral mutations might diminish drug efficacy and apply it to assess how SARS-CoV-2 evolution may affect the efficacy of currently approved/candidate small-molecule antivirals for Mpro, PLpro, and RdRp. For each drug target, we determined the drug-interacting virus residues from available structures and the selection pressure on the virus residues from the SARS-CoV-2 genomes. This enabled the identification of promising drug target regions and of small-molecule antivirals to which the virus can develop resistance. Our strategy of utilizing sequence and structural information from genomic sequence and protein structure databanks can rapidly assess the fitness of any emerging virus variants and can aid antiviral drug design for future pathogens.
Subject(s)
Antiviral Agents, Viral Drug Resistance, SARS-CoV-2, Humans, Antiviral Agents/pharmacology, COVID-19, Mutation, SARS-CoV-2/drug effects, SARS-CoV-2/genetics, Viral Drug Resistance/genetics
ABSTRACT
Structural 3D motifs in RNA play an important role in RNA stability and function. Previous studies have focused on the characterization and discovery of 3D motifs in RNA secondary and tertiary structures. However, statistical analyses of the distribution of 3D motifs along the RNA appear to be lacking. Herein, we present a novel strategy for evaluating the distribution of 3D motifs along the RNA chain and identifying those motifs whose distributions are significantly non-random. By applying it to the X-ray structure of the large ribosomal subunit from Haloarcula marismortui, helical motifs were found to cluster together along the chain and in the 3D structure, whereas the known tetraloops tend to be sequentially and spatially dispersed. That the distribution of key structural motifs such as tetraloops differs significantly from a random one suggests that our method could also be used to detect novel 3D motifs of any size in sufficiently long/large RNA structures. The motif distribution type can help in the prediction and design of 3D structures of large RNA molecules.
Subject(s)
Ribosomal RNA/chemistry, Algorithms, X-Ray Crystallography, Statistical Data Interpretation, Haloarcula marismortui/genetics, Molecular Models, Nucleic Acid Conformation, Archaeal RNA/chemistry, 23S Ribosomal RNA/chemistry
ABSTRACT
The COVID-19 pandemic poses a challenge in coming up with quick and effective means to counter its cause, SARS-CoV-2. Here, we show how the key factors governing cysteine reactivity in proteins derived from combined quantum mechanical/continuum calculations led to a novel multi-targeting strategy against SARS-CoV-2, in contrast to developing potent drugs/vaccines against a single viral target such as the spike protein. Specifically, they led to the discovery of reactive cysteines in evolutionary conserved Zn2+-sites in several SARS-CoV-2 proteins that are crucial for viral polypeptide proteolysis as well as viral RNA synthesis, proofreading, and modification. These conserved, reactive cysteines, both free and Zn2+-bound, can be targeted using the same Zn-ejector drug (disulfiram/ebselen), which enables the use of broad-spectrum anti-virals that would otherwise be removed by the virus's proofreading mechanism. Our strategy of targeting multiple, conserved viral proteins that operate at different stages of the virus life cycle using a Zn-ejector drug combined with other broad-spectrum anti-viral drug(s) could enhance the barrier to drug resistance and antiviral effects, as compared to each drug alone. Since these functionally important nonstructural proteins containing reactive cysteines are highly conserved among coronaviruses, our proposed strategy has the potential to tackle future coronaviruses. This article is categorized under: Structure and Mechanism > Reaction Mechanisms and Catalysis; Structure and Mechanism > Computational Biochemistry and Biophysics; Electronic Structure Theory > Density Functional Theory.
ABSTRACT
The SARS-CoV-2 replication and transcription complex (RTC) comprising nonstructural protein (nsp) 2-16 plays crucial roles in viral replication, reducing the efficacy of broad-spectrum nucleoside analog drugs such as remdesivir and evading innate immune responses. Most studies target a specific viral component of the RTC such as the main protease or the RNA-dependent RNA polymerase. In contrast, our strategy is to target multiple conserved domains of the RTC to prevent SARS-CoV-2 genome replication and to create a high barrier to viral resistance and/or evasion of antiviral drugs. We show that the clinically safe Zn-ejector drugs disulfiram and ebselen can target conserved Zn2+ sites in SARS-CoV-2 nsp13 and nsp14 and inhibit nsp13 ATPase and nsp14 exoribonuclease activities. As the SARS-CoV-2 nsp14 domain targeted by disulfiram/ebselen is involved in RNA fidelity control, our strategy allows coupling of the Zn-ejector drug with a broad-spectrum nucleoside analog that would otherwise be excised by the nsp14 proofreading domain. As proof-of-concept, we show that disulfiram/ebselen, when combined with remdesivir, can synergistically inhibit SARS-CoV-2 replication in Vero E6 cells. We present a mechanism of action and the advantages of our multitargeting strategy, which can be applied to any type of coronavirus with conserved Zn2+ sites.
ABSTRACT
[This corrects the article DOI: 10.1021/acsptsci.1c00022.].
ABSTRACT
[This corrects the article DOI: 10.1039/D0SC02646H.].
ABSTRACT
We present a near-term treatment strategy to tackle pandemic outbreaks of coronaviruses with no specific drugs/vaccines by combining evolutionary and physical principles to identify conserved viral domains containing druggable Zn-sites that can be targeted by clinically safe Zn-ejecting compounds. By applying this strategy to SARS-CoV-2 polyprotein-1ab, we predicted multiple labile Zn-sites in papain-like cysteine protease (PLpro), nsp10 transcription factor, and nsp13 helicase. These are attractive drug targets because they are highly conserved among coronaviruses and play vital structural/catalytic roles in viral proteins indispensable for virus replication. We show that five Zn-ejectors can release Zn2+ from PLpro and nsp10, and clinically-safe disulfiram and ebselen can not only covalently bind to the Zn-bound cysteines in both proteins, but also inhibit PLpro protease. We propose combining disulfiram/ebselen with broad-spectrum antivirals/drugs to target different conserved domains acting at various stages of the virus life cycle to synergistically inhibit SARS-CoV-2 replication and reduce the emergence of drug resistance.
ABSTRACT
The root-mean-square deviation (RMSD) is a similarity measure widely used in the analysis of macromolecular structures and dynamics. As increasingly larger macromolecular systems are being studied, dimensionality effects such as the "curse of dimensionality" (a diminishing ability to discriminate pairwise differences between conformations with increasing system size) may exist and significantly impact RMSD-based analyses. For such large biomolecular systems, whether the RMSD or other alternative similarity measures might suffer from this "curse" and lose the ability to discriminate different macromolecular structures had not been explicitly addressed. Here, we show such dimensionality effects for both weighted and nonweighted RMSD schemes. We also provide a mechanism for the emergence of the "curse of dimensionality" for RMSD from the law of large numbers by showing that the conformational distributions from which RMSDs are calculated become increasingly similar as the system size increases. Our findings suggest the use of weighted RMSD schemes for small proteins (less than 200 residues) and nonweighted RMSD for larger proteins when analyzing molecular dynamics trajectories.
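For reference, the nonweighted and weighted RMSD mentioned above can be sketched as follows. This is a minimal illustration that assumes the two conformations are already superimposed and have matching atom order; the abstract does not specify the weighting scheme used in the study, so the weights here are arbitrary per-atom values supplied by the caller, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b, weights=None):
    """RMSD between two pre-superimposed conformations.

    coords_a, coords_b: equal-length sequences of (x, y, z) tuples.
    weights: optional per-atom weights; uniform weights give the
    usual nonweighted RMSD.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("conformations must have the same number of atoms")
    if weights is None:
        weights = [1.0] * len(coords_a)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sq = sum(
        w * sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
        for w, atom_a, atom_b in zip(weights, coords_a, coords_b)
    )
    return math.sqrt(weighted_sq / total_weight)


# Toy two-atom example (hypothetical coordinates, arbitrary units).
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
plain = rmsd(conf_a, conf_b)                       # sqrt(4 / 2)
downweighted = rmsd(conf_a, conf_b, [1.0, 0.0])    # mobile atom ignored
```

Down-weighting the displaced atom removes its contribution entirely in this toy case, which shows how a weighting scheme changes which parts of a structure dominate the similarity measure.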
Subject(s)
Molecular Dynamics Simulation, Proteins/chemistry, Macromolecular Substances/chemistry, Molecular Weight, Protein Conformation
ABSTRACT
The hydrogen-bonding interactions of cysteine, which can serve as a hydrogen-bond donor and/or acceptor, play a central role in cysteine's diverse functional roles in proteins. They affect the balance between the neutral thiol (SH) and thiolate (S-) forms and the charge distribution in the rate-limiting transition state of a reaction. Despite their importance, no study has determined the preferred hydrogen-bonding partners of cysteine serving as a hydrogen-bond donor or acceptor. By computing the free energy for displacing a peptide backbone hydrogen-bonded to cysteine with amino acid side chains in various protein environments, we have evaluated how the strength of the hydrogen bond to the cysteine thiol/thiolate depends on its hydrogen-bonding partner and its local environment. The predicted hydrogen-bonding partners preferred by cysteine are consistent with the hydrogen-bonding interactions made by cysteines in 9138 nonredundant X-ray structures. Our results suggest a mechanism to regulate the reactivity of cysteines and a strategy to design drugs based on the hydrogen-bonding preference of cysteine.
Subject(s)
Cysteine/metabolism, Cysteine/chemistry, Hydrogen Bonding, Molecular Models, Thermodynamics
ABSTRACT
Many enzymes use nicotinamide adenine dinucleotide or nicotinamide adenine dinucleotide phosphate (NAD(P)) as essential coenzymes. These enzymes often do not share significant sequence identity and cannot be easily detected by sequence homology. Previously, we determined all distinct locally conserved pyrophosphate-binding structures (3d motifs) from NAD(P)-bound protein structures, from which 1d sequence motifs were derived. Here, we aim to establish the precision of these 3d and 1d motifs in annotating NAD(P)-binding proteins. We show that the pyrophosphate-binding 3d motifs are characteristic of NAD(P)-binding proteins, as they are rarely found in non-NAD(P)-binding proteins. Furthermore, several 1d motifs could distinguish between proteins that bind only NAD and those that bind only NADP. They could also distinguish NAD(P)-binding proteins from non-NAD(P)-binding ones. Interestingly, one of the pyrophosphate-binding 3d motifs and its corresponding 1d motif were found only in enoyl-acyl carrier protein reductases, which are enzymes essential for bacterial fatty acid biosynthesis. This unique 3d motif serves as an attractive novel drug target, as it is conserved across many bacterial species and is not found in human proteins.
Subject(s)
Amino Acid Motifs/genetics, Carrier Proteins/genetics, NADP/metabolism, NAD/metabolism, Amino Acid Motifs/drug effects, Anti-Bacterial Agents/chemistry, Anti-Bacterial Agents/therapeutic use, Bacteria/drug effects, Bacteria/enzymology, Drug Delivery Systems, Enoyl-ACP Reductase (NADH)/genetics, Humans, NAD/genetics, NADP/genetics, Signal Transduction/genetics
ABSTRACT
In this work, we have (i) evaluated the ability of the EMAP method implemented in the CHARMM program to generate the correct conformation of Ab/Ag complex structures and (ii) developed a support vector machine (SVM) classifier to detect native conformations among the thousands of refined Ab/Ag configurations using the individual components of the binding free energy based on a thermodynamic cycle as input features in training the SVM. Tests on 24 Ab/Ag complexes from the protein-protein docking benchmark version 3.0 showed that based on CAPRI evaluation criteria, EMAP could generate medium-quality native conformations in each case. Furthermore, the SVM classifier could rank medium/high-quality native conformations mostly in the top six among the thousands of refined Ab/Ag configurations. Thus, Ab-Ag docking can be performed using different levels of protein representations, from grid-based (EMAP) to polar hydrogen (united-atom) to all-atom representation within the same program. The scripts used and the trained SVM are available at the www.charmm.org forum script repository.