ABSTRACT
Ribonucleic acids (RNAs) play important roles in cellular regulation. Consequently, dysregulation of both coding and non-coding RNAs has been implicated in several human diseases. Interest is therefore growing in the potential of RNAs as drug targets. To accelerate the search for novel disease-associated RNA targets and their small-molecule inhibitors, we developed machine learning models for binding affinity prediction specific to six RNA subtypes, namely aptamers, miRNAs, repeats, ribosomal RNAs, riboswitches and viral RNAs. We found that differences in RNA sequence composition, flexibility and the polar nature of RNA-binding ligands are important for predicting the binding affinity. Our models showed an average Pearson correlation (r) of 0.83 and a mean absolute error of 0.66 upon evaluation using the jack-knife test, indicating their reliability despite the small amount of data available for several RNA subtypes. Further, the models were validated on external blind test datasets, where they outperformed other existing quantitative structure-activity relationship (QSAR) models. We have developed a web server to host the models, RNA-Small molecule binding Affinity Predictor, which is freely available at: https://web.iitm.ac.in/bioinfo2/RSAPred/.
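As an illustration of the evaluation described in this abstract, the sketch below runs a jack-knife (leave-one-out) test and reports Pearson r and mean absolute error. It is a minimal sketch only: the single-descriptor linear model, the function names and the toy data are assumptions made for the example and are not the authors' models or data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_absolute_error(ys, preds):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def jackknife_predictions(xs, ys):
    """Predict each sample with a model trained on all other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds

# Hypothetical toy data: one descriptor per complex and an affinity value.
descriptor = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
affinity = [4.2, 4.9, 5.1, 6.0, 6.6, 7.5]

preds = jackknife_predictions(descriptor, affinity)
r = pearson_r(affinity, preds)
mae = mean_absolute_error(affinity, preds)
print(f"jack-knife r = {r:.2f}, MAE = {mae:.2f}")
```

Because every prediction comes from a model that never saw the held-out sample, the test remains usable when only a few data points are available for an RNA subtype.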
Subjects
MicroRNAs, Humans, Reproducibility of Results, Cell Cycle, Machine Learning, Quantitative Structure-Activity Relationship
ABSTRACT
Binding affinity prediction largely determines the discovery efficiency of lead compounds in drug discovery. Recently, machine learning (ML)-based approaches have attracted much attention in hopes of enhancing the predictive performance of traditional physics-based approaches. In this study, we evaluated the impact of structural dynamic information on the binding affinity prediction by comparing the models trained on different dimensional descriptors, using three targets (i.e. JAK1, TAF1-BD2 and DDR1) and their corresponding ligands as the examples. Here, 2D descriptors are traditional ECFP4 fingerprints, 3D descriptors are the energy terms of the Smina and NNscore scoring functions and 4D descriptors contain the structural dynamic information derived from the trajectories based on molecular dynamics (MD) simulations. We systematically investigate the MD-refined binding affinity prediction performance of three classical ML algorithms (i.e. RF, SVR and XGB) as well as two common virtual screening methods, namely Glide docking and MM/PBSA. The outcomes of the ML models built using various dimensional descriptors and their combinations reveal that the MD refinement with the optimized protocol can improve the predictive performance on the TAF1-BD2 target with considerable structural flexibility, but not for the less flexible JAK1 and DDR1 targets, when taking docking poses as the initial structure instead of the crystal structures. The results highlight the importance of the initial structures to the final performance of the model through conformational analysis on the three targets with different flexibility.
Subjects
Molecular Dynamics Simulation, Proteins, Ligands, Proteins/chemistry, Protein Binding, Machine Learning, Molecular Docking Simulation
ABSTRACT
Due to the lack of a method to efficiently represent the multimodal information of a protein, including its structure and sequence information, predicting compound-protein binding affinity (CPA) with machine-learning methods still suffers from low accuracy. To overcome this limitation, we develop, in a novel end-to-end architecture named FeatNN, a coevolutionary strategy to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from a data-driven perspective, we propose a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug evaluation tasks, indicating the feasibility of this approach for practical use. By efficiently representing multimodal information of proteins via a coevolutionary strategy, FeatNN provides higher CPA prediction accuracy and better generalization ability.
Subjects
Machine Learning, Proteins, Protein Binding, Proteins/chemistry, Theoretical Models
ABSTRACT
The prediction of interactions between novel drugs and biological targets is a vital step in the early stage of the drug discovery pipeline. Many deep learning approaches have been proposed over the last decade, with a substantial fraction of them sharing the same underlying two-branch architecture. Their distinction is limited to the use of different types of feature representations and branches (multi-layer perceptrons, convolutional neural networks, graph neural networks and transformers). In contrast, the strategy used to combine the outputs (embeddings) of the branches has remained mostly the same. The same general architecture has also been used extensively in the area of recommender systems, where the choice of an aggregation strategy is still an open question. In this work, we investigate the effectiveness of three different embedding aggregation strategies in the area of drug-target interaction (DTI) prediction. We formally define these strategies and prove their universal approximator capabilities. We then present experiments that compare the different strategies on benchmark datasets from the area of DTI prediction, showcasing conditions under which specific strategies could be the obvious choice.
Subjects
Benchmarking, Drug Discovery, Electric Power Supplies, Computer Neural Networks
ABSTRACT
Since the start of the COVID-19 pandemic, a huge effort has been devoted to understanding the Spike (SARS-CoV-2)-ACE2 recognition mechanism. To this end, two deep mutational scanning studies traced the impact of all possible mutations across the receptor binding domain (RBD) of Spike and the catalytic domain of human ACE2. By concentrating on the interface mutations in these experimental data, we benchmarked six commonly used structure-based binding affinity predictors (FoldX, EvoEF1, MutaBind2, SSIPe, HADDOCK, and UEP). These predictors were selected based on their user-friendliness, accessibility, and speed. In our benchmark, none of the methods generated a meaningful correlation with the experimental binding data. The best correlation was achieved by FoldX (R = -0.51). When we simplified the prediction problem to a binary classification, that is, whether a mutation enriches or depletes binding, the highest accuracy was achieved by FoldX with a 64% success rate. Surprisingly, on this set, simple energetic scoring functions performed significantly better than the ones using extra evolutionary-based terms, as in MutaBind2 and SSIPe. Furthermore, we demonstrated that recent AI approaches, mmCSM-PPI and TopNetTree, yielded performance comparable to that of the force field-based techniques. These observations suggest there is plenty of room to improve binding affinity predictors in estimating the variant-induced binding profile changes of a host-pathogen system such as Spike-ACE2. To aid such improvements, we provide our benchmarking data at https://github.com/CSB-KaracaLab/RBD-ACE2-MutBench with the option to visualize our mutant models at https://rbd-ace2-mutbench.github.io/.
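The binary simplification mentioned in this abstract can be sketched as follows. This is a hedged illustration only: the sign convention (a negative predicted binding energy change counts as "enriching"), the zero threshold, the function names and the toy values are assumptions made for the example, since the abstract does not state how labels were derived.

```python
def classify_by_sign(predicted_ddg):
    """Label each mutation from the sign of its predicted energy change.

    Assumption: a negative value (kcal/mol) means tighter binding.
    """
    return ["enriching" if d < 0 else "depleting" for d in predicted_ddg]

def accuracy(predicted_labels, true_labels):
    """Fraction of mutations whose predicted label matches the experiment."""
    hits = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return hits / len(true_labels)

# Hypothetical predictor output and experimental labels for five mutations.
predicted_ddg = [-0.8, 0.3, 1.2, -0.1, 0.6]
experimental = ["enriching", "depleting", "depleting", "depleting", "enriching"]

labels = classify_by_sign(predicted_ddg)
score = accuracy(labels, experimental)
print(f"success rate = {score:.0%}")
```

Under this convention a predictor whose energies fall as binding improves would show a negative correlation with experimental enrichment scores, which is one way to read a negative R value.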
Subjects
Angiotensin-Converting Enzyme 2, Benchmarking, Humans, Pandemics, Mutation, Biological Evolution, Protein Binding
ABSTRACT
In multiple instance learning (MIL), a bag represents a sample that has a set of instances, each of which is described by a vector of explanatory variables, but the entire bag only has one label/response. Though many methods for MIL have been developed to date, few have paid attention to interpretability of models and results. The proposed Bayesian regression model stands on two levels of hierarchy, which transparently show how explanatory variables explain and instances contribute to bag responses. Moreover, two selection problems are simultaneously addressed; the instance selection to find out the instances in each bag responsible for the bag response, and the variable selection to search for the important covariates. To explore a joint discrete space of indicator variables created for selection of both explanatory variables and instances, the shotgun stochastic search algorithm is modified to fit in the MIL context. Also, the proposed model offers a natural and rigorous way to quantify uncertainty in coefficient estimation and outcome prediction, which many modern MIL applications call for. The simulation study shows the proposed regression model can select variables and instances with high performance (AUC greater than 0.86), thus predicting responses well. The proposed method is applied to the musk data for prediction of binding strengths (labels) between molecules (bags) with different conformations (instances) and target receptors. It outperforms all existing methods, and can identify variables relevant in modeling responses.
ABSTRACT
Recently, we identified a novel mechanism of enzyme inhibition in N-myristoyltransferases (NMTs), which we have named 'inhibitor trapping'. Inhibitor trapping occurs when the protein captures the small molecule within its structural confines, thereby preventing its free dissociation and resulting in a dramatic increase in inhibitor affinity and potency. Here, we demonstrate that inhibitor trapping also occurs in kinases. Remarkably, the drug imatinib, which has revolutionized targeted cancer therapy, is entrapped in the structure of the Abl kinase. This effect is also observed in p38α kinase, where inhibitor trapping was found to depend on a 'magic' methyl group, which stabilizes the protein conformation and increases the affinity of the compound dramatically. Altogether, these results suggest that inhibitor trapping is not exclusive to N-myristoyltransferases, as it also occurs in the kinase family. Inhibitor trapping could enhance the binding affinity of an inhibitor by thousands of times and is a key mechanism in determining drug affinity and potency.
Subjects
Piperazines, Pyrimidines, Pyrimidines/pharmacology, Piperazines/pharmacology, Benzamides/pharmacology, Imatinib Mesylate/pharmacology, bcr-abl Fusion Proteins/metabolism, src-Family Kinases/metabolism, Protein Kinase Inhibitors/pharmacology, Protein Kinase Inhibitors/therapeutic use
ABSTRACT
Structure-based virtual screenings (SBVSs) play an important role in drug discovery projects. However, it is still a challenge to accurately predict the binding affinity of an arbitrary molecule binding to a drug target and to prioritize top ligands from an SBVS. In this study, we developed a novel method that uses ligand-residue interaction profiles (IPs) to construct machine learning (ML)-based prediction models and significantly improves the screening performance in SBVSs. Such a prediction model is called an IP scoring function (IP-SF). We systematically investigated how to improve the performance of IP-SFs from many perspectives, including the sampling methods used before interaction energy calculation and different ML algorithms. Using six drug targets, each with hundreds of known ligands, we conducted a critical evaluation of the developed IP-SFs. The IP-SFs employing a gradient boosting decision tree (GBDT) algorithm in conjunction with the MIN + GB simulation protocol achieved the best overall performance. Their scoring power, ranking power and screening power significantly outperformed those of the Glide SF. First, compared with Glide, the average values of mean absolute error and root mean square error of GBDT/MIN + GB decreased by about 38% and 36%, respectively. Second, the mean values of the squared correlation coefficient and predictive index increased by about 225% and 73%, respectively. Third, and more encouragingly, the average area under the receiver operating characteristic curve for the six targets was 0.87 for GBDT, significantly better than the 0.71 obtained with Glide. Thus, we expect IP-SFs to have broad and promising applications in SBVSs.
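The screening-power metric quoted in this abstract, the area under the receiver operating characteristic curve, can be computed from ranked scores as in the sketch below. It is a minimal illustration, assuming that a higher score means a more confidently predicted binder; the function name and the toy scores are hypothetical and not taken from the study.

```python
def roc_auc(scores, is_active):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen active compound is
    scored above a randomly chosen inactive one (ties count as half).
    """
    actives = [s for s, a in zip(scores, is_active) if a]
    inactives = [s for s, a in zip(scores, is_active) if not a]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for pos in actives:
        for neg in inactives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(actives) * len(inactives))

# Hypothetical screening result: six compounds, three known actives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_active = [True, True, False, True, False, False]

auc = roc_auc(scores, is_active)
print(f"AUC = {auc:.2f}")
```

Docking scores are often reported so that more negative values are better; under that convention the scores would need to be negated before being passed to this function.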
Subjects
Deep Learning, Drug Discovery/methods, Molecular Docking Simulation/methods, Protein Kinases/metabolism, G-Protein-Coupled Receptors/metabolism, Algorithms, Crystallization, Protein Databases, Preclinical Drug Evaluation/methods, Humans, Ligands, Molecular Structure, Protein Binding, Protein Kinases/chemistry, G-Protein-Coupled Receptors/chemistry
ABSTRACT
Drug discovery, especially virtual screening and drug repositioning, can be accelerated through deeper understanding and prediction of Drug Target Interactions (DTIs). The advancement of deep learning, as well as the time and financial costs associated with conventional wet-lab experiments, has made computational methods for DTI prediction more popular. However, the majority of these computational methods handle the DTI problem as a binary classification task, ignoring the quantitative binding affinity that determines the efficacy of drugs against their target proteins. Moreover, the memory footprint and execution time of a model are often neglected in favor of accuracy. To address these challenges, we introduce a novel method, called Time-efficient Multimodal Drug Target Binding Affinity (TeM-DTBA), which predicts the binding affinity between drugs and targets by fusing different modalities based on compound structures and target sequences. We employ the Lasso feature selection method, which lowers the dimensionality of feature vectors and reduces the training time of the proposed model by more than 50%. The results from two benchmark datasets demonstrate that our method outperforms state-of-the-art methods. The mean squared errors of 18.8% and 23.19%, achieved on the KIBA and Davis datasets, respectively, suggest that our method is more accurate in predicting drug-target binding affinity.
Subjects
Drug Development, Proteins, Proteins/chemistry, Drug Discovery/methods, Drug Repositioning, Drug Delivery Systems
ABSTRACT
The accurate prediction of drug-target binding affinity (DTA) is an essential step in drug discovery and drug repositioning. Although deep learning methods have been widely adopted for DTA prediction, the complexity of extracting drug and target protein features hampers the accuracy of these predictions. In this study, we propose a novel model for DTA prediction named MSGNN-DTA, which leverages a fused multi-scale topological feature approach based on graph neural networks (GNNs). To address the challenge of accurately extracting drug and target protein features, we introduce a gated skip-connection mechanism during the feature learning process to fuse multi-scale topological features, resulting in information-rich representations of drugs and proteins. Our approach constructs drug atom graphs, motif graphs, and weighted protein graphs to fully extract topological information and provide a comprehensive understanding of underlying molecular interactions from multiple perspectives. Experimental results on two benchmark datasets demonstrate that MSGNN-DTA outperforms the state-of-the-art models in all evaluation metrics, showcasing the effectiveness of the proposed approach. Moreover, the study conducts a case study based on already FDA-approved drugs in the DrugBank dataset to highlight the potential of the MSGNN-DTA framework in identifying drug candidates for specific targets, which could accelerate the process of virtual screening and drug repositioning.
Subjects
Drug Discovery, Drug Repositioning, Benchmarking, Drug Delivery Systems, Computer Neural Networks
ABSTRACT
Deep learning is revolutionizing structural biology to an unprecedented extent. Spearheaded by DeepMind's AlphaFold2, structural models of high quality can be generated, and are now available for most known proteins and many protein interactions. The next challenge will be to leverage this rich structural corpus to learn about binding: which protein can contact which partner(s), and at what affinity? In a recent study, Chang and Perez presented an elegant approach towards this challenging goal for interactions that involve a short peptide binding to its receptor. The basic idea is straightforward: given a receptor that binds to two peptides, if the receptor sequence is presented with both peptides at the same time, AlphaFold2 should model the tighter-binding peptide into the binding site, while excluding the second. A simple idea that works!
Subjects
Peptides, Proteins, Protein Binding, Proteins/chemistry, Peptides/chemistry, Binding Sites, Protein Domains
ABSTRACT
Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these systems has advanced to some degree, depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity do have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies could be comparable to or even larger than those observed among different deep learning approaches. In particular, the presence of proteins in the training data leads to a dramatic increase in prediction accuracy. This implies that continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
Subjects
Deep Learning, Data Accuracy, Ligands, Machine Learning, Protein Binding, Proteins/metabolism
ABSTRACT
SARS-CoV-2 infection elicits a polyclonal neutralizing antibody (nAb) response that primarily targets the spike protein, but it is still unclear which nAbs are immunodominant and what distinguishes them from subdominant nAbs. This information would however be crucial to predict the evolutionary trajectory of the virus and design future vaccines. To shed light on this issue, we gathered 83 structures of nAbs in complex with spike protein domains. We analyzed in silico the ability of these nAbs to bind the full spike protein trimer in open and closed conformations, and predicted the change in binding affinity of the most frequently observed spike protein variants in the circulating strains. This led us to define four nAb classes with distinct variant escape fractions. By comparing these fractions with those measured from plasma of infected patients, we showed that the class of nAbs that most contributes to the immune response is able to bind the spike protein in its closed conformation. Although this class of nAbs only partially inhibits the spike protein binding to the host's angiotensin converting enzyme 2 (ACE2), it has been suggested to lock the closed pre-fusion spike protein conformation and therefore prevent its transition to an open state. Furthermore, comparison of our predictions with mRNA-1273 vaccinated patient plasma measurements suggests that spike proteins contained in vaccines elicit a different nAb class than the one elicited by natural SARS-CoV-2 infection and suggests the design of highly stable closed-form spike proteins as next-generation vaccine immunogens.
Subjects
Neutralizing Antibodies/immunology, SARS-CoV-2/metabolism, Coronavirus Spike Glycoprotein/immunology, Angiotensin-Converting Enzyme 2/chemistry, Angiotensin-Converting Enzyme 2/metabolism, Monoclonal Antibodies/immunology, Antigen-Antibody Reactions, COVID-19/pathology, COVID-19/virology, Epitopes/immunology, Humans, Mutagenesis, Protein Binding, Protein Conformation, SARS-CoV-2/isolation & purification, Coronavirus Spike Glycoprotein/chemistry, Coronavirus Spike Glycoprotein/genetics, Coronavirus Spike Glycoprotein/metabolism
ABSTRACT
Accurately predicting ligand binding affinity in a virtual screening campaign is still challenging. Here, we developed hybrid neural network (HNN) deep learning methods, HNN-denovo and HNN-affinity, by combining the 3D-CNN (convolutional neural network) and the FFNN (feed-forward neural network) in a hybrid neural network framework. HNN-denovo uses protein pocket structure and protein-ligand interactions as input features. HNN-affinity uses protein sequences and ligand features as input features. The HNN method combines the CNN and FCNN architectures for the protein structure or protein sequence and ligand descriptors. To train the models, we used thousands of known protein-ligand binding affinity data points retrieved from the PDBBind database. We also developed Random Forest (RF), Gradient Boosting (GB), Decision Tree with AdaBoost (DT), and consensus models. We compared the HNN results with models developed based on the RF, GB, and DT methods. We also independently compared the HNN results with the literature-reported deep learning protein-ligand binding affinity predictions made by DLSCORE, KDEEP, and DeepAtom. The predictive performance of the HNN methods (maximum Pearson's R achieved was 0.86) was consistently better than or comparable to that of the DLSCORE, KDEEP, and DeepAtom deep learning methods for both balanced and unbalanced data sets. HNN-affinity can be applied to protein-ligand affinity prediction even in the absence of protein structure information, as it considers the protein sequence as a standalone feature in addition to the ligand descriptors. The HNN-denovo method can be efficiently applied to structure-based de novo drug design campaigns. The HNN-affinity method can be used in conjunction with deep learning molecular docking protocols as a standalone. Further, it can be combined with conventional molecular docking methods as a multistep approach to rapidly screen billions of diverse compounds. The HNN methods are highly scalable on a cloud ML platform.
Subjects
Deep Learning, Ligands, Molecular Docking Simulation, Computer Neural Networks, Proteins/chemistry, Drug Design
ABSTRACT
Accurate prediction of peptide binding affinity to the major histocompatibility complex (MHC) proteins has the potential to design better therapeutic vaccines. Previous work has shown that pan-specific prediction algorithms can achieve better prediction performance than other approaches. However, most of the top algorithms are neural networks based black box models. Here, we propose DeepAttentionPan, an improved pan-specific model, based on convolutional neural networks and attention mechanisms for more flexible, stable and interpretable MHC-I binding prediction. With the attention mechanism, our ensemble model consisting of 20 trained networks achieves high and more stabilized prediction performance. Extensive tests on IEDB's weekly benchmark dataset show that our method achieves state-of-the-art prediction performance on 21 test allele datasets. Analysis of the peptide positional attention weights learned by our model demonstrates its capability to capture critical binding positions of the peptides, which leads to mechanistic understanding of MHC-peptide binding with high alignment with experimentally verified results. Furthermore, we show that with transfer learning, our pan model can be fine-tuned for alleles with few samples to achieve additional performance improvement. DeepAttentionPan is freely available as an open-source software at https://github.com/jjin49/DeepAttentionPan.
Subjects
Deep Learning, HLA-A Antigens/chemistry, Peptides/chemistry, Alleles, Area Under Curve, Benchmarking, Binding Sites, Protein Databases, Datasets as Topic, HLA-A Antigens/immunology, HLA-A Antigens/metabolism, Humans, Peptides/immunology, Peptides/metabolism, Protein Binding
ABSTRACT
In the field of drug-target interaction prediction, the majority of approaches have formulated the problem as a simple binary classification task and trained their models on binary drug-target interaction datasets. However, the prediction of drug-target interactions is inherently a regression problem, in which interactions are identified according to the binding affinity between drugs and targets. This paper deals with binary drug-target interactions and tries to identify them based on the binding strength of a drug and its target. To this end, we propose a semi-supervised transfer learning approach to predict the binding affinity on a continuous spectrum for binary interactions. Due to the lack of training data with continuous binding affinity in the target domain, the proposed method makes use of the information available in other domains (i.e. the source domain) via the transfer learning approach. The general framework of our algorithm is based on an objective function, which considers the performance in both source and target domains as well as the unlabeled data in the target domain via a regularization term. To optimize this objective function, we make use of a gradient boosting machine, which constructs the final model. To assess the performance of the proposed method, we used benchmark datasets with binary interactions for four classes of human proteins. Our algorithm identifies interactions in a more realistic situation. According to the experimental results, our regression model performs better than the state-of-the-art methods in some procedures.
Subjects
Algorithms, Drug Interactions, Pharmaceutical Preparations/chemistry, Pharmaceutical Preparations/metabolism, Supervised Machine Learning, Humans
ABSTRACT
BACKGROUND: Immunotherapy is a promising route towards personalized cancer treatment. A key algorithmic challenge in this process is to decide whether a given peptide (neoepitope) binds to the major histocompatibility complex (MHC). This is an active area of research, and there are many MHC binding prediction algorithms that can predict the MHC binding affinity for a given peptide to a high degree of accuracy. However, most of the state-of-the-art approaches make use of complicated training and model selection procedures, are restricted to peptides of a certain length and/or rely on heuristics. RESULTS: We put forward USMPep, a simple recurrent neural network that matches state-of-the-art approaches on MHC class I binding prediction with a single, generic architecture and even a single set of hyperparameters, both on IEDB benchmark datasets and on the very recent HPV dataset. Moreover, the algorithm is competitive for a single model trained from scratch, while ensembling multiple regressors and language model pretraining can still slightly improve the performance. The direct application of the approach to MHC class II binding prediction shows solid performance despite limited training data. CONCLUSIONS: We demonstrate that competitive performance in MHC binding affinity prediction can be reached with a standard architecture and training procedure without relying on any heuristics.
Subjects
Algorithms, Histocompatibility Antigens Class II/metabolism, Histocompatibility Antigens Class I/metabolism, Genetic Models, Alleles, Area Under Curve, Base Sequence, Genetic Databases, Humans, Peptides/metabolism, Protein Binding, ROC Curve
ABSTRACT
Accurate prediction of the binding affinity of a protein-ligand complex is essential for efficient and successful rational drug design. Therefore, many binding affinity prediction methods have been developed. In recent years, since deep learning technology has become powerful, it is also implemented to predict affinity. In this work, a new neural network model that predicts the binding affinity of a protein-ligand complex structure is developed. Our model predicts the binding affinity of a complex using the ensemble of multiple independently trained networks that consist of multiple channels of 3-D convolutional neural network layers. Our model was trained using the 3772 protein-ligand complexes from the refined set of the PDBbind-2016 database and tested using the core set of 285 complexes. The benchmark results show that the Pearson correlation coefficient between the predicted binding affinities by our model and the experimental data is 0.827, which is higher than the state-of-the-art binding affinity prediction scoring functions. Additionally, our method ranks the relative binding affinities of possible multiple binders of a protein quite accurately, comparable to the other scoring functions. Last, we measured which structural information is critical for predicting binding affinity and found that the complementarity between the protein and ligand is most important.
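The ensembling step described in this abstract, combining several independently trained networks into one prediction, can be sketched as below. This is only an illustration under stated assumptions: the simple averaging rule, the function name and the three stand-in "models" (plain callables instead of 3-D convolutional networks) are hypothetical and are not the authors' implementation.

```python
from statistics import mean, stdev

def ensemble_predict(models, complex_features):
    """Average the predictions of independently trained models.

    Returns the ensemble mean and the spread across members, which can
    serve as a rough indication of how much the members disagree.
    """
    member_predictions = [model(complex_features) for model in models]
    return mean(member_predictions), stdev(member_predictions)

# Hypothetical stand-ins for trained networks: each maps a feature vector
# describing a protein-ligand complex to a predicted binding affinity.
models = [
    lambda x: 2.0 * x[0] + 1.0,
    lambda x: 2.2 * x[0] + 0.8,
    lambda x: 1.8 * x[0] + 1.2,
]

predicted_affinity, spread = ensemble_predict(models, [2.0])
print(f"predicted affinity = {predicted_affinity:.2f} +/- {spread:.2f}")
```

Averaging tends to cancel errors that differ between members, which is a common motivation for training several networks independently rather than relying on one.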
Subjects
Computer Neural Networks, Protein Binding, Proteins/chemistry, Proteins/metabolism, Computer-Aided Design, Protein Databases, Deep Learning, Drug Design, Drug Discovery, Humans, Ligands, Molecular Docking Simulation, Molecular Dynamics Simulation, User-Computer Interface
ABSTRACT
BACKGROUND: Determining protein-protein interactions and their binding affinity is important in understanding cellular biological processes, discovery and design of novel therapeutics, protein engineering, and mutagenesis studies. Due to the time and effort required in wet lab experiments, computational prediction of binding affinity from sequence or structure is an important area of research. Structure-based methods, though more accurate than sequence-based techniques, are limited in their applicability due to the limited availability of protein structure data. RESULTS: In this study, we propose a novel machine learning method for predicting binding affinity that uses protein 3D structure as privileged information at training time while expecting only protein sequence information during testing. Using the method, which is based on the framework of learning using privileged information (LUPI), we have achieved improved performance over corresponding sequence-based binding affinity prediction methods that do not have access to privileged information during training. Our experiments show that, with the proposed framework, which uses structure only during training, it is possible to achieve classification performance comparable to that obtained using structure-based features. Evaluation on an independent test set shows improved performance over the PPA-Pred2 method as well. CONCLUSIONS: The proposed method outperforms several baseline learners and a state-of-the-art binding affinity predictor not only in cross-validation, but also on an additional validation dataset, demonstrating the utility of the LUPI framework for problems that would benefit from classification using structure-based features. The implementation of LUPI developed for this work is expected to be useful in other areas of bioinformatics as well.
Subjects
Algorithms, Computational Biology/methods, Machine Learning, Proteins/metabolism, Amino Acid Sequence, Ligands, Protein Binding, Proteins/chemistry, ROC Curve, Reproducibility of Results, Support Vector Machine
ABSTRACT
Physics-based free energy simulations have increasingly become an important tool for predicting binding affinity, and the recent introduction of automated protocols has paved the way towards more widespread use in the pharmaceutical industry. The D3R 2016 Grand Challenge 2 provided an opportunity to blindly test the commercial free energy calculation protocol FEP+ and assess its performance relative to other affinity prediction methods. The present D3R free energy prediction challenge was built around two experimental data sets involving inhibitors of farnesoid X receptor (FXR), which is a promising anticancer drug target. The FXR binding site is predominantly hydrophobic, with few conserved interaction motifs and strong induced-fit effects, making it a challenging target for molecular modeling and drug design. For both data sets, we achieved reasonable prediction accuracy (RMSD ≈ 1.4 kcal/mol, rank 3-4 according to RMSD out of 20 submissions), comparable to that of state-of-the-art methods in the field. Our D3R results boosted our confidence in the method and strengthened our desire to expand its applications in future in-house drug design projects.