RESUMEN
MOTIVATION: Drug discovery has witnessed intensive exploration of predictive modeling of drug-target physical interactions over two decades. However, a critical knowledge gap needs to be filled for correlating drug-target interactions with clinical outcomes: predicting genome-wide receptor activities or function selectivity, especially agonist versus antagonist, induced by novel chemicals. Two major obstacles compound the difficulty on this task: known data of receptor activity is far too scarce to train a robust model in light of genome-scale applications, and real-world applications need to deploy a model on data from various shifted distributions. RESULTS: To address these challenges, we have developed an end-to-end deep learning framework, DeepREAL, for multi-scale modeling of genome-wide ligand-induced receptor activities. DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems. Extensive benchmark studies on G-protein coupled receptors (GPCRs), which simulate real-world scenarios, demonstrate that DeepREAL achieves state-of-the-art performances in out-of-distribution settings. DeepREAL can be extended to other gene families beyond GPCRs. AVAILABILITY AND IMPLEMENTATION: All data used are downloaded from Pfam (Mistry et al., 2020), GLASS (Chan et al., 2015) and IUPHAR/BPS and the data from reference (Sakamuru et al., 2021). Readers are directed to their official website for original data. Code is available on GitHub https://github.com/XieResearchGroup/DeepREAL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Aprendizaje Profundo , Ligandos , Secuencia de Aminoácidos , Descubrimiento de Drogas , Receptores Acoplados a Proteínas G/metabolismoRESUMEN
Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein but mostly fail when sequence homology between an unannotated protein and those with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionary divergent unannotated proteins, whose ligand cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning of unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for the protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignment (MSA) to capture functional relationships between proteins without the knowledge of their structure and function. Followed by the DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In the benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms the state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining critically contributes to the success of DISAE. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.