ABSTRACT
MOTIVATION: Retrosynthesis identifies available precursor molecules for various and novel compounds. With the advancements and practicality of language models, Transformer-based models have increasingly been used to automate this process. However, many existing methods struggle to efficiently capture reaction transformation information, limiting the accuracy and applicability of their predictions. RESULTS: We introduce RetroCaptioner, an advanced end-to-end, Transformer-based framework featuring a Contrastive Reaction Center Captioner. This captioner guides the training of dual-view attention models using a contrastive learning approach. It leverages learned molecular graph representations to capture chemically plausible constraints within a single-step learning process. We integrate the single-encoder, dual-encoder, and encoder-decoder paradigms to effectively fuse information from the sequence and graph representations of molecules. This involves modifying the Transformer encoder into a uni-view sequence encoder and a dual-view module. Furthermore, we enhance the captioning of atomic correspondence between SMILES and graphs. Our proposed method, RetroCaptioner, achieved outstanding performance with 67.2% in top-1 and 93.4% in top-10 exact matched accuracy on the USPTO-50k dataset, alongside an exceptional SMILES validity score of 99.4%. In addition, RetroCaptioner has demonstrated its reliability in generating synthetic routes for the drug protokylol. AVAILABILITY AND IMPLEMENTATION: The code and data are available at https://github.com/guofei-tju/RetroCaptioner.
Subjects
Software, Algorithms, Machine Learning
ABSTRACT
Protein-DNA interactions play crucial roles in the biological systems, and identifying protein-DNA binding sites is the first step for mechanistic understanding of various biological activities (such as transcription and repair) and designing novel drugs. How to accurately identify DNA-binding residues from only protein sequence remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which are limited to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient, respectively. We provide the datasets, the predicted structures and the source codes along with the pre-trained models of GraphSite at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.
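The abstract above reports its gains partly in Matthews correlation coefficient (MCC). As a reminder of what that metric measures for per-residue binary labels, here is a minimal sketch in plain Python. The function name and the toy labels are illustrative assumptions and are not taken from the GraphSite code base.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = binding residue, 0 = non-binding residue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example (made-up labels for eight residues).
labels = [1, 0, 0, 1, 0, 0, 0, 1]
preds = [1, 0, 0, 0, 0, 1, 0, 1]
score = matthews_corrcoef(labels, preds)
```

MCC uses all four cells of the confusion matrix, which makes it better suited than plain accuracy to binding-site data, where non-binding residues usually far outnumber binding ones.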
Subjects
Algorithms, Proteins, Binding Sites, DNA/metabolism, Protein Binding, Protein Domains, Proteins/chemistry
ABSTRACT
Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.
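The abstract does not state which contrastive objective ReaKE uses. As a hedged illustration of the general idea, the sketch below computes an InfoNCE-style loss in plain Python, treating each reactant embedding and its matching product embedding as a positive pair and the other products in the batch as negatives. The choice of InfoNCE, the temperature, and the two-dimensional toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] pairs with anchors[i];
    every other entry of positives serves as an in-batch negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each product lies close to its reactant.
reactants = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
products = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]

aligned = info_nce(reactants, products)
shuffled = info_nce(reactants, products[1:] + products[:1])
```

When the pairs are correctly aligned the loss is close to zero, and it rises sharply once the pairing is shuffled; that gap is the signal a contrastive objective trains on.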
Subjects
Machine Learning, Automated Pattern Recognition
ABSTRACT
Biomedical knowledge graphs (KGs), which can help with the understanding of complex biological systems and pathologies, have begun to play a critical role in medical practice and research. However, challenges remain in their embedding and use due to their complex nature and the specific demands of their construction. Existing studies often suffer from problems such as sparse and noisy datasets, insufficient modeling methods and non-uniform evaluation metrics. In this work, we established a comprehensive KG system for the biomedical field in an attempt to bridge the gap. Here, we introduced PharmKG, a multi-relational, attributed biomedical KG, composed of more than 500 000 individual interconnections between genes, drugs and diseases, with 29 relation types over a vocabulary of ~8000 disambiguated entities. Each entity in PharmKG is attached with heterogeneous, domain-specific information obtained from multi-omics data, i.e. gene expression, chemical structure and disease word embedding, while preserving the semantic and biomedical features. For baselines, we offered nine state-of-the-art KG embedding (KGE) approaches and a new biological, intuitive, graph neural network-based KGE method that uses a combination of both global network structure and heterogeneous domain features. Based on the proposed benchmark, we conducted extensive experiments to assess these KGE models using multiple evaluation metrics. Finally, we discussed our observations across various downstream biological tasks and provide insights and guidelines for how to use a KG in biomedicine. We hope that the unprecedented quality and diversity of PharmKG will lead to advances in biomedical KG construction, embedding and application.
Subjects
Biomedical Research, Data Mining, Neural Networks (Computer), Semantics, Software, Benchmarking, Humans
ABSTRACT
MOTIVATION: The solvent accessible surface is an essential structural property related to protein structure and function. Relative solvent accessible area (RSA) is a standard measure of the degree to which a residue is exposed on the protein surface or buried inside the protein. However, this computation fails when residue information is missing. RESULTS: In this article, we propose a novel deep learning method, EAGERER, for estimating RSA from the Cα atom distance matrix. EAGERER achieves Pearson correlation coefficients of 0.921-0.928 on two independent test datasets. We empirically demonstrate that EAGERER yields better Pearson correlation coefficients than existing RSA estimators, such as coordination number, half sphere exposure and SphereCon. To the best of our knowledge, EAGERER is the first method to estimate the solvent accessible area from limited information with a deep learning model. It could be useful for protein structure and function prediction. AVAILABILITY AND IMPLEMENTATION: The method is freely available at https://github.com/cliffgao/EAGERER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
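Two quantities in this abstract are easy to make concrete: the Cα distance matrix that serves as model input, and the Pearson correlation coefficient used for evaluation. The sketch below computes both in plain Python. The coordinates and RSA values are made-up toy numbers, the function names are hypothetical, and none of this reflects the actual EAGERER implementation.

```python
import math


def ca_distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates (in angstroms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy input: three residues with C-alpha atoms spaced about 3.8 angstroms apart.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
dmat = ca_distance_matrix(ca_coords)

# Toy evaluation: hypothetical true and predicted RSA values for four residues.
true_rsa = [0.10, 0.45, 0.80, 0.30]
pred_rsa = [0.15, 0.40, 0.75, 0.35]
r = pearson(true_rsa, pred_rsa)
```

The distance matrix needs only one atom per residue, which is why an estimator built on it can still run when side-chain information is missing.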
Subjects
Deep Learning, Membrane Proteins, Solvents/chemistry
ABSTRACT
Fragment-based drug discovery is a widely used strategy for drug design in both academic and pharmaceutical industries. Although fragments can be linked to generate candidate compounds by the latest deep generative models, generating linkers with specified attributes remains underdeveloped. In this study, we presented a novel framework, DRlinker, to control fragment linking toward compounds with given attributes through reinforcement learning. The method has been shown to be effective for many tasks from controlling the linker length and log P, optimizing predicted bioactivity of compounds, to various multiobjective tasks. Specifically, our model successfully generated 91.0% and 93.9% of compounds complying with the desired linker length and log P and improved the 7.5 pChEMBL value in bioactivity optimization. Finally, a quasi-scaffold-hopping study revealed that DRlinker could generate nearly 30% molecules with high 3D similarity but low 2D similarity to the lead inhibitor, demonstrating the benefits and applicability of DRlinker in actual fragment-based drug design.
Subjects
Drug Design, Drug Discovery
ABSTRACT
Identifying drug-protein interactions (DPIs) is crucial in drug discovery, and a number of machine learning methods have been developed to predict DPIs. Existing methods usually use unrealistic data sets with hidden bias, which will limit the accuracy of virtual screening methods. Meanwhile, most DPI prediction methods pay more attention to molecular representation but lack effective research on protein representation and high-level associations between different instances. To this end, we present the novel structure-aware multimodal deep DPI prediction model, STAMP-DPI, which was trained on a curated industry-scale benchmark data set. We built a high-quality benchmark data set named GalaxyDB for DPI prediction. This industry-scale data set along with an unbiased training procedure resulted in a more robust benchmark study. For informative protein representation, we constructed a structure-aware graph neural network method from the protein sequence by combining predicted contact maps and graph neural networks. Through further integration of structure-based representation and high-level pretrained embeddings for molecules and proteins, our model effectively captures the feature representation of the interactions between them. As a result, STAMP-DPI outperformed state-of-the-art DPI prediction methods by decreasing 7.00% mean square error (MSE) in the Davis data set and improving 8.89% area under the curve (AUC) in the GalaxyDB data set. Moreover, our model is an interpretable model with the transformer-based interaction mechanism, which can accurately reveal the binding sites between molecules and proteins.
Subjects
Deep Learning, Amino Acid Sequence, Machine Learning, Neural Networks (Computer), Proteins/chemistry
ABSTRACT
The goal of molecular optimization (MO) is to discover molecules that acquire improved pharmaceutical properties over a known starting molecule. Despite many recent successes of new approaches for MO, these methods were typically developed for particular properties with rich annotated training examples. Thus, these approaches are difficult to implement in real scenes where only a small amount of pharmaceutical data is usually available due to the expense and significant effort required for the data collection. Here, we propose a new approach, Meta-MO, for molecular optimization with a handful of training samples based on the well-recognized first-order meta-learning algorithms. By using a set of meta tasks with rich training samples, Meta-MO trains a meta model through the meta-learning optimization and adapts the learned model to new low-resource MO tasks. Meta-MO was shown to consistently outperform several pretraining and multitask training procedures, providing an average improvement in the success rate of 4.3% on a large-scale bioactivity data set with diverse target variations. We also observed that Meta-MO resulted in the best performing models across fine-tuning sets with only dozens of samples. To the best of our knowledge, this is the first study to apply meta learning to MO tasks. More importantly, such a strategy could be further extended to many low-resource scenarios in real-world drug design.
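The abstract refers to first-order meta-learning algorithms without naming one. As an illustration only, the sketch below uses Reptile, a well-known first-order method, on a deliberately tiny problem: each "task" is a one-parameter regression y = slope * x, standing in for a data-rich meta task. The toy tasks, learning rates, and function names are assumptions and do not come from Meta-MO.

```python
def adapt(w, slope, xs, inner_lr=0.1, inner_steps=5):
    """Inner loop: a few gradient steps on one task (fit y = slope * x with y_hat = w * x)."""
    for _ in range(inner_steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - slope * x) * x for x in xs) / len(xs)
        w -= inner_lr * grad
    return w


def reptile(task_slopes, xs, meta_lr=0.1, meta_iters=300):
    """Outer loop: move the meta-parameter toward each task-adapted parameter."""
    w = 0.0
    for step in range(meta_iters):
        slope = task_slopes[step % len(task_slopes)]  # cycle through meta tasks
        w_task = adapt(w, slope, xs)
        w += meta_lr * (w_task - w)  # first-order update, no second derivatives
    return w


xs = [0.5, 1.0, 1.5]
w_meta = reptile([1.0, 2.0, 3.0], xs)

# Adapting to a new, unseen task starts from w_meta and takes only a few steps.
w_new = adapt(w_meta, 2.5, xs)
```

The meta-trained parameter settles near the centre of the training tasks, so a handful of inner steps is enough to move it most of the way to a new task; this is the property that makes such methods attractive when only dozens of samples are available.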
Subjects
Algorithms
ABSTRACT
The protein kinase family contains many promising drug targets. Many kinase inhibitors target the ATP-binding pocket, leading to approved drugs in past decades. Scaffold hopping is an effective approach for drug design. The kinase ATP-binding pocket is highly conserved, crossing the whole kinase family. This provides an opportunity to develop a scaffold hopping approach to explore diversified scaffolds among various kinase inhibitors. In this work, we report the SyntaLinker-Hybrid scheme for kinase inhibitor scaffold hopping. With this scheme, we replace molecular fragments bound at the conserved kinase hinge region with deep generative models. Thus, we are able to generate new kinase-inhibitor-like structures hybridizing the privileged fragments against the hinge region. We demonstrate that this scheme allows generation of kinase-inhibitor-like molecules with novel scaffolds, while retaining the binding features of existing kinase inhibitors. This work can be employed in lead identification against kinase targets.
Subjects
Deep Learning, Drug Design, Protein Binding, Protein Kinase Inhibitors/pharmacology, Protein Kinases
ABSTRACT
Synthesis planning is the process of recursively decomposing target molecules into available precursors. Computer-aided retrosynthesis can potentially assist chemists in designing synthetic routes; however, at present, it is cumbersome and cannot provide satisfactory results. In this study, we have developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis using transformer neural networks. In the method, the retrosynthesis planning was converted to a machine translation problem from the products to molecular linear notations of the reactants. By coupling with a neural network-based syntax corrector, our method achieved an accuracy of 59.0% on a standard benchmark data set, which outperformed other deep learning methods by >21% and template-based methods by >6%. More importantly, our method was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.
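Framing retrosynthesis as machine translation means turning the product SMILES into a source token sequence and the reactant SMILES into a target token sequence. The sketch below shows one common way to tokenize SMILES with a regular expression widely used in the reaction-prediction literature. Whether SCROP uses exactly this tokenizer is an assumption, and the esterification example is a toy case chosen for illustration.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, bonds,
# branches, the "." separator between molecules, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


# Toy source/target pair: ethyl acetate (product) -> acetic acid + ethanol (reactants).
product = "CC(=O)OCC"
reactants = "CC(=O)O.OCC"

src_tokens = tokenize(product)
tgt_tokens = tokenize(reactants)
```

A sequence-to-sequence model is then trained to map `src_tokens` to `tgt_tokens`; because the decoder emits tokens freely, its output may be syntactically invalid SMILES, which is the problem a syntax corrector is meant to address.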
Subjects
Synthetic Chemistry Techniques/methods, Neural Networks (Computer), Datasets as Topic
ABSTRACT
The copper(I)-catalyzed alkyne-azide cycloaddition (CuAAC) reaction, a major click chemistry reaction, is widely employed in drug discovery and chemical biology. However, the success rate of the CuAAC reaction is not as satisfactory as expected. To improve its performance, we developed a recurrent neural network (RNN) model to predict its feasibility. First, we designed and synthesized a structurally diverse library of 700 compounds with the CuAAC reaction to obtain experimental data. Then, using reaction SMILES as input, we built a bidirectional long short-term memory model with a self-attention mechanism (BiLSTM-SA). Our best prediction model has an overall accuracy of 80%. With the self-attention mechanism, adverse substructures responsible for negative reactions were recognized and derived as quantitative descriptors. Density functional theory investigations were conducted to provide evidence for the correlation between bromo-α-C hybrid types and the success rate of the reaction. Quantitative descriptors combined with RDKit descriptors were fed to three machine learning models (a support vector machine, random forest, and logistic regression) and resulted in improved performance. The BiLSTM-SA model for predicting the feasibility of the CuAAC reaction is superior to other conventional learning methods and advances heuristic chemical rules.
Subjects
Alkynes, Azides, Catalysis, Click Chemistry, Copper, Cycloaddition Reaction, Feasibility Studies, Neural Networks (Computer)
ABSTRACT
Recognizing substructures and their relations embedded in a molecular structure representation is a key process for structure-activity or structure-property relationship (SAR/SPR) studies. A molecular structure can be explicitly represented as either a connection table (CT) or a linear notation, such as SMILES, which is a language describing the connectivity of atoms in the molecular structure. Conventional SAR/SPR approaches rely on partitioning the CT into a set of predefined substructures as structural descriptors. In this work, we propose a new method for identifying SAR/SPR through syntax analysis of linear notation (for example, SMILES) with a self-attention mechanism, an interpretable deep learning architecture. The method has been evaluated by predicting chemical properties, toxicology, and bioactivity from experimental data sets. Our results demonstrate that the method yields superior performance compared with state-of-the-art models. Moreover, the method can produce chemically interpretable results, which chemists can use to design and synthesize compounds with improved activity or properties.
Subjects
Cheminformatics/methods, Deep Learning, Solubility, Structure-Activity Relationship, Water/chemistry
ABSTRACT
Illuminating synthetic pathways is essential for producing valuable chemicals, such as bioactive molecules. Chemical and biological syntheses are crucial, and their integration often leads to more efficient and sustainable pathways. Despite the rapid development of retrosynthesis models, few of them consider both chemical and biological syntheses, hindering the pathway design for high-value chemicals. Here, we propose BioNavi by innovating multitask learning and reaction templates into the deep learning-driven model to design hybrid synthesis pathways in a more interpretable manner. BioNavi outperforms existing approaches on different data sets, achieving a 75% hit rate in replicating reported biosynthetic pathways and displaying superior ability in designing hybrid synthesis pathways. Additional case studies further illustrate the potential application of BioNavi in a de novo pathway design. The enhanced web server (http://biopathnavi.qmclab.com/bionavi/) simplifies input operations and implements step-by-step exploration according to user experience. We show that BioNavi is a handy navigator for designing synthetic pathways for various chemicals.
ABSTRACT
Protein functions are characterized by interactions with proteins, drugs, and other biomolecules. Understanding these interactions is essential for deciphering the molecular mechanisms underlying biological processes and developing new therapeutic strategies. Current computational methods mostly predict interactions based on either molecular network or structural information, without integrating them within a unified multi-scale framework. While a few multi-view learning methods are devoted to fusing the multi-scale information, these methods tend to rely heavily on a single scale while under-fitting the others, likely because of the imbalanced nature and inherent greediness of multi-scale learning. To alleviate the optimization imbalance, we present MUSE, a multi-scale representation learning framework based on a variant of expectation maximization that optimizes different scales in an alternating procedure over multiple iterations. This strategy efficiently fuses multi-scale information between the atomic structure and molecular network scales through mutual supervision and iterative optimization. MUSE outperforms the current state-of-the-art models not only in molecular interaction (protein-protein, drug-protein, and drug-drug) tasks but also in protein interface prediction at the atomic structure scale. More importantly, the multi-scale learning framework shows potential for extension to other scales of computational drug discovery.
Subjects
Computational Biology, Proteins, Proteins/chemistry, Proteins/metabolism, Computational Biology/methods, Algorithms, Pharmaceutical Preparations/chemistry, Pharmaceutical Preparations/metabolism, Machine Learning, Drug Interactions, Humans, Protein Binding
ABSTRACT
While significant advances have been made in predicting static protein structures, the inherent dynamics of proteins, modulated by ligands, are crucial for understanding protein function and facilitating drug discovery. Traditional docking methods, frequently used in studying protein-ligand interactions, typically treat proteins as rigid. While molecular dynamics simulations can propose appropriate protein conformations, they're computationally demanding due to rare transitions between biologically relevant equilibrium states. In this study, we present DynamicBind, a deep learning method that employs equivariant geometric diffusion networks to construct a smooth energy landscape, promoting efficient transitions between different equilibrium states. DynamicBind accurately recovers ligand-specific conformations from unbound protein structures without the need for holo-structures or extensive sampling. Remarkably, it demonstrates state-of-the-art performance in docking and virtual screening benchmarks. Our experiments reveal that DynamicBind can accommodate a wide range of large protein conformational changes and identify cryptic pockets in unseen protein targets. As a result, DynamicBind shows potential in accelerating the development of small molecules for previously undruggable targets and expanding the horizons of computational drug discovery.
Subjects
Molecular Dynamics Simulation, Proteins, Ligands, Proteins/metabolism, Protein Conformation, Drug Discovery, Protein Binding, Molecular Docking Simulation
ABSTRACT
Constructing discriminative representations of molecules lies at the core of a number of domains such as drug discovery, chemistry, and medicine. State-of-the-art methods employ graph neural networks and self-supervised learning (SSL) to learn unlabeled data for structural representations, which can then be fine-tuned for downstream tasks. Albeit powerful, these methods are pre-trained solely on molecular structures and thus often struggle with tasks involved in intricate biological processes. Here, it is proposed to assist the learning of molecular representation by using the perturbed high-content cell microscopy images at the phenotypic level. To incorporate the cross-modal pre-training, a unified framework is constructed to align them through multiple types of contrastive loss functions, which is proven effective in the formulated novel tasks to retrieve the molecules and corresponding images mutually. More importantly, the model can infer functional molecules according to cellular images generated by genetic perturbations. In parallel, the proposed model can transfer non-trivially to molecular property predictions, and has shown great improvement over clinical outcome predictions. These results suggest that such cross-modality learning can bridge molecules and phenotype to play important roles in drug discovery.
Subjects
Neural Networks (Computer), Humans, Computer-Assisted Image Processing/methods, Microscopy/methods, Algorithms, Machine Learning
ABSTRACT
Jiahua Rao and Shuangjia Zheng are Ph.D. students in Prof. Yang's lab (Supercomputing And AI for Life science, SAIL Lab) at Sun Yat-sen University. They recently developed a framework to quantitatively assess the interpretability of graph neural networks (GNNs) and compared the results with the judgments of medicinal chemists. Their meaningful benchmarking and rigorous framework should greatly benefit the development of new interpretable methods for GNNs.
ABSTRACT
Graph neural networks (GNNs) have received increasing attention because of their expressive power on topological data, but they are still criticized for their lack of interpretability. To interpret GNN models, explainable artificial intelligence (XAI) methods have been developed. However, these methods are limited to qualitative analyses without quantitative assessments from the real-world datasets due to a lack of ground truths. In this study, we have established five XAI-specific molecular property benchmarks, including two synthetic and three experimental datasets. Through the datasets, we quantitatively assessed six XAI methods on four GNN models and made comparisons with seven medicinal chemists of different experience levels. The results demonstrated that XAI methods could deliver reliable and informative answers for medicinal chemists in identifying the key substructures. Moreover, the identified substructures were shown to complement existing classical fingerprints to improve molecular property predictions, and the improvements increased with the growth of training data.
ABSTRACT
The complete biosynthetic pathways are unknown for most natural products (NPs), so computer-aided bio-retrosynthesis predictions are valuable. Here, a navigable and user-friendly toolkit, BioNavi-NP, is developed to predict the biosynthetic pathways for both NPs and NP-like compounds. First, a single-step bio-retrosynthesis prediction model is trained using both general organic and biosynthetic reactions through end-to-end transformer neural networks. Based on this model, plausible biosynthetic pathways can be efficiently sampled through an AND-OR tree-based planning algorithm from iterative multi-step bio-retrosynthetic routes. Extensive evaluations reveal that BioNavi-NP can identify biosynthetic pathways for 90.2% of 368 test compounds and recover the reported building blocks in the test set for 72.8% of them, 1.7 times more accurate than existing conventional rule-based approaches. The model is further shown to identify biologically plausible pathways for complex NPs collected from the recent literature. The toolkit as well as the curated datasets and learned models are freely available to facilitate the elucidation and reconstruction of the biosynthetic pathways for NPs.
Subjects
Biological Products, Deep Learning, Algorithms, Biosynthetic Pathways, Neural Networks (Computer)
ABSTRACT
MOTIVATION: The interactions of proteins with DNA, RNA, peptides, and carbohydrates play key roles in various biological processes. The study of uncharacterized protein-molecule interactions could be aided by accurate predictions of the residues that bind partner molecules. However, existing methods for predicting binding residues on proteins remain relatively inaccurate because of the limited number of complex structures in databases. As different types of molecules partially share chemical mechanisms, the predictions for each molecule type should benefit from the binding information for other molecule types. RESULTS: In this study, we employed a multi-task deep learning strategy to develop a new sequence-based method, named MTDsite, for simultaneously predicting binding residues/sites for multiple important molecule types. By combining four training sets for DNA-, RNA-, peptide-, and carbohydrate-binding proteins, our method yielded accurate and robust predictions with AUC values of 0.852, 0.836, 0.758, and 0.776 on their respective independent test sets, which are 0.52 to 6.6% better than other state-of-the-art methods. To the best of our knowledge, this is the first method to use a multi-task framework to predict multiple molecular binding sites simultaneously.