Results 1 - 20 of 24
1.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35039821

ABSTRACT

Protein-DNA interactions play crucial roles in biological systems, and identifying protein-DNA binding sites is the first step toward a mechanistic understanding of various biological activities (such as transcription and repair) and toward designing novel drugs. How to accurately identify DNA-binding residues from protein sequence alone remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which limits their ability to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and a graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient. We provide the datasets, the predicted structures and the source code along with the pre-trained models of GraphSite at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.
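
To illustrate the graph node classification framing described in the GraphSite abstract, the following minimal sketch turns residue coordinates into a graph in which every residue is a node that a classifier would label as binding or non-binding. The function name `residue_contact_graph`, the use of one 3D point per residue, and the 8.0 Å cutoff are assumptions made for illustration; the abstract does not state how GraphSite constructs its graph.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def residue_contact_graph(coords: Sequence[Point], cutoff: float = 8.0) -> Dict[int, List[int]]:
    """Return an adjacency list linking residues closer than `cutoff`.

    Each residue is a node; a classifier would then assign every node a
    binding / non-binding label. The cutoff is a hypothetical value.
    """
    graph: Dict[int, List[int]] = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Toy coordinates (not real protein data): three residues in a row.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(residue_contact_graph(toy))
```

In this toy input the first two residues are linked and the third is isolated; in practice the coordinates would come from a predicted structure, and node features would be attached before classification.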


Subject(s)
Algorithms , Proteins , Binding Sites , DNA/metabolism , Protein Binding , Protein Domains , Proteins/chemistry
2.
J Chem Inf Model ; 64(6): 1945-1954, 2024 Mar 25.
Article in English | MEDLINE | ID: mdl-38484468

ABSTRACT

Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.


Subject(s)
Machine Learning , Pattern Recognition, Automated
3.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33341877

ABSTRACT

Biomedical knowledge graphs (KGs), which can help with the understanding of complex biological systems and pathologies, have begun to play a critical role in medical practice and research. However, challenges remain in their embedding and use due to their complex nature and the specific demands of their construction. Existing studies often suffer from problems such as sparse and noisy datasets, insufficient modeling methods and non-uniform evaluation metrics. In this work, we established a comprehensive KG system for the biomedical field in an attempt to bridge the gap. Here, we introduced PharmKG, a multi-relational, attributed biomedical KG, composed of more than 500 000 individual interconnections between genes, drugs and diseases, with 29 relation types over a vocabulary of ~8000 disambiguated entities. Each entity in PharmKG is attached with heterogeneous, domain-specific information obtained from multi-omics data, i.e. gene expression, chemical structure and disease word embedding, while preserving the semantic and biomedical features. For baselines, we offered nine state-of-the-art KG embedding (KGE) approaches and a new biological, intuitive, graph neural network-based KGE method that uses a combination of both global network structure and heterogeneous domain features. Based on the proposed benchmark, we conducted extensive experiments to assess these KGE models using multiple evaluation metrics. Finally, we discussed our observations across various downstream biological tasks and provide insights and guidelines for how to use a KG in biomedicine. We hope that the unprecedented quality and diversity of PharmKG will lead to advances in biomedical KG construction, embedding and application.
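
To make the idea of knowledge graph embedding (KGE) in the PharmKG abstract concrete, the sketch below scores (head, relation, tail) triples with a TransE-style function, one widely known KGE scoring scheme. TransE is used here only as an example; the abstract does not list the nine baselines. The entity names, relation name, and 2D embedding values are all made up.

```python
import math
from typing import Dict, List, Tuple

# Made-up 2D embeddings for illustration only.
ENTITY: Dict[str, List[float]] = {
    "drug_a": [0.0, 0.0],
    "gene_b": [1.0, 0.0],
    "disease_c": [0.0, 3.0],
}
RELATION: Dict[str, List[float]] = {"targets": [1.0, 0.0]}


def transe_score(head: str, relation: str, tail: str) -> float:
    """TransE plausibility: negative distance between head + relation and tail."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.dist([a + b for a, b in zip(h, r)], t)


def rank_tails(head: str, relation: str) -> List[Tuple[str, float]]:
    """Rank every entity as a candidate tail, most plausible first."""
    scored = [(e, transe_score(head, relation, e)) for e in ENTITY]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(rank_tails("drug_a", "targets"))
```

Ranking candidate tails in this way is the usual basis for link-prediction metrics, which is how KGE models are typically compared on a benchmark.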


Subject(s)
Biomedical Research , Data Mining , Neural Networks, Computer , Semantics , Software , Benchmarking , Humans
4.
Bioinformatics ; 38(1): 94-98, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34450651

ABSTRACT

MOTIVATION: The solvent accessible surface is an essential structural property related to protein structure and protein function. Relative solvent accessible area (RSA) is a standard measure of the degree to which a residue is exposed on the protein surface or buried inside the protein. However, this computation fails when residue information is missing. RESULTS: In this article, we propose a novel method for estimating RSA from the Cα atom distance matrix with a deep learning method (EAGERER). EAGERER achieves Pearson correlation coefficients of 0.921-0.928 on two independent test datasets. We empirically demonstrate that EAGERER yields better Pearson correlation coefficients than existing RSA estimators, such as coordination number, half sphere exposure and SphereCon. To the best of our knowledge, EAGERER represents the first method to estimate the solvent accessible area using limited information with a deep learning model. It could be useful for protein structure and protein function prediction. AVAILABILITY AND IMPLEMENTATION: The method is freely available at https://github.com/cliffgao/EAGERER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
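
The Cα distance matrix named in the EAGERER abstract, and the coordination number it lists as a baseline estimator, can be sketched as follows. This does not reproduce the deep learning model; the function names, the 10 Å radius, and the toy coordinates are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(ca_coords: Sequence[Point]) -> List[List[float]]:
    """Pairwise Euclidean distances between C-alpha atoms."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def coordination_numbers(dmat: Sequence[Sequence[float]], radius: float = 10.0) -> List[int]:
    """Count, for each residue, the other residues within `radius`.

    A high count suggests a buried residue; a low count suggests an
    exposed one. The radius is a hypothetical choice.
    """
    return [
        sum(1 for j, d in enumerate(row) if j != i and d <= radius)
        for i, row in enumerate(dmat)
    ]


# Toy coordinates, not real protein data.
toy = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
dmat = distance_matrix(toy)
print(dmat[0][1])                  # distance between residues 0 and 1
print(coordination_numbers(dmat))  # neighbor counts per residue
```

Because only Cα positions are needed, both quantities remain computable when side-chain atoms are missing, which is the situation the abstract highlights.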


Subject(s)
Deep Learning , Membrane Proteins , Solvents/chemistry
5.
J Chem Inf Model ; 62(23): 5907-5917, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-36404642

ABSTRACT

Fragment-based drug discovery is a widely used strategy for drug design in both academic and pharmaceutical industries. Although fragments can be linked to generate candidate compounds by the latest deep generative models, generating linkers with specified attributes remains underdeveloped. In this study, we presented a novel framework, DRlinker, to control fragment linking toward compounds with given attributes through reinforcement learning. The method has been shown to be effective for many tasks from controlling the linker length and log P, optimizing predicted bioactivity of compounds, to various multiobjective tasks. Specifically, our model successfully generated 91.0% and 93.9% of compounds complying with the desired linker length and log P and improved the 7.5 pChEMBL value in bioactivity optimization. Finally, a quasi-scaffold-hopping study revealed that DRlinker could generate nearly 30% molecules with high 3D similarity but low 2D similarity to the lead inhibitor, demonstrating the benefits and applicability of DRlinker in actual fragment-based drug design.


Subject(s)
Drug Design , Drug Discovery
6.
J Chem Inf Model ; 62(5): 1308-1317, 2022 03 14.
Article in English | MEDLINE | ID: mdl-35200015

ABSTRACT

Identifying drug-protein interactions (DPIs) is crucial in drug discovery, and a number of machine learning methods have been developed to predict DPIs. Existing methods usually use unrealistic data sets with hidden bias, which limits the accuracy of virtual screening methods. Meanwhile, most DPI prediction methods pay more attention to molecular representation but lack effective research on protein representation and high-level associations between different instances. To this end, we present the novel structure-aware multimodal deep DPI prediction model, STAMP-DPI, which was trained on a curated industry-scale benchmark data set. We built a high-quality benchmark data set named GalaxyDB for DPI prediction. This industry-scale data set along with an unbiased training procedure resulted in a more robust benchmark study. For informative protein representation, we constructed a structure-aware graph neural network method from the protein sequence by combining predicted contact maps and graph neural networks. Through further integration of structure-based representation and high-level pretrained embeddings for molecules and proteins, our model effectively captures the feature representation of the interactions between them. As a result, STAMP-DPI outperformed state-of-the-art DPI prediction methods, decreasing the mean square error (MSE) by 7.00% on the Davis data set and improving the area under the curve (AUC) by 8.89% on the GalaxyDB data set. Moreover, our model is interpretable through its transformer-based interaction mechanism, which can accurately reveal the binding sites between molecules and proteins.


Subject(s)
Deep Learning , Amino Acid Sequence , Machine Learning , Neural Networks, Computer , Proteins/chemistry
7.
J Chem Inf Model ; 61(4): 1627-1636, 2021 04 26.
Article in English | MEDLINE | ID: mdl-33729779

ABSTRACT

The goal of molecular optimization (MO) is to discover molecules that acquire improved pharmaceutical properties over a known starting molecule. Despite many recent successes of new approaches for MO, these methods were typically developed for particular properties with rich annotated training examples. Thus, these approaches are difficult to implement in real scenes where only a small amount of pharmaceutical data is usually available due to the expense and significant effort required for the data collection. Here, we propose a new approach, Meta-MO, for molecular optimization with a handful of training samples based on the well-recognized first-order meta-learning algorithms. By using a set of meta tasks with rich training samples, Meta-MO trains a meta model through the meta-learning optimization and adapts the learned model to new low-resource MO tasks. Meta-MO was shown to consistently outperform several pretraining and multitask training procedures, providing an average improvement in the success rate of 4.3% on a large-scale bioactivity data set with diverse target variations. We also observed that Meta-MO resulted in the best performing models across fine-tuning sets with only dozens of samples. To the best of our knowledge, this is the first study to apply meta learning to MO tasks. More importantly, such a strategy could be further extended to many low-resource scenarios in real-world drug design.
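
The Meta-MO abstract refers to first-order meta-learning without naming a specific algorithm. As a rough illustration of that family, the sketch below applies a Reptile-style update to a toy one-parameter problem in which each "task" is to minimize `(theta - target)**2`. All names and hyperparameters are hypothetical, and this is not the Meta-MO implementation.

```python
from typing import Sequence


def adapt(theta: float, target: float, lr: float = 0.1, steps: int = 5) -> float:
    """Inner loop: a few gradient steps on the task loss (theta - target)**2."""
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta -= lr * grad
    return theta


def reptile(targets: Sequence[float], meta_lr: float = 0.1, iterations: int = 200) -> float:
    """Outer loop: move the meta-parameter toward each task-adapted parameter."""
    theta = 0.0
    for it in range(iterations):
        target = targets[it % len(targets)]   # cycle through the meta tasks
        adapted = adapt(theta, target)
        theta += meta_lr * (adapted - theta)  # first-order update, no second derivatives
    return theta


meta_theta = reptile([1.0, 3.0])
print(meta_theta)              # settles near the midpoint of the two task optima
print(adapt(meta_theta, 2.5))  # a few steps of adaptation to a new, unseen task
```

The point of the outer loop is that the learned starting value sits close to all task optima, so a handful of inner steps on a new low-resource task already moves it most of the way.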


Subject(s)
Algorithms
8.
J Chem Inf Model ; 61(10): 4900-4912, 2021 10 25.
Article in English | MEDLINE | ID: mdl-34586824

ABSTRACT

The protein kinase family contains many promising drug targets. Many kinase inhibitors target the ATP-binding pocket, leading to approved drugs in past decades. Scaffold hopping is an effective approach for drug design. The kinase ATP-binding pocket is highly conserved, crossing the whole kinase family. This provides an opportunity to develop a scaffold hopping approach to explore diversified scaffolds among various kinase inhibitors. In this work, we report the SyntaLinker-Hybrid scheme for kinase inhibitor scaffold hopping. With this scheme, we replace molecular fragments bound at the conserved kinase hinge region with deep generative models. Thus, we are able to generate new kinase-inhibitor-like structures hybridizing the privileged fragments against the hinge region. We demonstrate that this scheme allows generation of kinase-inhibitor-like molecules with novel scaffolds, while retaining the binding features of existing kinase inhibitors. This work can be employed in lead identification against kinase targets.


Subject(s)
Deep Learning , Drug Design , Protein Binding , Protein Kinase Inhibitors/pharmacology , Protein Kinases
9.
J Chem Inf Model ; 60(1): 47-55, 2020 01 27.
Article in English | MEDLINE | ID: mdl-31825611

ABSTRACT

Synthesis planning is the process of recursively decomposing target molecules into available precursors. Computer-aided retrosynthesis can potentially assist chemists in designing synthetic routes; however, at present, it is cumbersome and cannot provide satisfactory results. In this study, we have developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis using transformer neural networks. In the method, the retrosynthesis planning was converted to a machine translation problem from the products to molecular linear notations of the reactants. By coupling with a neural network-based syntax corrector, our method achieved an accuracy of 59.0% on a standard benchmark data set, which outperformed other deep learning methods by >21% and template-based methods by >6%. More importantly, our method was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.
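
Treating retrosynthesis as translation from product SMILES to reactant SMILES first requires splitting each SMILES string into tokens. The sketch below uses a regular-expression tokenizer of a commonly used style; the abstract does not specify SCROP's tokenizer, so the pattern and the name `tokenize_smiles` are assumptions, and the pattern is illustrative rather than exhaustive.

```python
import re
from typing import List

# Bracket atoms, two-letter halogens, single-letter atoms, ring-closure
# digits, bonds and branch symbols. Illustrative pattern, not exhaustive.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFI]|[bcnosp]|[0-9]|[()=#\-+\\/:.~@*>])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens and check nothing was dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


product = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as a toy input
print(" ".join(tokenize_smiles(product)))
```

The resulting token sequence is what a sequence-to-sequence model would consume as the "source sentence", with the reactant SMILES tokenized the same way as the "target sentence".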


Subject(s)
Chemistry Techniques, Synthetic/methods , Neural Networks, Computer , Datasets as Topic
10.
J Chem Inf Model ; 60(3): 1165-1174, 2020 03 23.
Article in English | MEDLINE | ID: mdl-32013419

ABSTRACT

The copper(I)-catalyzed alkyne-azide cycloaddition (CuAAC) reaction, a major click chemistry reaction, is widely employed in drug discovery and chemical biology. However, the success rate of the CuAAC reaction is not as satisfactory as expected, and in order to improve its performance, we developed a recurrent neural network (RNN) model to predict its feasibility. First, we designed and synthesized a structurally diverse library of 700 compounds with the CuAAC reaction to obtain experimental data. Then, using reaction SMILES as input, we generated a bidirectional long-short-term memory with a self-attention mechanism (BiLSTM-SA) model. Our best prediction model has a total accuracy of 80%. With the self-attention mechanism, adverse substructures responsible for negative reactions were recognized and derived as quantitative descriptors. Density functional theory investigations were conducted to provide evidence for the correlation between bromo-α-C hybrid types and the success rate of the reaction. Quantitative descriptors combined with RDKit descriptors were fed to three machine learning models (a support vector machine, random forest, and logistic regression) and resulted in improved performance. The BiLSTM-SA model for predicting the feasibility of the CuAAC reaction is superior to other conventional learning methods and advances heuristic chemical rules.


Subject(s)
Alkynes , Azides , Catalysis , Click Chemistry , Copper , Cycloaddition Reaction , Feasibility Studies , Neural Networks, Computer
11.
J Chem Inf Model ; 59(2): 914-923, 2019 02 25.
Article in English | MEDLINE | ID: mdl-30669836

ABSTRACT

Recognizing substructures and their relations embedded in a molecular structure representation is a key process for structure-activity or structure-property relationship (SAR/SPR) studies. A molecular structure can be explicitly represented as either a connection table (CT) or a linear notation, such as SMILES, which is a language describing the connectivity of atoms in the molecular structure. Conventional SAR/SPR approaches rely on partitioning the CT into a set of predefined substructures as structural descriptors. In this work, we propose a new method for identifying SAR/SPR through linear notation (for example, SMILES) syntax analysis with a self-attention mechanism, an interpretable deep learning architecture. The method has been evaluated by predicting chemical properties, toxicology, and bioactivity from experimental data sets. Our results demonstrate that the method yields superior performance compared with state-of-the-art models. Moreover, the method can produce chemically interpretable results, which chemists can use to design and synthesize activity- or property-improved compounds.
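
The self-attention mechanism mentioned in this abstract can be illustrated with a minimal scaled dot-product sketch over token vectors. The attention weights it returns are the kind of quantity that can be inspected to see which tokens influence each other. Learned projection matrices are omitted, the embeddings are made up, and the abstract does not specify the architecture at this level of detail, so this is a sketch under those assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Sequence[Vector]) -> Tuple[List[List[float]], List[List[float]]]:
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Returns (weights, outputs). Learned projection matrices are omitted
    to keep the sketch short.
    """
    dim = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs


# Toy embeddings for three SMILES tokens (values are made up).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, outputs = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, and tokens with similar embeddings receive larger mutual weights, which is what makes the weights readable as a per-token importance map.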


Subject(s)
Cheminformatics/methods , Deep Learning , Solubility , Structure-Activity Relationship , Water/chemistry
12.
Nat Commun ; 15(1): 4476, 2024 May 25.
Article in English | MEDLINE | ID: mdl-38796523

ABSTRACT

Protein functions are characterized by interactions with proteins, drugs, and other biomolecules. Understanding these interactions is essential for deciphering the molecular mechanisms underlying biological processes and developing new therapeutic strategies. Current computational methods mostly predict interactions based on either molecular network or structural information, without integrating them within a unified multi-scale framework. While a few multi-view learning methods are devoted to fusing the multi-scale information, these methods tend to rely heavily on a single scale and under-fit the others, which is likely attributable to the imbalanced nature and inherent greediness of multi-scale learning. To alleviate the optimization imbalance, we present MUSE, a multi-scale representation learning framework based on a variant of expectation maximization that optimizes different scales in an alternating procedure over multiple iterations. This strategy efficiently fuses multi-scale information between the atomic structure and molecular network scales through mutual supervision and iterative optimization. MUSE outperforms the current state-of-the-art models not only in molecular interaction (protein-protein, drug-protein, and drug-drug) tasks but also in protein interface prediction at the atomic structure scale. More importantly, the multi-scale learning framework shows potential for extension to other scales of computational drug discovery.


Subject(s)
Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Algorithms , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism , Machine Learning , Drug Interactions , Humans , Protein Binding
13.
Nat Commun ; 15(1): 1071, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38316797

ABSTRACT

While significant advances have been made in predicting static protein structures, the inherent dynamics of proteins, modulated by ligands, are crucial for understanding protein function and facilitating drug discovery. Traditional docking methods, frequently used in studying protein-ligand interactions, typically treat proteins as rigid. While molecular dynamics simulations can propose appropriate protein conformations, they're computationally demanding due to rare transitions between biologically relevant equilibrium states. In this study, we present DynamicBind, a deep learning method that employs equivariant geometric diffusion networks to construct a smooth energy landscape, promoting efficient transitions between different equilibrium states. DynamicBind accurately recovers ligand-specific conformations from unbound protein structures without the need for holo-structures or extensive sampling. Remarkably, it demonstrates state-of-the-art performance in docking and virtual screening benchmarks. Our experiments reveal that DynamicBind can accommodate a wide range of large protein conformational changes and identify cryptic pockets in unseen protein targets. As a result, DynamicBind shows potential in accelerating the development of small molecules for previously undruggable targets and expanding the horizons of computational drug discovery.


Subject(s)
Molecular Dynamics Simulation , Proteins , Ligands , Proteins/metabolism , Protein Conformation , Drug Discovery , Protein Binding , Molecular Docking Simulation
14.
Patterns (N Y) ; 3(12): 100653, 2022 Dec 09.
Article in English | MEDLINE | ID: mdl-36569549

ABSTRACT

Jiahua Rao and Shuangjia Zheng are Ph.D. students in Prof. Yang's lab (Supercomputing And AI for Life science, SAIL Lab) at Sun Yat-sen University. They recently developed a framework to quantitatively assess the interpretability of graph neural networks (GNNs) and made comparisons with medicinal chemists. Their meaningful benchmarking and rigorous framework would greatly benefit the development of new interpretable methods for GNNs.

15.
Patterns (N Y) ; 3(12): 100628, 2022 Dec 09.
Article in English | MEDLINE | ID: mdl-36569553

ABSTRACT

Graph neural networks (GNNs) have received increasing attention because of their expressive power on topological data, but they are still criticized for their lack of interpretability. To interpret GNN models, explainable artificial intelligence (XAI) methods have been developed. However, these methods are limited to qualitative analyses without quantitative assessments from the real-world datasets due to a lack of ground truths. In this study, we have established five XAI-specific molecular property benchmarks, including two synthetic and three experimental datasets. Through the datasets, we quantitatively assessed six XAI methods on four GNN models and made comparisons with seven medicinal chemists of different experience levels. The results demonstrated that XAI methods could deliver reliable and informative answers for medicinal chemists in identifying the key substructures. Moreover, the identified substructures were shown to complement existing classical fingerprints to improve molecular property predictions, and the improvements increased with the growth of training data.

16.
Nat Commun ; 13(1): 3342, 2022 06 10.
Article in English | MEDLINE | ID: mdl-35688826

ABSTRACT

The complete biosynthetic pathways are unknown for most natural products (NPs); it is thus valuable to make computer-aided bio-retrosynthesis predictions. Here, a navigable and user-friendly toolkit, BioNavi-NP, is developed to predict the biosynthetic pathways for both NPs and NP-like compounds. First, a single-step bio-retrosynthesis prediction model is trained using both general organic and biosynthetic reactions through end-to-end transformer neural networks. Based on this model, plausible biosynthetic pathways can be efficiently sampled through an AND-OR tree-based planning algorithm from iterative multi-step bio-retrosynthetic routes. Extensive evaluations reveal that BioNavi-NP can identify biosynthetic pathways for 90.2% of 368 test compounds and recover the reported building blocks as in the test set for 72.8%, 1.7 times more accurate than existing conventional rule-based approaches. The model is further shown to identify biologically plausible pathways for complex NPs collected from the recent literature. The toolkit as well as the curated datasets and learned models are freely available to facilitate the elucidation and reconstruction of the biosynthetic pathways for NPs.
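
The AND-OR tree idea named in the BioNavi-NP abstract can be sketched on a toy reaction table: a compound (OR node) is solved if it is a building block or if any one reaction producing it is solved, and a reaction (AND node) is solved only if all of its precursors are solved. The reaction table, compound names, and the simple depth-limited recursion below are hypothetical stand-ins; the actual planner samples routes guided by a learned single-step model.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical single-step predictions: product -> candidate precursor sets.
REACTIONS: Dict[str, List[List[str]]] = {
    "target": [["A", "X"], ["B", "C"]],
    "B": [["D"]],
    "X": [],  # no known route
}
BUILDING_BLOCKS: FrozenSet[str] = frozenset({"A", "C", "D"})


def find_route(compound: str, depth: int = 5) -> Optional[dict]:
    """Return a nested route dict for `compound`, or None if unsolved.

    OR node: a compound is solved if ANY of its reactions is solved.
    AND node: a reaction is solved only if ALL precursors are solved.
    """
    if compound in BUILDING_BLOCKS:
        return {"compound": compound, "via": None}
    if depth == 0:
        return None
    for precursors in REACTIONS.get(compound, []):
        subroutes = [find_route(p, depth - 1) for p in precursors]
        if all(r is not None for r in subroutes):
            return {"compound": compound, "via": subroutes}
    return None


route = find_route("target")
print(route)
```

Here the first candidate reaction fails because `X` has no route, so the search falls back to the second one, which resolves through `B` and `D`.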


Subject(s)
Biological Products , Deep Learning , Algorithms , Biosynthetic Pathways , Neural Networks, Computer
17.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3735-3743, 2022.
Article in English | MEDLINE | ID: mdl-34637380

ABSTRACT

MOTIVATION: The interactions of proteins with DNA, RNA, peptides, and carbohydrates play key roles in various biological processes. Studies of uncharacterized protein-molecule interactions could be aided by accurate predictions of residues that bind with partner molecules. However, the existing methods for predicting binding residues on proteins remain of relatively low accuracy due to the limited number of complex structures in databases. As different types of molecules partially share chemical mechanisms, the predictions for each molecular type should benefit from the binding information of other molecule types. RESULTS: In this study, we employed a multi-task deep learning strategy to develop a new sequence-based method, named MTDsite, for simultaneously predicting binding residues/sites for multiple important molecule types. By combining four training sets for DNA, RNA, peptide, and carbohydrate-binding proteins, our method yielded accurate and robust predictions with AUC values of 0.852, 0.836, 0.758, and 0.776 on their respective independent test sets, which are 0.52 to 6.6% better than other state-of-the-art methods. To the best of our knowledge, this is the first method using a multi-task framework to predict multiple molecular binding sites simultaneously.


Subject(s)
Peptides , RNA , RNA/chemistry , Peptides/chemistry , Neural Networks, Computer , Proteins/chemistry , Binding Sites , Carbohydrates , DNA/genetics , DNA/metabolism , Protein Binding
18.
Nat Biomed Eng ; 6(1): 76-93, 2022 01.
Article in English | MEDLINE | ID: mdl-34992270

ABSTRACT

A reduced removal of dysfunctional mitochondria is common to aging and age-related neurodegenerative pathologies such as Alzheimer's disease (AD). Strategies for treating such impaired mitophagy would benefit from the identification of mitophagy modulators. Here we report the combined use of unsupervised machine learning (involving vector representations of molecular structures, pharmacophore fingerprinting and conformer fingerprinting) and a cross-species approach for the screening and experimental validation of new mitophagy-inducing compounds. From a library of naturally occurring compounds, the workflow allowed us to identify 18 small molecules, and among them two potent mitophagy inducers (Kaempferol and Rhapontigenin). In nematode and rodent models of AD, we show that both mitophagy inducers increased the survival and functionality of glutamatergic and cholinergic neurons, abrogated amyloid-β and tau pathologies, and improved the animals' memory. Our findings suggest the existence of a conserved mechanism of memory loss across the AD models, this mechanism being mediated by defective mitophagy. The computational-experimental screening and validation workflow might help uncover potent mitophagy modulators that stimulate neuronal health and brain homeostasis.


Subject(s)
Alzheimer Disease , Mitophagy , Alzheimer Disease/drug therapy , Alzheimer Disease/pathology , Amyloid beta-Peptides , Animals , Machine Learning , Mitophagy/physiology , Workflow
19.
J Cheminform ; 13(1): 7, 2021 Feb 08.
Article in English | MEDLINE | ID: mdl-33557952

ABSTRACT

Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on the one-dimensional embedding of amino acids, which is limited in capturing spatial structural information. In this study, we have developed a new structure-aware method, GraphSol, to predict protein solubility with an attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed through contact maps predicted from the sequence alone. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable by consistent [Formula: see text] of 0.48 in both the cross-validation and independent test of the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based protein solubility predictions. More importantly, this architecture could be easily extended to other protein prediction tasks requiring a raw protein sequence.

20.
J Cheminform ; 13(1): 87, 2021 Nov 13.
Article in English | MEDLINE | ID: mdl-34774103

ABSTRACT

Scaffold hopping is a central task of modern medicinal chemistry for rational drug design, which aims to design molecules with novel scaffolds that share similar target biological activities with known hit molecules. Traditionally, scaffold hopping depends on searching databases of available compounds, which cannot exploit the vast chemical space. In this study, we have re-formulated this task as a supervised molecule-to-molecule translation to generate hopped molecules that are novel in 2D structure but similar in 3D structure, inspired by the fact that candidate compounds bind with their targets through 3D conformations. To efficiently train the model, we curated over 50 thousand pairs of molecules with increased bioactivity, similar 3D structure, but different 2D structure from a public bioactivity database, spanning 40 kinases commonly investigated by medicinal chemists. Moreover, we have designed a multimodal molecular transformer architecture by integrating molecular 3D conformers through a spatial graph neural network and protein sequence information through a Transformer. The trained DeepHop model was shown to be able to generate around 70% of molecules with improved bioactivity together with high 3D similarity but low 2D scaffold similarity to the template molecules. This ratio was 1.9 times higher than that of other state-of-the-art deep learning methods and rule- and virtual screening-based methods. Furthermore, we demonstrated that the model could generalize to new target proteins through fine-tuning with a small set of active compounds. Case studies have also shown the advantages and usefulness of DeepHop in practical scaffold hopping scenarios.
