ABSTRACT
Reinforcement learning (RL) has been applied to various domains in computational chemistry and has found widespread success. In this review, we first motivate the application of RL to chemistry and list some broad application domains, for example, molecule generation, geometry optimization, and retrosynthetic pathway search. We set up some of the formalism associated with reinforcement learning that should help the reader translate their chemistry problems into a form where RL can be used to solve them. We then discuss the solution formulations and algorithms proposed in recent literature for these problems, their relative advantages, and the necessary details of the RL algorithms they employ. This article should help the reader understand the state of RL applications in chemistry, learn about some relevant actively researched open problems, gain insight into how RL can be used to approach them, and hopefully inspire innovative RL applications in chemistry.
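As a small illustration of the kind of formalism the review refers to, the sketch below computes the discounted return G_t = r_t + gamma * G_{t+1} for one episode. The reward values, the discount factor, and the molecule-building framing are illustrative assumptions and are not taken from the review.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return G_t for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: no reward while atoms are being added, then a
# terminal property score once the molecule is complete.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
episode_returns = discounted_returns(episode_rewards, gamma=0.5)
```

With a sparse terminal reward like this one, the return seen at early steps shrinks geometrically with the episode length, which is one reason the choice of discount factor matters when an agent builds a molecule step by step.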
ABSTRACT
The Molecular Property Diagnostic Suite (MPDS) was conceived and developed as an open-source, disease-specific web portal based on Galaxy. MPDSCOVID-19 was developed for COVID-19 as a one-stop solution for drug discovery research. Galaxy platforms enable the creation of customized workflows connecting various modules in the web server. The architecture of MPDSCOVID-19 employs Galaxy v22.04 features, ported on CentOS 7.8 and Python 3.7. MPDSCOVID-19 is a significant update, arriving after six years, and adds several new tools. Tools developed by our group in Perl/Python and open-source tools are collated and integrated into MPDSCOVID-19 using XML scripts. Our MPDS suite aims to facilitate transparent and open innovation. This approach helps bring inclusiveness to the community while promoting free access and participation in software development. Availability & Implementation: The MPDSCOVID-19 portal can be accessed at https://mpds.neist.res.in:8085/.
ABSTRACT
Computing binding affinities is of great importance in the drug discovery pipeline, and their prediction using advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follow Lipinski's rule, and for diverse clusters of complex structures, thereby highlighting the importance of the PLAS-20k dataset in developing new ML models. Our dataset is also more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, along with trajectories, will form a new synergy, paving the way for accelerating drug discovery.
Subjects
Ligands , Proteins , Drug Discovery , Machine Learning , Protein Binding , Proteins/chemistry , Humans , Animals
ABSTRACT
The discovery of potential therapeutic agents for life-threatening diseases has become a significant problem. There is a need for fast and accurate methods to identify drug-like molecules that can be used as potential candidates for novel targets. Existing techniques like high-throughput screening and virtual screening are time-consuming and inefficient. Traditional molecule generation pipelines are more efficient than virtual screening but use time-consuming docking software. Such docking functions can be emulated using machine learning models with comparable accuracy and faster execution times. However, we find that when pre-trained machine learning models are employed in generative pipelines as oracles, they suffer from model degradation in areas where data is scarce. In this study, we propose an active learning-based model that can be added as a supplement to enhanced molecule generation architectures. The proposed method uses uncertainty sampling on the molecules created by the generator model and dynamically learns as the generator samples molecules from different regions of the chemical space. The proposed framework can generate molecules with high binding affinity with a 70% improvement in runtime compared to the baseline model by labeling only 30% of molecules compared to the baseline oracle.
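The uncertainty-sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an ensemble of surrogate models whose disagreement (standard deviation) is used as the uncertainty measure, and the function name `select_uncertain`, the toy scores, and the 30% labeling fraction used in the call are hypothetical choices for the example.

```python
import numpy as np


def select_uncertain(ensemble_predictions, fraction=0.3):
    """Pick the indices of the most uncertain generated molecules.

    ensemble_predictions has shape (n_models, n_molecules): the score that
    each surrogate model predicts for each molecule. Uncertainty is taken
    as the standard deviation across models (an assumption of this sketch).
    """
    preds = np.asarray(ensemble_predictions, dtype=float)
    uncertainty = preds.std(axis=0)
    n_query = max(1, int(round(fraction * preds.shape[1])))
    # Sort by descending uncertainty; ties keep the lower index first.
    order = np.argsort(-uncertainty, kind="stable")
    return order[:n_query]


# Toy predicted docking scores from two surrogate models for ten molecules.
toy_predictions = [
    [-7.0, -6.0, -9.0, -5.0, -8.0, -7.5, -6.5, -9.5, -5.5, -8.5],
    [-7.0, -8.0, -9.0, -5.0, -8.0, -4.5, -6.5, -9.5, -5.5, -6.5],
]
to_label = select_uncertain(toy_predictions, fraction=0.3)
```

Only the selected molecules would then be sent to the expensive oracle (for example, docking) and added to the surrogate's training set, so the surrogate is refreshed in the regions of chemical space the generator is currently sampling.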
Subjects
High-Throughput Screening Assays , Software
ABSTRACT
Most optimization problems require the user to select an algorithm and, to some extent, also tune it for better performance. Although intuition and knowledge about the problem can speed up these selection and fine-tuning processes, users often use trial-and-error methodologies, which can be time-consuming and inefficient. With this in mind, the concepts of "learned optimizers", "learning to learn", and "meta-learning" have been gathering attention in recent years. In this article, we propose MolOpt, which uses multiagent reinforcement learning (MARL) for autonomous molecular geometry optimization (MGO). Typically, MGO algorithms are hand-designed, but MolOpt uses MARL to learn an optimizer (policy) that can perform MGO without the need for other hand-designed optimizers. We cast MGO as a MARL problem, where each agent corresponds to a single atom in the molecule. MolOpt performs MGO by minimizing the forces on each atom of the molecule. Our experiments demonstrate the generalizing ability of MolOpt for the MGO of propane, pentane, hexane, heptane, and octane when trained on ethane, butane, and isobutane. In terms of performance, MolOpt outperforms the MDMin optimizer and demonstrates performance similar to that of the FIRE optimizer. However, it does not surpass the BFGS optimizer. The results demonstrate that MolOpt has the potential to introduce innovative advancements in MGO by providing a novel approach using reinforcement learning (RL), which may open up new research directions for MGO. Overall, this work serves as a proof-of-concept for the potential of MARL in MGO.
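To make the per-atom, force-driven view of geometry optimization concrete, the sketch below relaxes a toy two-atom molecule by moving each atom along its own force until the largest force falls below a threshold. This is a hand-designed steepest-descent baseline, not MolOpt: the harmonic bond potential, the step size, and the convergence threshold are assumptions made for illustration. In a learned optimizer, the fixed update `step * forces` is the part a per-atom policy would replace.

```python
import numpy as np


def harmonic_forces(positions, k=1.0, r0=1.0):
    """Forces on a toy two-atom molecule joined by a harmonic bond."""
    bond = positions[1] - positions[0]
    r = np.linalg.norm(bond)
    # Force on atom 1 is -dE/dr along the bond; atom 0 feels the opposite.
    f1 = -k * (r - r0) * bond / r
    return np.array([-f1, f1])


def relax(positions, step=0.1, fmax=1e-3, max_steps=1000):
    """Move each atom along its force until the largest force is below fmax."""
    positions = np.array(positions, dtype=float)
    for n_steps in range(max_steps):
        forces = harmonic_forces(positions)
        if np.linalg.norm(forces, axis=1).max() < fmax:
            break
        positions += step * forces
    return positions, n_steps


# Start from a stretched bond (1.5 length units; equilibrium is 1.0).
start = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
final_positions, n_steps = relax(start)
bond_length = np.linalg.norm(final_positions[1] - final_positions[0])
```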
ABSTRACT
Coronavirus, a zoonotic virus capable of transmitting infections from animals to humans, recently emerged as a pandemic. In such circumstances, it is essential to understand the virus's origin. In this study, we present PreHost, a novel machine-learning pipeline for host prediction in the family Coronaviridae. We leverage the complete viral genome and sequences at the protein level (spike protein, membrane protein, and nucleocapsid protein). Compared with the current state-of-the-art approaches, the random forest model attained high accuracy and recall scores of 99.91% and 0.98, respectively, for genome sequences. In addition to the spike protein sequences, our study shows that membrane and nucleocapsid protein sequences can be utilized to predict the host of viruses. We also identified important sites in the viral sequences that help distinguish between different host classes. The host prediction pipeline PreHost will serve as a valuable tool for taking effective measures to control the transmission of future viruses.
ABSTRACT
Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein-ligand interaction scoring function, which evaluates with reasonable accuracy the binding ability of various molecules or ligands with a given protein receptor binding pocket. With the publicly available protein-ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models appears promising, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein-ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide the community with several pointers for the design of binding affinity predictors and data sets for reliable applicability.
ABSTRACT
Herein, we report a new type of carbodicarbene (CDC) comprising two different classes of carbenes, NHC and CAAC, as donor substituents, and compare its molecular structure and coordination to Au(I)Cl with those of NHC-only and CAAC-only analogues. The conjugate acids of these three CDCs exhibit notable redox properties. Their reactions with [NO][SbF6] were investigated. The reduction of the conjugate acid of the CAAC-only based CDC with KC8 results in the formation of hydrogen abstracted/eliminated products, which proceeds through a neutral radical intermediate, detected by EPR spectroscopy. In contrast, the reduction of the conjugate acids of NHC-only and NHC/CAAC based CDCs led to intermolecular reductive (reversible) carbon-carbon sigma bond formation. The resulting relatively elongated carbon-carbon sigma bonds were found to be readily oxidized. They were, thus, demonstrated to be potent reducing agents, underlining their potential utility as organic electron donors and n-dopants in organic semiconductor molecules.
ABSTRACT
The pursuit of potential inhibitors for novel targets has become a very important problem, especially over the last 2 years with the world in the midst of the COVID-19 pandemic. This entails performing high-throughput screening exercises on drug libraries to identify potential "hits". These hits are identified by analyzing their physical properties, such as binding affinity to the target receptor, octanol-water partition coefficient (LogP), and more. However, drug libraries can be extremely large, and it is infeasible to calculate and analyze the physical properties of each of those molecules within an acceptable time. Moreover, each molecule must possess a multitude of properties apart from just the binding affinity. To address this problem, in this study, we propose an extension of the Machine learning framework for Enhanced MolEcular Screening (MEMES) to multi-objective Bayesian optimization. This approach is capable of identifying over 90% of the most desirable molecules with respect to all required properties while explicitly calculating the values of each of those properties on only 6% of the entire drug library. This framework would provide an immense boost in identifying potential hits that possess all properties required for a drug molecule.
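One common way to express "most desirable with respect to all required properties" is Pareto non-dominance. The abstract does not state how the objectives are combined, so the sketch below is an assumption for illustration only: a plain non-dominated filter in which every objective is minimized. The function name `pareto_front` and the toy (docking score, penalty) pairs are hypothetical.

```python
def pareto_front(points):
    """Return indices of non-dominated points, with every objective minimized.

    A point q dominates p when q is no worse than p in every objective and
    differs from p in at least one of them.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q_k <= p_k for q_k, p_k in zip(q, p)) and q != p
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Toy library: (docking score, property penalty); lower is better for both.
library = [(-9.0, 2.0), (-8.0, 1.0), (-7.0, 3.0), (-9.5, 4.0)]
best = pareto_front(library)
```

In a Bayesian optimization loop, a filter like this would typically be applied to surrogate predictions to decide which molecules are worth evaluating explicitly, rather than to the full library of computed properties.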
ABSTRACT
A unique B-N coordinated phenanthroimidazole-based zinc salen was synthesized. The zinc salen thus synthesized acts as a photocatalyst for the cycloaddition of carbon dioxide with terminal epoxides under ambient conditions. DFT study of the cycloaddition of carbon dioxide with terminal epoxide indicates the preference of the reaction pathway when photocatalyzed by zinc salen. We anticipate that this strategy will help to design new photocatalysts for CO2 fixation.
ABSTRACT
Computational methods and, more recently, modern machine learning methods have played a key role in structure-based drug design. Though several benchmarking datasets are available for machine learning applications in virtual screening, accurate prediction of binding affinity for a protein-ligand complex remains a major challenge. New datasets that allow for the development of models that predict binding affinities better than the state-of-the-art scoring functions are important. For the first time, we have developed a dataset, PLAS-5k, comprising 5000 protein-ligand complexes chosen from the PDB database. The dataset consists of binding affinities along with energy components such as electrostatic, van der Waals, polar, and non-polar solvation energy calculated from molecular dynamics simulations using the MMPBSA (Molecular Mechanics Poisson-Boltzmann Surface Area) method. The calculated binding affinities outperformed docking scores and showed a good correlation with the available experimental values. The availability of energy components may enable optimization of desired components during machine learning-based drug design. Further, the OnionNet model has been retrained on the PLAS-5k dataset and is provided as a baseline for the prediction of binding affinities.
Subjects
Molecular Dynamics Simulation , Proteins , Animals , Humans , Ligands , Machine Learning , Protein Binding , Proteins/chemistry
ABSTRACT
Nucleocytoplasmic shuttling of viral elements, supported by several host factors, is essential for the replication of the human immunodeficiency virus (HIV). HIV-1 uses a nuclear RNA export pathway mediated by viral protein Rev to transport its Rev response element (RRE)-containing partially spliced and unspliced transcripts aided by the host nuclear RNA export protein CRM1. The factor(s) interacting with the CRM1-Rev complex are potential antiretroviral target(s) and could serve as a retroviral model system to study nuclear export machinery adapted by these viruses. We earlier reported that cellular Staufen-2 interacts with Rev, facilitating viral-RNA export. Here, we identified the formation of a complex between Staufen-2, CRM1 and Rev. Molecular docking and simulations mapped the interacting residues in the RNA-binding Domain 4 of Staufen-2 as R336 and R337, which were experimentally verified to be critical for interactions among Staufen-2, CRM1 and Rev by mutational analysis. Staufen-2 mutants defective in interaction with CRM1 or Rev failed to supplement the Rev-RNA export activity and viral production, demonstrating the importance of these interactions. Rev-dependent reporter assays and proviral DNA-construct transfection-based studies in Staufen-2 knockout cells in the presence of leptomycin-B (LMB) revealed a significant reduction in CRM1-mediated Rev-dependent RNA export with decreased virus production as compared to Staufen-2 knockout background or LMB treatment alone, suggesting the relevance of these interactions in augmenting RNA export activity of Rev. Our observations provide further insights into the mechanistic intricacies of unspliced viral-RNA export to the cytoplasm and support the notion that abrogating such interactions can reduce HIV-1 proliferation.
Subjects
HIV-1 , Humans , Active Transport, Cell Nucleus , Cell Nucleus/metabolism , Genomics , HIV-1/physiology , Karyopherins/genetics , Karyopherins/metabolism , Molecular Docking Simulation , Nuclear Proteins/genetics , Receptors, Cytoplasmic and Nuclear/genetics , Receptors, Cytoplasmic and Nuclear/metabolism , rev Gene Products, Human Immunodeficiency Virus/genetics , rev Gene Products, Human Immunodeficiency Virus/metabolism , RNA, Nuclear/metabolism , RNA, Viral/genetics , RNA, Viral/metabolism , RNA-Binding Proteins/metabolism
ABSTRACT
Spectroscopy is the study of how matter interacts with electromagnetic radiation. The spectra of any molecule are highly information-rich, yet the inverse relation of spectra to the corresponding molecular structure is still an unsolved problem. Nuclear magnetic resonance (NMR) spectroscopy is one such critical technique in the scientist's toolkit for characterizing molecules. In this work, a novel machine learning framework is proposed that attempts to solve this inverse problem by navigating the chemical space to find the correct structure given an NMR spectrum. The proposed framework uses a combination of online Monte Carlo tree search (MCTS) and a set of graph convolution networks to build a molecule iteratively. Our method can predict the structure of the molecule ∼80% of the time in its top 3 guesses for molecules with <10 heavy atoms. We believe that the proposed framework is a significant step in solving the inverse design problem of NMR spectra.
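The selection step of a generic MCTS can be sketched with the standard UCT rule, which balances the average value of a child node against how rarely it has been visited. The abstract does not say which selection rule the framework uses, so UCT, the exploration constant, and the toy "add atom" actions below are assumptions for illustration only.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCT score: average value plus an exploration bonus."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children maps an action name to (total_value, visits)."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits))


# Hypothetical node in a molecule-building tree after seven visits.
children = {"add C": (3.0, 5), "add N": (1.0, 2), "add O": (0.0, 0)}
first_choice = select_child(children, parent_visits=7)

# Same node if only the two visited actions were available.
visited_only = {"add C": (3.0, 5), "add N": (1.0, 2)}
second_choice = select_child(visited_only, parent_visits=7)
```

In the second call the less-visited action wins despite its lower average value, which is the exploration behaviour the bonus term is meant to produce.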
ABSTRACT
Protein-drug interactions play important roles in many biological processes and therapeutics. Predicting the binding sites of a protein helps to discover such interactions. New drugs can be designed to optimize these interactions, improving protein function. The tertiary structure of a protein decides the binding sites available to the drug molecule, but the determination of the 3D structure is slow and expensive. Conversely, the determination of the amino acid sequence is swift and economical. Although quick and accurate prediction of the binding site using just the sequence is challenging, the application of deep learning, which has been hugely successful in several biochemical tasks, makes it feasible. BiRDS is a residual neural network that predicts the protein's most active binding site using sequence information. SC-PDB, an annotated database of druggable binding sites, is used for training the network. Multiple sequence alignments of the proteins in the database are generated using DeepMSA, and features such as the position-specific scoring matrix, secondary structure, and relative solvent accessibility are extracted. During training, a weighted binary cross-entropy loss function is used to counter the substantial imbalance between the two classes of binding and nonbinding residues. A novel test set, SC6K, is introduced to compare binding-site prediction methods. BiRDS achieves an AUROC score of 0.87, and the centers of 25% of its predicted binding sites lie within 4 Å of the center of the actual binding site.
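A weighted binary cross-entropy loss of the kind mentioned above can be sketched in NumPy as follows. The abstract does not give the weighting scheme, so this sketch assumes a single positive-class weight applied to binding residues; the function name, the toy labels, and the weight of 3 (the ratio of nonbinding to binding residues in the toy example) are illustrative choices.

```python
import numpy as np


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive class scaled by pos_weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log() never sees exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_residue = -(
        pos_weight * y_true * np.log(y_prob)
        + (1.0 - y_true) * np.log(1.0 - y_prob)
    )
    return per_residue.mean()


# Toy protein with four residues, only one of which is a binding residue.
labels = [0, 0, 0, 1]
probabilities = [0.5, 0.5, 0.5, 0.5]
loss = weighted_bce(labels, probabilities, pos_weight=3.0)
```

With this weight, the single binding residue contributes as much to the loss as the three nonbinding residues combined, so a network cannot reach a low loss simply by predicting "nonbinding" everywhere.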
Subjects
Birds , Proteins , Amino Acid Sequence , Animals , Binding Sites , Birds/metabolism , Protein Binding , Protein Structure, Secondary , Proteins/chemistry , Sequence Alignment
ABSTRACT
The discovery of new molecules and materials helps expand the horizons of novel and innovative real-life applications. In pursuit of finding molecules with desired properties, chemists have traditionally relied on experimentation and, more recently, on combinatorial methods to generate new substances, often complemented by computational methods. The sheer size of the chemical space makes it infeasible to search through all possible molecules exhaustively. This calls for fast and efficient methods to navigate the chemical space to find substances with desired properties. This class of problems is referred to as inverse design problems. There are a variety of inverse problems in chemistry encompassing various subfields like drug discovery, retrosynthesis, structure identification, etc. Recent developments in modern machine learning (ML) methods have shown great promise in tackling problems of this kind. This has helped in making major strides in all key phases of molecule discovery, ranging from in silico candidate generation to synthesis, with a focus on small organic molecules. Optimization techniques like Bayesian optimization and reinforcement learning, attention-based transformers, and deep generative models like variational autoencoders and generative adversarial networks form a robust arsenal of methods. This highlight summarizes the development of deep learning to tackle a wide variety of inverse design problems in chemistry in the quest to synthesize small organic compounds with a purpose.
Subjects
Deep Learning , Bayes Theorem , Drug Design , Drug Discovery/methods , Machine Learning
ABSTRACT
The current global health emergency in the form of the coronavirus disease 2019 (COVID-19) pandemic has highlighted the need for fast, accurate, and efficient drug discovery pipelines. Traditional drug discovery projects relying on in vitro high-throughput screening (HTS) involve large investments and sophisticated experimental set-ups, affordable only to big biopharmaceutical companies. In this scenario, applying efficient state-of-the-art computational methods and modern artificial intelligence (AI)-based algorithms to rapidly screen the repurposable chemical space [approved drugs and natural products (NPs) with proven pharmacokinetic profiles] and identify initial leads is a powerful option to save resources and time. Structure-based drug repurposing is a popular in silico repurposing approach. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages of structure-based drug discovery (SBDD) pipelines. Additionally, we highlight the role of generative models in generating molecules with scaffolds from the repurposable chemical space.
Subjects
COVID-19 Drug Treatment , Drug Repositioning , Artificial Intelligence , Drug Discovery , Humans , Pandemics
ABSTRACT
The variability of clinical course and prognosis of COVID-19 highlights the necessity of patient sub-group risk stratification based on clinical data. In this study, clinical data from a cohort of Indian COVID-19 hospitalized patients is used to develop risk stratification and mortality prediction models. We analyzed a set of 70 clinical parameters, including physiological and hematological parameters, to develop machine learning models and identify biomarkers. We also compared the Indian and Wuhan cohorts and analyzed the role of steroids. A bootstrap-averaged ensemble of Bayesian networks was also learned to construct an explainable model for discovering actionable influences on mortality and days to outcome. We discovered blood parameters, diabetes, co-morbidity, and SpO2 levels to be important risk stratification features, whereas mortality prediction depends only on blood parameters. XGBoost and logistic regression models yielded the best performance on risk stratification and mortality prediction, respectively (AUC score 0.83, AUC score 0.92). Blood coagulation parameters (ferritin, D-Dimer and INR) and immune and inflammation parameters (IL6, LDH and Neutrophil (%)) are common features for both risk and mortality prediction. Compared with Wuhan patients, Indian patients with extreme blood parameters showed a higher survival rate. Analyses of medications suggest that a higher proportion of survivors and mild patients who were administered steroids had extreme neutrophil and lymphocyte percentages. The ensemble-averaged Bayesian network structure revealed serum ferritin to be the most important predictor of mortality and Vitamin D to influence severity independent of days to outcome. The findings are important for effective triage during strains on healthcare infrastructure.
Subjects
COVID-19/mortality , Hospitalization/statistics & numerical data , Adolescent , Adult , Aged , Aged, 80 and over , Bayes Theorem , COVID-19/epidemiology , COVID-19/etiology , Child , China/epidemiology , Female , Humans , India/epidemiology , Machine Learning , Male , Middle Aged , Models, Statistical , Risk Assessment/methods , Risk Factors , Young Adult
ABSTRACT
Pattern mining from graph transactional data (GTD) is an active area of research with applications in the domains of bioinformatics, chemical informatics and social networks. Existing works address the problem of mining frequent subgraphs from GTD. However, the knowledge concerning the coverage aspect of a set of subgraphs is also valuable for improving the performance of several applications. In this regard, we introduce the notion of subgraph coverage patterns (SCPs). Given a GTD, a subgraph coverage pattern is a set of subgraphs subject to relative frequency, coverage and overlap constraints provided by the user. We propose the Subgraph ID-based Flat Transactional (SIFT) framework for the efficient extraction of SCPs from a given GTD. Our performance evaluation using three real datasets demonstrates that our proposed SIFT framework is indeed capable of efficiently extracting SCPs from GTD. Furthermore, we demonstrate the effectiveness of SIFT through a case study in computer-aided drug design.
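The coverage and overlap constraints can be illustrated on a flat representation in which each subgraph ID maps to the set of graph transactions that contain it. The definitions used below are assumptions borrowed from the general coverage-pattern literature rather than taken from this abstract: coverage is the fraction of graphs containing at least one subgraph of the pattern, and the overlap ratio is the fraction of the least frequent subgraph's graphs that are already covered by the others. The function name and toy data are hypothetical.

```python
def coverage_and_overlap(pattern, occurrences, n_graphs):
    """Return (coverage, overlap ratio) for a set of subgraph IDs.

    occurrences maps each subgraph ID to the set of graph IDs containing it.
    """
    # Order subgraphs from most to least frequent.
    ordered = sorted(pattern, key=lambda sid: -len(occurrences[sid]))
    covered = set()
    for sid in ordered[:-1]:
        covered |= occurrences[sid]
    last = occurrences[ordered[-1]]
    overlap_ratio = len(covered & last) / len(last)
    coverage = len(covered | last) / n_graphs
    return coverage, overlap_ratio


# Toy database of 10 graphs and three candidate subgraphs.
occurrences = {"s1": {1, 2, 3, 4}, "s2": {4, 5, 6}, "s3": {7}}
coverage, overlap = coverage_and_overlap(["s1", "s2"], occurrences, n_graphs=10)
```

A user-specified pattern would then be accepted only if each subgraph meets the minimum relative frequency, the coverage is high enough, and the overlap ratio stays below its threshold.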
ABSTRACT
There has been tremendous advancement in machine learning (ML) applications in computational chemistry, particularly in neural network potentials (NNPs). NNPs can approximate the potential energy surface (PES) as a high-dimensional function by learning from existing reference data, thereby circumventing the need to solve the electronic Schrödinger equation explicitly. As a result, ML accelerates chemical space exploration and property prediction compared to quantum mechanical methods. Novel ML methods have the potential to provide efficient means for predicting the properties of molecules. However, this potential has been limited by the lack of standard comparative evaluations. In this work, we compare four selected models, that is, ANI, PhysNet, SchNet, and BAND-NN, developed to represent the PES of small organic molecules. We evaluate these models for their accuracy and transferability on two different test sets: (i) small organic molecules of up to eight heavy atoms, on which ANI and SchNet achieve root mean square errors (RMSE) of 0.55 and 0.60 kcal/mol, respectively; and (ii) a random selection of molecules with 10 heavy atoms from the GDB-11 database, on which ANI achieves an RMSE of 1.17 kcal/mol and SchNet achieves an RMSE of 1.89 kcal/mol. We examine their ability to produce smooth, meaningful surfaces by performing PES scans for bond stretch, angle bend, and dihedral rotations on relatively large molecules to assess their possible application in molecular dynamics simulations. We also evaluate their performance in yielding minimum energy structures via geometry optimization using various minimization algorithms. All these models were also able to accurately differentiate the isomers of the same empirical formula C10H20. ANI and PhysNet achieve RMSEs of 0.29 and 0.52 kcal/mol, respectively, on C10H20 isomers.
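The RMSE metric quoted above can be computed as in the short sketch below. The energies are made-up toy values in kcal/mol, not data from the study, and the variable names are illustrative.

```python
import numpy as np


def rmse(predicted, reference):
    """Root mean square error between predicted and reference energies."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))


# Toy relative energies (kcal/mol) for four conformers.
reference_energies = [0.0, 1.0, 2.0, 3.0]
predicted_energies = [0.5, 0.5, 2.5, 2.5]
toy_rmse = rmse(predicted_energies, reference_energies)
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones, which is worth keeping in mind when comparing models on test sets of different difficulty.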