1.
J Chem Inf Model ; 63(15): 4497-4504, 2023 08 14.
Article in English | MEDLINE | ID: mdl-37487018

ABSTRACT

Machine-learning and deep-learning models have been extensively used in cheminformatics to predict molecular properties, to reduce the need for direct measurements, and to accelerate compound prioritization. However, different setups and frameworks and the large number of molecular representations make it difficult to properly evaluate, reproduce, and compare them. Here we present a new PREdictive modeling FramEwoRk for molecular discovery (PREFER), written in Python (version 3.7.7) and based on AutoSklearn (version 0.14.7), that allows comparison between different molecular representations and common machine-learning models. We provide an overview of the design of our framework and show exemplary use cases and results of several representation-model combinations on diverse data sets, both public and in-house. Finally, we discuss the use of PREFER on small data sets. The code of the framework is freely available on GitHub.


Subjects
Cheminformatics , Machine Learning
2.
Bioorg Med Chem Lett ; 64: 128667, 2022 05 15.
Article in English | MEDLINE | ID: mdl-35276359

ABSTRACT

Inhibition of mutant activin A type-1 receptor ACVR1 (ALK2) signaling by small-molecule drugs is a promising therapeutic approach to treat fibrodysplasia ossificans progressiva (FOP), an ultra-rare disease leading to progressive soft tissue heterotopic ossification with no curative treatment available to date. Here, we describe the synthesis and in vitro characterization of a novel series of 2-aminopyrazine-3-carboxamides that led to the discovery of Compound 23 showing excellent biochemical and cellular potency, selectivity over other BMP and TGFβ signaling receptor kinases, and a favorable in vitro ADME profile.


Subjects
Myositis Ossificans , Heterotopic Ossification , Type I Activin Receptors , Humans , Myositis Ossificans/drug therapy , Pyrazines/pharmacology , Pyrazines/therapeutic use , Signal Transduction
3.
J Chem Inf Model ; 62(23): 6002-6021, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-36351293

ABSTRACT

In the drug development process, optimization of properties and biological activities of small molecules is an important task to obtain drug candidates with optimal efficacy when first applied in subsequent clinical studies. However, despite its importance, large-scale investigations of the optimization process in early drug discovery are lacking, likely due to the absence of historical records of different chemical series used in past projects. Here, we report a retrospective reconstruction of ∼3000 chemical series from the Novartis compound database, which allows us to characterize the general properties of chemical series as well as the time evolution of structural properties, ADMET properties, and target activities. Our data-driven approach allows us to substantiate common MedChem knowledge. We find that size, fraction of sp3-hybridized carbon atoms (Fsp3), and the density of stereocenters tend to increase during optimization, while the aromaticity of the compounds decreases. On the ADMET side, solubility tends to increase and permeability decreases, while safety-related properties tend to improve. Importantly, while ligand efficiency decreases due to molecular growth over time, target activities and lipophilic efficiency tend to improve. This emphasizes the heavy-atom count and log D as important parameters to monitor, especially as we further show that the decrease in permeability can be explained with the increase in molecular size. We highlight overlaps, shortcomings, and differences of the computationally reconstructed chemical series compared to the series used in recent internal drug discovery projects and investigate the relation to historical projects.
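The opposing trends in ligand efficiency and lipophilic efficiency described in this abstract can be illustrated with a short worked example. This is a minimal sketch that assumes the commonly used definitions LE = 1.37 × pIC50 / heavy-atom count (in kcal/mol per heavy atom) and LipE = pIC50 − log D; the abstract does not state which exact definitions the authors used, and the two compounds below are hypothetical.

```python
def ligand_efficiency(pic50: float, heavy_atoms: int) -> float:
    """Approximate binding free energy per heavy atom, in kcal/mol.

    Assumes 1.37 kcal/mol per log unit (2.303 * R * T near 300 K).
    """
    return 1.37 * pic50 / heavy_atoms


def lipophilic_efficiency(pic50: float, log_d: float) -> float:
    """Potency corrected for lipophilicity: pIC50 - log D."""
    return pic50 - log_d


# Hypothetical early hit and later optimized compound from one series.
hit = {"pic50": 6.0, "heavy_atoms": 20, "log_d": 3.0}
lead = {"pic50": 8.0, "heavy_atoms": 32, "log_d": 2.5}

for name, cpd in (("hit", hit), ("lead", lead)):
    le = ligand_efficiency(cpd["pic50"], cpd["heavy_atoms"])
    lipe = lipophilic_efficiency(cpd["pic50"], cpd["log_d"])
    print(f"{name}: LE = {le:.2f}, LipE = {lipe:.1f}")
```

In this made-up pair, potency rises by two log units but the molecule gains twelve heavy atoms, so LE falls while LipE rises, which is the pattern the abstract reports for real series.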


Subjects
Drug Discovery , Retrospective Studies , Ligands , Solubility , Factual Databases
4.
J Chem Inf Model ; 61(6): 2623-2640, 2021 06 28.
Article in English | MEDLINE | ID: mdl-34100609

ABSTRACT

Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
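The core idea of shifting the decision threshold without retraining can be sketched in a few lines. The example below is not the GHOST procedure itself; it only scans a list of candidate thresholds over already predicted probabilities and keeps the one with the highest Cohen's kappa. The choice of kappa as the selection metric, the function names, and the toy data are assumptions made for illustration.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (0/1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def best_threshold(y_true, probs, thresholds):
    """Return (threshold, kappa) for the candidate with the highest kappa."""
    best_t, best_k = None, float("-inf")
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k


# Toy imbalanced set: 8 inactives, 2 actives; the model never exceeds 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.38, 0.45]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]

threshold, kappa = best_threshold(y_true, probs, thresholds)
print(f"best threshold = {threshold}, kappa = {kappa:.3f}")
```

With the default threshold of 0.5 every compound in this toy set is predicted inactive and kappa is zero; a lower threshold recovers both actives at the cost of one false positive.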


Subjects
Algorithms , Machine Learning
5.
J Chem Inf Model ; 60(6): 2888-2902, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32374165

ABSTRACT

We investigate different automated approaches for the classification of chemical series in early drug discovery, with the aim of closely mimicking human chemical series conception. Chemical series, which are commonly defined by hand-drawn scaffolds, organize datasets in drug discovery projects. Often, they form the basis for further project decisions. To trace and evaluate these decisions in historic and ongoing projects, it is important to know or reconstruct chemical series. There is no unique correct definition of chemical series, and the human definition certainly involves a subjective bias. Hence, we first develop quality metrics for the chemical series definitions, evaluating the size and specificity of chemical series. These metrics are applied to categorize human series definitions and implemented in automated classification approaches. For the automated classification of chemical series, we test different fragmentation and similarity-based clustering algorithms and apply different approaches to infer series definitions from these clusters or sets of fragments. We benchmark the classification results against human-defined series from 30 internal projects. The best results in reproducing the composition of human-defined series are achieved when applying UPGMA (unweighted pair group method with arithmetic mean) clustering to the project dataset and calculating maximum common substructures of the clusters as series definitions. We evaluate this approach in more detail on a public dataset and assess its robustness by 10-fold cross-validation, each time sampling 40% of the dataset. Through these benchmarking and validation experiments, we show that the proposed automated approach is able to accurately and robustly identify human-defined series, which comply with a certain, predefined level of specificity and size. Suggesting a thoroughly tested algorithm for series classification, as well as quality metrics for series and several benchmarking approaches, this work lays the foundation for further analysis of project decisions, and it offers an enhanced understanding of the properties of human-defined chemical series.
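The UPGMA clustering step named in this abstract can be sketched on a precomputed distance matrix. This is a simplified illustration, not the authors' pipeline: the distance values are invented, the cutoff-based stopping rule is an assumption, and the subsequent maximum-common-substructure step (which would need a cheminformatics toolkit) is not shown.

```python
def upgma(dist, cutoff):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    Clusters are merged until the closest pair is farther apart than `cutoff`.
    Returns clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Unweighted mean over all member pairs of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical pairwise distances (e.g. 1 - fingerprint similarity).
dist = [
    [0.00, 0.20, 0.30, 0.80, 0.90],
    [0.20, 0.00, 0.25, 0.85, 0.80],
    [0.30, 0.25, 0.00, 0.90, 0.85],
    [0.80, 0.85, 0.90, 0.00, 0.10],
    [0.90, 0.80, 0.85, 0.10, 0.00],
]

print(upgma(dist, cutoff=0.5))
```

Each resulting cluster would then be summarized by a common substructure to serve as the series definition, as the abstract describes.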


Subjects
Algorithms , Benchmarking , Cluster Analysis , Humans
6.
J Chem Inf Model ; 60(7): 3331-3335, 2020 07 27.
Article in English | MEDLINE | ID: mdl-32584031

ABSTRACT

We present an implementation of the scaffold network in the open source cheminformatics toolkit RDKit. Scaffold networks have been introduced in the literature as a powerful method to navigate and analyze large screening data sets in medicinal chemistry. Such a network can be created by iteratively applying predefined fragmentation rules to the investigated set of small molecules and by linking the produced fragments according to their descendence. This procedure results in a network graph, where the nodes correspond to the fragments and the edges correspond to the operations producing one fragment from another. In extension to the scaffold network implementations suggested in the literature, the presented implementation in RDKit allows an enhanced flexibility in terms of customizing the fragmentation rules and enables the inclusion of atom- and bond-generic scaffolds into the network. The output, providing node and edge information on the network, enables a simple and elegant navigation through the network, laying the basis to organize and better understand the data set being investigated.
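The construction described in this abstract, iteratively applying fragmentation rules and linking each fragment to the fragments derived from it, can be shown with a toy data structure. The sketch below does not call RDKit and does not reproduce its fragmentation rules; molecules are represented as hypothetical sets of ring labels, and the single rule "drop one ring" is an assumption chosen only to make the node and edge bookkeeping visible.

```python
from collections import deque


def remove_one_ring(fragment):
    """Toy fragmentation rule: yield every child obtained by dropping one ring."""
    if len(fragment) <= 1:
        return
    for ring in fragment:
        yield fragment - {ring}


def build_network(molecules, rule):
    """Apply `rule` repeatedly and record parent -> child edges."""
    nodes, edges = set(), set()
    queue = deque(frozenset(m) for m in molecules)
    while queue:
        parent = queue.popleft()
        if parent in nodes:
            continue  # already expanded via another molecule or parent
        nodes.add(parent)
        for child in rule(parent):
            edges.add((parent, child))
            queue.append(child)
    return nodes, edges


# Two hypothetical molecules that share part of their ring systems.
molecules = [
    {"benzene", "pyridine", "piperidine"},
    {"benzene", "pyridine"},
]

nodes, edges = build_network(molecules, remove_one_ring)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because shared fragments collapse into a single node, the second molecule adds nothing new here, which is what makes such a network useful for navigating related compounds.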


Subjects
Cheminformatics , Software , Pharmaceutical Chemistry
7.
J Chem Inf Model ; 59(4): 1347-1356, 2019 04 22.
Article in English | MEDLINE | ID: mdl-30908913

ABSTRACT

Several recent reports have shown that long short-term memory generative neural networks (LSTM) of the type used for grammar learning efficiently learn to write Simplified Molecular Input Line Entry System (SMILES) of druglike compounds when trained with SMILES from a database of bioactive compounds such as ChEMBL and can later produce focused sets upon transfer learning with compounds of specific bioactivity profiles. Here we trained an LSTM using molecules taken either from ChEMBL, DrugBank, commercially available fragments, or from FDB-17 (a database of fragments up to 17 atoms) and performed transfer learning to a single known drug to obtain new analogs of this drug. We found that this approach readily generates hundreds of relevant and diverse new drug analogs and works best with training sets of around 40,000 compounds as simple as commercial fragments. These data suggest that fragment-based LSTM offer a promising method for new molecule generation.


Subjects
Cheminformatics/methods , Computer Neural Networks , Pharmaceutical Preparations/chemistry , Molecular Models , Molecular Conformation
8.
Chimia (Aarau) ; 73(12): 1001-1005, 2019 Dec 18.
Article in English | MEDLINE | ID: mdl-31883551

ABSTRACT

Machine Learning and Data Science have enjoyed a renaissance due to the availability of increased computational power and larger data sets. Many questions can now be asked and answered that were previously beyond our scope. This does not translate instantly into new tools that can be used by those not skilled in the field, as many of the issues and traps still exist. In this paper, we look at some of the new tools that we have created, and some of the difficulties that still need to be taken care of during the transition from a project run by an expert to a tool for the bench chemist.

9.
J Chem Inf Model ; 58(1): 165-181, 2018 01 22.
Article in English | MEDLINE | ID: mdl-29172519

ABSTRACT

A novel alignment-free molecular descriptor called xMaP (flexible MaP descriptor) is introduced. The descriptor is the advancement of the previously published translationally and rotationally invariant three-dimensional (3D) descriptor MaP (mapping property distributions onto the molecular surface) to the fourth dimension (4D). In addition to MaP, xMaP is independent of the chosen starting conformation of the encoded molecules and is therefore entirely alignment-free. This is achieved by using ensembles of conformers, which are generated by conformational searches. This step of the procedure is similar to Hopfinger's 4D quantitative structure-activity relationship (QSAR). A five-step procedure is used to compute the xMaP descriptor. First, a conformational search for each molecule is carried out. Next, for each of the conformers an approximation to the molecular surface with equally distributed surface points is computed. Third, molecular properties are projected onto this surface. Fourth, areas of identical properties are clustered to so-called patches. Fifth, the spatial distribution of the patches is converted into an alignment-free descriptor that is based on the entire conformer ensemble. The resulting descriptor can be interpreted by superimposing the most important descriptor variables and the molecules of the data set. The most important descriptor variables are identified with chemometric regression tools. The novel descriptor was applied to several benchmark data sets and was compared to other descriptors and QSAR techniques comprising a binary fingerprint, a topological pharmacophore descriptor (Cats2D), and the field-based 3D-QSAR technique GRID/PLS which is alignment-dependent. The use of conformer ensembles renders xMaP very robust. It turns out that xMaP performs very well on (almost) all data sets and that the statistical results are comparable to GRID/PLS. In addition, xMaP can also be used to efficiently visualize the derived quantitative structure-activity relationships.


Subjects
Quantitative Structure-Activity Relationship , Algorithms , Hydrophobic and Hydrophilic Interactions , Molecular Models , Molecular Structure , Reproducibility of Results , Surface Properties
10.
J Chem Inf Model ; 57(8): 1816-1831, 2017 08 28.
Article in English | MEDLINE | ID: mdl-28715190

ABSTRACT

Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between them to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.


Subjects
Data Mining/methods , Chemical Databases , Algorithms
12.
J Chem Inf Model ; 56(12): 2336-2346, 2016 12 27.
Article in English | MEDLINE | ID: mdl-28024398

ABSTRACT

When analyzing chemical reactions it is essential to know which molecules are actively involved in the reaction and which educts will form the product molecules. Assigning reaction roles, like reactant, reagent, or product, to the molecules of a chemical reaction might be a trivial problem for hand-curated reaction schemes but it is more difficult to automate, an essential step when handling large amounts of reaction data. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles which is also applicable to rather unbalanced and noisy reaction schemes. Given a set of molecules involved and knowing the product(s) of a reaction we assign the most probable reactants and sort out the remaining reagents. Our approach was validated using two different data sets: one hand-curated data set comprising about 680 diverse reactions extracted from patents which span more than 200 different reaction types and include up to 18 different reactants. A second set consists of 50 000 randomly picked reactions from US patents. The results of the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets our method assigns the reaction roles correctly for the vast majority of the reactions, achieving an accuracy of 88% and 97% respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit and the data sets and Jupyter notebooks used for evaluation of the new method are available in the Supporting Information of this publication.


Subjects
Drug Discovery , Chemical Models , Software , Algorithms , Chemical Databases , Drug Discovery/methods , Indicators and Reagents/chemistry , Patents as Topic
13.
J Chem Inf Model ; 55(4): 896-908, 2015 Apr 27.
Article in English | MEDLINE | ID: mdl-25816021

ABSTRACT

Communication of data and ideas within a medicinal chemistry project on a global as well as local level is a crucial aspect in the drug design cycle. Over a time frame of eight years, we built and optimized FOCUS, a platform to produce, visualize, and share information on various aspects of a drug discovery project such as cheminformatics, data analysis, structural information, and design. FOCUS is tightly integrated with internal services that involve, among others, data retrieval systems and in silico models, and provides easy access to automated modeling procedures such as pharmacophore searches, R-group analysis, and similarity searches. In addition, an interactive 3D editor was developed to assist users in the generation and docking of close analogues of a known lead. In this paper, we will specifically concentrate on issues we faced during development, deployment, and maintenance of the software and how we continually adapted the software in order to improve usability. We will provide usage examples to highlight the functionality as well as limitations of FOCUS at the various stages of the development process. We aim to make the discussion as independent of the software platform as possible, so that our experiences can be of more general value to the drug discovery community.


Subjects
Pharmaceutical Chemistry/methods , Communication , Computer Simulation , Drug Discovery/methods , Computational Biology , Ligands
14.
J Med Chem ; 67(2): 1544-1562, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38175811

ABSTRACT

NLRP3 is a molecular sensor recognizing a wide range of danger signals. Its activation leads to the assembly of an inflammasome that allows for activation of caspase-1 and subsequent maturation of IL-1β and IL-18, as well as cleavage of gasdermin D and pyroptotic cell death. The NLRP3 inflammasome has been implicated in a plethora of diseases including gout, type 2 diabetes, atherosclerosis, Alzheimer's disease, and cancer. In this publication, we describe the discovery of a novel, tricyclic, NLRP3-binding scaffold by high-throughput screening. The hit (1) could be optimized into an advanced compound NP3-562 demonstrating excellent potency in human whole blood and full inhibition of IL-1β release in a mouse acute peritonitis model at 30 mg/kg po dose. An X-ray structure of NP3-562 bound to the NLRP3 NACHT domain revealed a unique binding mode as compared to the known sulfonylurea-based inhibitors. In addition, NP3-562 also shows a good overall development profile.


Subjects
Type 2 Diabetes Mellitus , Gout , Mice , Animals , Humans , NLR Family Pyrin Domain-Containing 3 Protein/metabolism , Inflammasomes/metabolism , Type 2 Diabetes Mellitus/metabolism , Macrophages/metabolism , Interleukin-1beta/metabolism , Caspase 1/metabolism
16.
ACS Omega ; 8(2): 2046-2056, 2023 Jan 17.
Article in English | MEDLINE | ID: mdl-36687099

ABSTRACT

Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.
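The two evaluation ingredients named in this abstract, a scaffold split and the mean absolute error, can be sketched without any cheminformatics library. This is a minimal illustration under stated assumptions: scaffolds are given as precomputed string labels, the greedy "largest groups to training first" rule is one common convention rather than the authors' documented protocol, and all names and values are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, n_test):
    """Return (train_idx, test_idx) such that no scaffold occurs in both sets.

    `n_test` is the minimum number of compounds to hold out for testing.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups go to training first; groups that no longer fit are held out.
    ordered = sorted(groups.values(), key=len, reverse=True)
    max_train = len(scaffolds) - n_test
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= max_train:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, here in log units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical scaffold labels for ten compounds.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = scaffold_split(scaffolds, n_test=2)
print("train:", train_idx, "test:", test_idx)

# Hypothetical reference and predicted log P values for the test compounds.
print("MAE:", mean_absolute_error([2.0, 3.0], [2.5, 2.5]))
```

Keeping whole scaffold groups on one side of the split means the test compounds are structurally unlike the training compounds, which generally gives a harder and more realistic error estimate than a random split.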

17.
Nat Commun ; 14(1): 6651, 2023 10 31.
Article in English | MEDLINE | ID: mdl-37907461

ABSTRACT

The lead optimization process in drug discovery campaigns is an arduous endeavour where the input of many medicinal chemists is weighed in order to reach a desired molecular property profile. Building the expertise to successfully drive such projects collaboratively is a very time-consuming process that typically spans many years within a chemist's career. In this work we aim to replicate this process by applying artificial intelligence learning-to-rank techniques on feedback that was obtained from 35 chemists at Novartis over the course of several months. We exemplify the usefulness of the learned proxies in routine tasks such as compound prioritization, motif rationalization, and biased de novo drug design. Annotated response data is provided, and developed models and code made available through a permissive open-source license.
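Learning to rank from pairwise feedback, the technique this abstract refers to, can be illustrated with a very small model. The sketch below assumes a linear scoring function trained with a logistic pairwise loss; the abstract does not specify the model the authors used, and the two descriptors, the three molecules, and the preferences are invented for the example.

```python
import math


def score(weights, features):
    """Linear score: higher means more preferred."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred items score higher (logistic pairwise loss)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            # Probability the model currently assigns to the observed preference.
            prob = 1.0 / (1.0 + math.exp(-score(weights, diff)))
            # Gradient ascent on log(prob): larger step when the model disagrees.
            weights = [w + lr * (1.0 - prob) * d for w, d in zip(weights, diff)]
    return weights


# Hypothetical molecules described by two made-up descriptors.
mols = {"a": [0.9, 0.2], "b": [0.5, 0.5], "c": [0.2, 0.9]}

# Each tuple means "the first molecule was preferred over the second".
pairs = [
    (mols["a"], mols["b"]),
    (mols["b"], mols["c"]),
    (mols["a"], mols["c"]),
]

weights = train_pairwise(pairs, n_features=2)
ranking = sorted(mols, key=lambda name: score(weights, mols[name]), reverse=True)
print(ranking)
```

Once trained, the scoring function can rank molecules that no chemist has compared, which is what makes it usable for tasks such as compound prioritization.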


Subjects
Artificial Intelligence , Pharmaceutical Chemistry , Pharmaceutical Chemistry/methods , Intuition , Drug Discovery/methods , Drug Design , Machine Learning
18.
J Med Chem ; 66(20): 14047-14060, 2023 10 26.
Article in English | MEDLINE | ID: mdl-37815201

ABSTRACT

Early in silico assessment of the potential of a series of compounds to deliver a drug is one of the major challenges in computer-assisted drug design. The goal is to identify the right chemical series of compounds out of a large chemical space to then subsequently prioritize the molecules with the highest potential to become a drug. Although multiple approaches to assess compounds have been developed over decades, the quality of these predictors is often not good enough and compounds that agree with the respective estimates are not necessarily druglike. Here, we report a novel deep learning approach that leverages large-scale predictions of ∼100 ADMET assays to assess the potential of a compound to become a relevant drug candidate. The resulting score, which we termed bPK score, substantially outperforms previous approaches and showed strong discriminative performance on data sets where previous approaches did not.


Subjects
Computer Simulation
19.
J Cheminform ; 15(1): 119, 2023 Dec 11.
Article in English | MEDLINE | ID: mdl-38082357

ABSTRACT

Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits which more accurately reflect the differences in properties and machine-learning performance observed for temporal splits than other standard approaches like random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 99 public data sets which can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.
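For reference, the temporal split that SIMPD is designed to imitate can be written down directly when registration dates are available. The sketch below shows only that baseline, not the SIMPD genetic algorithm; the record layout, the function name, and the dates are hypothetical.

```python
from datetime import date


def time_split(records, n_train):
    """Train on the `n_train` earliest compounds, test on everything later.

    `records` is a list of (compound_id, registration_date) tuples.
    """
    ordered = sorted(records, key=lambda rec: rec[1])
    train = [cid for cid, _ in ordered[:n_train]]
    test = [cid for cid, _ in ordered[n_train:]]
    return train, test


# Hypothetical project compounds with registration dates.
records = [
    ("c1", date(2019, 3, 1)),
    ("c2", date(2018, 1, 15)),
    ("c3", date(2020, 7, 9)),
    ("c4", date(2018, 11, 2)),
    ("c5", date(2021, 2, 20)),
]

train_ids, test_ids = time_split(records, n_train=3)
print("train:", train_ids, "test:", test_ids)
```

Public data sets usually lack such dates, which is the gap the abstract says SIMPD addresses by constructing splits with similar early-versus-late differences.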

20.
Mol Inform ; 41(6): e2100277, 2022 06.
Article in English | MEDLINE | ID: mdl-34964302

ABSTRACT

The ability to predict chemical reactivity of a molecule is highly desirable in drug discovery, both ex vivo (synthetic route planning, formulation, stability) and in vivo: metabolic reactions determine pharmacodynamics, pharmacokinetics and potential toxic effects, and early assessment of liabilities is vital to reduce attrition rates in later stages of development. Quantum mechanics offers a precise description of the interactions between electrons and orbitals in the breaking and forming of new bonds. Modern algorithms and faster computers have allowed the study of more complex systems in a punctual and accurate fashion, and answers to chemical questions around stability and reactivity can now be provided. Through machine learning, predictive models can be built out of descriptors derived from quantum mechanics and cheminformatics, even in the absence of experimental data to train on. In this article, current progress on computational reactivity prediction is reviewed: applications to problems in drug design, such as modelling of metabolism and covalent inhibition, are highlighted and unmet challenges are posed.


Subjects
Cheminformatics , Machine Learning , Algorithms , Drug Design , Drug Discovery/methods