1 - 18 of 18
1.
J Chem Inf Model ; 64(7): 2331-2344, 2024 Apr 08.
Article En | MEDLINE | ID: mdl-37642660

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
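
The MELLODDY federated platform itself is not reproduced here; the sketch below only illustrates the underlying multitask setup (a sparse activity-label matrix with many missing entries over a shared fingerprint matrix) on synthetic data, using one independent scikit-learn model per task rather than the project's federated multitask learner.

```python
# Minimal single-site sketch of multitask activity modelling on fingerprints.
# This is NOT the MELLODDY platform; it only illustrates the per-task setup
# (a sparse label matrix with many missing entries) on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_compounds, n_bits, n_tasks = 500, 256, 5

X = rng.integers(0, 2, size=(n_compounds, n_bits))          # fingerprint matrix
Y = np.full((n_compounds, n_tasks), np.nan)                  # sparse task labels
for t in range(n_tasks):
    labelled = rng.choice(n_compounds, size=200, replace=False)
    Y[labelled, t] = rng.integers(0, 2, size=labelled.size)  # 0/1 activity calls

models = {}
for t in range(n_tasks):
    mask = ~np.isnan(Y[:, t])                                # only labelled rows
    models[t] = LogisticRegression(max_iter=1000).fit(X[mask], Y[mask, t])

# Predict all tasks for a new compound (here: the first training fingerprint).
profile = {t: m.predict_proba(X[:1])[0, 1] for t, m in models.items()}
print(profile)
```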


Benchmarking; Quantitative Structure-Activity Relationship; Biological Assay; Machine Learning
2.
J Med Chem ; 66(20): 14047-14060, 2023 10 26.
Article En | MEDLINE | ID: mdl-37815201

Early in silico assessment of the potential of a series of compounds to deliver a drug is one of the major challenges in computer-assisted drug design. The goal is to identify the right chemical series of compounds out of a large chemical space and then prioritize the molecules with the highest potential to become a drug. Although multiple approaches to assess compounds have been developed over decades, the quality of these predictors is often not good enough, and compounds that agree with the respective estimates are not necessarily druglike. Here, we report a novel deep learning approach that leverages large-scale predictions of ∼100 ADMET assays to assess the potential of a compound to become a relevant drug candidate. The resulting score, which we termed bPK score, substantially outperforms previous approaches and showed strong discriminative performance on data sets where previous approaches did not.
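
The bPK model itself is not reproduced here. As a heavily simplified sketch of the general idea (feeding a panel of predicted ADMET endpoints into a single classifier that outputs one aggregate score), assuming synthetic predictions and labels and using gradient boosting instead of the paper's deep learning model:

```python
# Hypothetical sketch: combining a panel of predicted ADMET endpoints into a
# single aggregate score with a simple classifier. This is not the published
# bPK model; data, labels, and endpoint count are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_compounds, n_assays = 1000, 100                 # ~100 predicted ADMET endpoints
admet_preds = rng.normal(size=(n_compounds, n_assays))
# Placeholder label: 1 = "advanced/drug-like" compound, 0 = reference compound.
labels = (admet_preds[:, :10].mean(axis=1)
          + 0.5 * rng.normal(size=n_compounds) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0)
auc = cross_val_score(clf, admet_preds, labels, cv=5, scoring="roc_auc")
print("cross-validated ROC AUC:", auc.mean().round(3))

clf.fit(admet_preds, labels)
scores = clf.predict_proba(admet_preds[:5])[:, 1]  # one aggregate score per compound
print("scores for first 5 compounds:", scores.round(2))
```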


Computer Simulation
3.
J Chem Inf Model ; 63(15): 4497-4504, 2023 08 14.
Article En | MEDLINE | ID: mdl-37487018

Machine-learning and deep-learning models have been extensively used in cheminformatics to predict molecular properties, to reduce the need for direct measurements, and to accelerate compound prioritization. However, different setups and frameworks and the large number of molecular representations make it difficult to properly evaluate, reproduce, and compare them. Here we present a new PREdictive modeling FramEwoRk for molecular discovery (PREFER), written in Python (version 3.7.7) and based on AutoSklearn (version 0.14.7), that allows comparison between different molecular representations and common machine-learning models. We provide an overview of the design of our framework and show exemplary use cases and results of several representation-model combinations on diverse data sets, both public and in-house. Finally, we discuss the use of PREFER on small data sets. The code of the framework is freely available on GitHub.
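
The sketch below is not the PREFER API; it only illustrates the kind of comparison PREFER automates, i.e. evaluating different molecular representations with a common model type, using RDKit and scikit-learn on a handful of placeholder molecules and property values.

```python
# Generic sketch (not the PREFER API): comparing two molecular representations
# with the same model type. SMILES and property values are small placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC",
          "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "c1ccc2c(c1)cccc2",
          "CC(=O)OC1=CC=CC=C1C(=O)O", "CCCCCC", "OCC(O)CO", "CN1CCC[C@H]1c1cccnc1"]
y = np.array([0.1, 0.8, 1.2, 0.3, 2.5, 2.1, 1.1, 1.9, -0.5, 0.9])  # placeholder property

mols = [Chem.MolFromSmiles(s) for s in smiles]

def morgan_fp(mol, n_bits=1024):
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

def physchem(mol):
    return np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
                     Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)])

representations = {"morgan": np.array([morgan_fp(m) for m in mols]),
                   "physchem": np.array([physchem(m) for m in mols])}

for name, X in representations.items():
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.2f}")
```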


Cheminformatics; Machine Learning
4.
J Chem Inf Model ; 62(23): 6002-6021, 2022 Dec 12.
Article En | MEDLINE | ID: mdl-36351293

In the drug development process, optimizing the properties and biological activities of small molecules is an important task for obtaining drug candidates with optimal efficacy in subsequent clinical studies. However, despite its importance, large-scale investigations of the optimization process in early drug discovery are lacking, likely due to the absence of historical records of different chemical series used in past projects. Here, we report a retrospective reconstruction of ∼3000 chemical series from the Novartis compound database, which allows us to characterize the general properties of chemical series as well as the time evolution of structural properties, ADMET properties, and target activities. Our data-driven approach allows us to substantiate common MedChem knowledge. We find that size, fraction of sp3-hybridized carbon atoms (Fsp3), and the density of stereocenters tend to increase during optimization, while the aromaticity of the compounds decreases. On the ADMET side, solubility tends to increase and permeability decreases, while safety-related properties tend to improve. Importantly, while ligand efficiency decreases due to molecular growth over time, target activities and lipophilic efficiency tend to improve. This emphasizes the heavy-atom count and log D as important parameters to monitor, especially as we further show that the decrease in permeability can be explained by the increase in molecular size. We highlight overlaps, shortcomings, and differences of the computationally reconstructed chemical series compared to the series used in recent internal drug discovery projects and investigate the relation to historical projects.
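
The structural properties tracked in this analysis can be computed directly with RDKit; a small sketch for hypothetical example molecules (not project compounds):

```python
# Sketch of the per-compound structural properties discussed above, computed
# with RDKit for two placeholder molecules (not actual project compounds).
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

def series_profile(smiles):
    mol = Chem.MolFromSmiles(smiles)
    heavy = mol.GetNumHeavyAtoms()
    n_stereo = len(Chem.FindMolChiralCenters(mol, includeUnassigned=True))
    n_aromatic = sum(a.GetIsAromatic() for a in mol.GetAtoms())
    return {
        "heavy_atoms": heavy,
        "fsp3": rdMolDescriptors.CalcFractionCSP3(mol),
        "stereocenter_density": n_stereo / heavy,
        "aromatic_fraction": n_aromatic / heavy,
        "clogp": Descriptors.MolLogP(mol),   # crude proxy for lipophilicity trends
    }

# Two hypothetical members of a series: an early and a later analogue.
for smi in ["c1ccc(cc1)C(=O)N", "O=C(N[C@@H]1CCOC1)c1ccc(cc1)C(C)(C)O"]:
    print(smi, series_profile(smi))
```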


Drug Discovery; Retrospective Studies; Ligands; Solubility; Databases, Factual
5.
J Med Chem ; 63(23): 14425-14447, 2020 12 10.
Article En | MEDLINE | ID: mdl-33140646

This article summarizes the evolution of the screening deck at the Novartis Institutes for BioMedical Research (NIBR). Historically, the screening deck was an assembly of all available compounds. In 2015, we designed a first deck to facilitate access to diverse subsets with optimized properties. We allocated the compounds as plated subsets on a 2D grid, with property-based ranking in one dimension and increasing structural redundancy in the other. The learnings from the 2015 screening deck were applied to the design of a next-generation deck in 2019. We found that using traditional leadlikeness criteria (mainly MW, clogP) reduces the hit rates of attractive chemical starting points in subset screening. Consequently, the 2019 deck relies on solubility and permeability to select preferred compounds. The 2019 design also uses NIBR's experimental assay data and inferred biological activity profiles, in addition to structural diversity, to define redundancy across the compound sets.
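
As a toy illustration of the general deck-building pattern (a property gate followed by a diversity pick), assuming placeholder molecules and crude calculated surrogates rather than the solubility and permeability data used for the 2019 deck:

```python
# Toy sketch of the general pattern (property gate, then diversity selection);
# this is not the NIBR deck design. The gate uses calculated TPSA and MolLogP
# only as stand-ins, whereas the 2019 deck relies on solubility/permeability.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "CCCCCCCCCC", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1",
          "CCN(CC)CC", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "OCC(O)CO", "c1ccc2ccccc2c1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# 1) property gate (stand-in for measured/predicted solubility and permeability)
kept = [m for m in mols if Descriptors.TPSA(m) < 140 and Descriptors.MolLogP(m) < 5]

# 2) diversity selection on Morgan fingerprints with the MaxMin algorithm
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in kept]

def tanimoto_distance(i, j, fps=fps):
    return 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

picker = MaxMinPicker()
picks = picker.LazyPick(tanimoto_distance, len(fps), 4, seed=42)
print([Chem.MolToSmiles(kept[i]) for i in picks])
```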


Small Molecule Libraries/chemistry; Drug Design; Drug Evaluation, Preclinical/methods; High-Throughput Screening Assays/methods; Small Molecule Libraries/pharmacology
6.
J Chem Inf Model ; 60(6): 2888-2902, 2020 06 22.
Article En | MEDLINE | ID: mdl-32374165

We investigate different automated approaches for the classification of chemical series in early drug discovery, with the aim of closely mimicking human chemical series conception. Chemical series, which are commonly defined by hand-drawn scaffolds, organize datasets in drug discovery projects. Often, they form the basis for further project decisions. To trace and evaluate these decisions in historic and ongoing projects, it is important to know or reconstruct chemical series. There is not a unique correct definition of chemical series, and the human definition certainly involves a subjective bias. Hence, we first develop quality metrics for the chemical series definitions, evaluating the size and specificity of chemical series. These metrics are applied to categorize human series definitions and implemented in automated classification approaches. For the automated classification of chemical series, we test different fragmentation and similarity-based clustering algorithms and apply different approaches to infer series definitions from these clusters or sets of fragments. We benchmark the classification results against human-defined series from 30 internal projects. The best results in reproducing the composition of human-defined series are achieved when applying UPGMA (unweighted pair group method with arithmetic mean) clustering to the project dataset and calculating maximum common substructures of the clusters as series definitions. We evaluate this approach in more detail on a public dataset and assess its robustness by 10-fold cross-validation, each time sampling 40% of the dataset. Through these benchmarking and validation experiments, we show that the proposed automated approach is able to accurately and robustly identify human-defined series, which comply with a certain, predefined level of specificity and size. Suggesting a thoroughly tested algorithm for series classification, as well as quality metrics for series and several benchmarking approaches, this work lays the foundation for further analysis of project decisions, and it offers an enhanced understanding of the properties of human-defined chemical series.
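
A minimal sketch of the best-performing recipe named above (UPGMA, i.e. average-linkage, clustering on fingerprint distances, then a maximum common substructure per cluster as the series definition), assuming a handful of placeholder molecules and an arbitrary distance cut-off:

```python
# Sketch of the recipe described above: UPGMA (average-linkage) clustering on
# Tanimoto distances, then an MCS per cluster as the series definition.
# Molecules and the distance cut-off are placeholders.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdFMCS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

smiles = ["c1ccc(cc1)C(=O)NC", "c1ccc(cc1)C(=O)NCC", "Clc1ccc(cc1)C(=O)NC",
          "c1ccncc1CCN", "c1ccncc1CCNC", "Cc1ccncc1CCN"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]

n = len(fps)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

Z = linkage(squareform(dist), method="average")        # UPGMA
labels = fcluster(Z, t=0.6, criterion="distance")      # distance cut-off (tunable)

for cluster_id in sorted(set(labels)):
    members = [mols[i] for i in range(n) if labels[i] == cluster_id]
    if len(members) < 2:
        continue
    mcs = rdFMCS.FindMCS(members, completeRingsOnly=True)
    print(f"series {cluster_id}: {len(members)} members, MCS = {mcs.smartsString}")
```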


Algorithms; Benchmarking; Cluster Analysis; Humans
7.
ChemMedChem ; 13(13): 1315-1324, 2018 07 06.
Article En | MEDLINE | ID: mdl-29749719

Chirality is understood by many as a binary concept: a molecule is either chiral or it is not. In terms of the action of a structure on polarized light, this is indeed true. When examined through the prism of molecular recognition, the answer becomes more nuanced. In this work, we investigated chiral behavior in protein-ligand binding: when does chirality make a difference in binding activity? Chirality is a property of the 3D structure, so recognition also requires an appreciation of the conformation. In many situations, the bioactive conformation is undefined. We set out to address this by defining and using several novel 2D descriptors to capture general characteristic features of the chiral center. Using machine-learning methods, we built different predictive models to estimate whether a chiral pair (a set of two enantiomers) might exhibit a chiral cliff in a binding assay. A set of about 3800 chiral pairs extracted from the ChEMBL23 database was used to train and test our models. Achieving an accuracy of up to 75%, our models provide good performance in discriminating chiral cliffs from non-cliffs. More importantly, we were able to derive some simple guidelines for when one can reasonably use a racemate and when an enantiopure compound is needed in an assay. We critically discuss our results and show detailed examples of using our guidelines. Along with this publication we provide our dataset, our novel descriptors, and the Python code to rebuild the predictive models.
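
A pattern-level sketch only: locate the chiral centre with RDKit and train a classifier that flags likely chiral cliffs for enantiomer pairs. The features and labels below are synthetic placeholders, not the published descriptors or the ChEMBL23 pair set.

```python
# Pattern sketch: featurize enantiomer pairs and train a classifier predicting
# whether a pair shows a "chiral cliff". Features/labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pairs, n_features = 400, 12            # e.g. simple 2D features of the chiral centre
X = rng.normal(size=(n_pairs, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_pairs) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("cross-validated accuracy:", acc.mean().round(2))

# Starting point for real features: locate the chiral centre with RDKit.
from rdkit import Chem
mol = Chem.MolFromSmiles("C[C@H](N)C(=O)O")                      # L-alanine, toy example
print(Chem.FindMolChiralCenters(mol, includeUnassigned=True))    # expected: [(1, 'S')]
```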


Proteins/metabolism; Small Molecule Libraries/metabolism; Datasets as Topic/statistics & numerical data; Ligands; Machine Learning; Models, Molecular; Small Molecule Libraries/chemistry; Stereoisomerism
8.
J Chem Inf Model ; 57(8): 1816-1831, 2017 08 28.
Article En | MEDLINE | ID: mdl-28715190

Big data is one of the key transformative factors that increasingly influence all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between them to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo), which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
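
A generic sketch of the "chemical topic modeling" idea (LDA over counts of hashed Morgan fragments) using scikit-learn; this illustrates the principle only and is not necessarily the exact CheTo implementation.

```python
# Generic sketch of chemical topic modeling: LDA over a molecule x fragment
# count matrix built from hashed Morgan environments. Placeholder SMILES only.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import LatentDirichletAllocation

smiles = ["CCO", "CCN", "CCOC", "c1ccccc1", "c1ccccc1C", "c1ccccc1O",
          "C1CCCCC1", "C1CCCCC1N", "CC(=O)O", "CCC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

fps = [AllChem.GetMorganFingerprint(m, 2) for m in mols]        # count-based
vocab = sorted({bit for fp in fps for bit in fp.GetNonzeroElements()})
index = {bit: i for i, bit in enumerate(vocab)}
counts = np.zeros((len(mols), len(vocab)), dtype=int)
for row, fp in enumerate(fps):
    for bit, c in fp.GetNonzeroElements().items():
        counts[row, index[bit]] = c

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)                          # molecule-topic weights
print(np.round(doc_topics, 2))
```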


Data Mining/methods; Databases, Chemical; Algorithms
9.
J Chem Inf Model ; 53(11): 2829-36, 2013 Nov 25.
Article En | MEDLINE | ID: mdl-24171408

The concept of data fusion - the combination of information from different sources describing the same object with the expectation of generating a more accurate representation - has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity search performance. Machine-learning (ML) methods based on the fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature. The heterogeneous version of classifier fusion - fusing the predictions from different model types - has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods - random forest (RF), naïve Bayes (NB), and logistic regression (LR) - with four 2D fingerprints: atom pairs, topological torsions, the RDKit fingerprint, and a circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints, which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets together with the adapted source code for the ML methods are provided in the Supporting Information.
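
A minimal sketch of heterogeneous classifier fusion as described above: RF, NB, and LR trained on the same fingerprint matrix and fused by summing probability ranks. The data are synthetic; this is not the paper's benchmarking platform.

```python
# Sketch of heterogeneous classifier fusion: train RF, naive Bayes and logistic
# regression on the same (synthetic) fingerprint matrix and fuse their ranks.
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 512))                    # binary "fingerprints"
w = rng.normal(size=512)
y = ((X @ w) + rng.normal(scale=2.0, size=600) > np.median(X @ w)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

models = [RandomForestClassifier(n_estimators=200, random_state=0),
          BernoulliNB(),
          LogisticRegression(max_iter=1000)]

rank_sum = np.zeros(len(y_te))
for m in models:
    p = m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(type(m).__name__, "AUC:", round(roc_auc_score(y_te, p), 3))
    rank_sum += rankdata(p)                                # rank-based fusion

print("fused AUC:", round(roc_auc_score(y_te, rank_sum), 3))
```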


Algorithms; Artificial Intelligence; Data Mining; High-Throughput Screening Assays/statistics & numerical data; Proteins/chemistry; User-Computer Interface; Bayes Theorem; Benchmarking; Databases, Chemical; Decision Making; Ligands; Logistic Models; Models, Molecular; Proteins/agonists; Proteins/antagonists & inhibitors
10.
Bioinformatics ; 29(4): 523-4, 2013 Feb 15.
Article En | MEDLINE | ID: mdl-23257198

MOTIVATION: The ChEMBLSpace graphical explorer enables the identification of compounds from the ChEMBL database, which exhibit a desirable polypharmacology profile. This profile can be predefined or created iteratively, and the tool can be extended to other data sources.


Databases, Chemical; Polypharmacy; Software; Computer Graphics; Humans; Ligands; Proteins/drug effects
11.
J Cheminform ; 3(1): 3, 2011 Jan 10.
Article En | MEDLINE | ID: mdl-21219648

BACKGROUND: The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints in several formats. RESULTS: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest path fingerprint, which only includes the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data set, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al. CONCLUSIONS: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics like benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.

12.
J Chem Inf Model ; 51(2): 203-13, 2011 Feb 28.
Article En | MEDLINE | ID: mdl-21207929

The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine achieves excellent performance when applied to high-dimensional sparse feature vectors. An additional advantage is that the complexity of a prediction is, on average, linear in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. In a direct comparison, LIBLINEAR outperformed these reference approaches. A comparison to literature results showed that the LIBLINEAR performance is competitive but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
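
scikit-learn's LinearSVC wraps the LIBLINEAR library discussed here; a minimal sketch on a synthetic, sparse, unbalanced binary problem with class weighting (the screening-specific metrics and clustering of the paper are omitted):

```python
# Minimal LIBLINEAR-style sketch via scikit-learn's LinearSVC on a synthetic,
# sparse, unbalanced binary problem. Not the paper's extended LIBLINEAR code.
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 20000, 4096
X = sparse.random(n_samples, n_features, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0                                    # binarise: sparse "fingerprints"
w = rng.normal(size=n_features)
y = (X @ w > np.quantile(X @ w, 0.95)).astype(int)  # ~5% actives

clf = LinearSVC(C=1.0, class_weight="balanced", max_iter=5000)
auc = cross_val_score(clf, X, y, cv=3, scoring="roc_auc")
print("ROC AUC per fold:", auc.round(3))
```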


Artificial Intelligence; Computational Biology/methods; Drug Evaluation, Preclinical/methods; Structure-Activity Relationship; Databases, Factual; Models, Molecular; Molecular Conformation; Reproducibility of Results; Time Factors; User-Computer Interface
14.
J Cheminform ; 2(1): 2, 2010 Mar 11.
Article En | MEDLINE | ID: mdl-20222949

BACKGROUND: The virtual screening of large compound databases is an important application of structure-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model. RESULTS: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described using a score obtained by an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained by different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if half of the molecules, those with the lowest applicability scores, are omitted from the screening. CONCLUSION: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered as a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
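
One very simple applicability-domain score (maximum Tanimoto similarity to the training set) used to flag screening compounds; this illustrates the filtering idea only and is not necessarily one of the three kernel-based formulations evaluated in the paper. Molecules and the threshold are placeholders.

```python
# Simple applicability-domain illustration: score each screening compound by
# its maximum Tanimoto similarity to the training set and flag outliers.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

train_smiles = ["CC(=O)Nc1ccc(O)cc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "c1ccc2ccccc2c1"]
screen_smiles = ["CC(=O)Nc1ccc(OC)cc1", "OCC(O)CO", "Clc1ccc2ccccc2c1"]

def fp(smi):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, 2048)

train_fps = [fp(s) for s in train_smiles]

for smi in screen_smiles:
    sims = DataStructs.BulkTanimotoSimilarity(fp(smi), train_fps)
    ad_score = max(sims)
    verdict = "inside AD" if ad_score >= 0.3 else "outside AD"   # arbitrary threshold
    print(f"{smi}: AD score = {ad_score:.2f} ({verdict})")
```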

15.
Mol Inform ; 29(5): 441-55, 2010 May 17.
Article En | MEDLINE | ID: mdl-27463199

We present a new probabilistic encoding of the conformational space of a molecule that allows for integration into common similarity calculations. The method uses distance profiles of flexible atom pairs and computes generative models that describe the distance distribution in the conformational space. The generative models permit the use of probabilistic kernel functions and, therefore, our approach can be used to extend existing 3D molecular kernel functions, as applied in support vector machines, to build QSAR models. The resulting kernels are valid 4D kernel functions and reduce the dependency of the model quality on suitable conformations of the molecules. In several experiments, we showed the robust performance of the 4D kernel function extended by our approach in comparison to the original 3D-based kernel function. The new method compares the conformational space of two molecules within one kernel evaluation. Hence, the number of kernel evaluations is significantly reduced in comparison to common kernel-based conformational space averaging techniques. Additionally, the performance gain of the extended model correlates with the flexibility of the data set and enables an a priori estimation of the model improvement.
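
A toy illustration of the principle only, not the published 4D kernels: describe the distance distribution of one flexible atom pair over an RDKit conformer ensemble by a Gaussian, and compare two molecules with the closed-form Bhattacharyya coefficient between the two Gaussians.

```python
# Toy sketch: model the distance distribution of a flexible atom pair over a
# conformer ensemble as a Gaussian and compare two molecules via the
# Bhattacharyya coefficient. Not the published 4D kernel functions.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def pair_distance_stats(smiles, idx_a, idx_b, n_confs=25):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
    dists = [conf.GetAtomPosition(idx_a).Distance(conf.GetAtomPosition(idx_b))
             for conf in mol.GetConformers()]
    dists = np.array(dists)
    return dists.mean(), dists.std() + 1e-6

def bhattacharyya_gaussian(m1, s1, m2, s2):
    # closed form for two 1D normal distributions
    var = s1**2 + s2**2
    return np.sqrt(2 * s1 * s2 / var) * np.exp(-((m1 - m2) ** 2) / (4 * var))

# Terminal oxygens of two flexible diols (heavy-atom indices of the SMILES,
# chosen by hand for this toy example).
m1, s1 = pair_distance_stats("OCCCCO", 0, 5)     # 1,4-butanediol O...O distance
m2, s2 = pair_distance_stats("OCCCCCO", 0, 6)    # 1,5-pentanediol O...O distance
print("similarity of the O...O distance distributions:",
      round(float(bhattacharyya_gaussian(m1, s1, m2, s2)), 3))
```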

17.
J Chem Inf Model ; 49(3): 549-60, 2009 Mar.
Article En | MEDLINE | ID: mdl-19434895

In this work, we introduce a new method to incorporate geometry into a structural similarity measure by approximating the conformational space of a molecule. Our idea is to break down the molecular conformation into the local conformations of neighbor atoms with respect to core atoms. This local geometry can be implicitly accessed through the trajectories of the neighboring atoms, which emerge from rotations about rotatable bonds. In our approach, the physicochemical atomic similarity, which can be used in structured similarity measures, is augmented by a local flexibility similarity, which gives a rough estimate of the similarity of the local conformational space. We incorporated this new flexibility encoding into the optimal assignment molecular similarity approach, which can be used as a pseudokernel in support vector machines. The impact of the local flexibility was evaluated on several published QSAR data sets. This led to an improvement of the model quality on 9 out of 10 data sets compared to the unmodified optimal assignment kernel.
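
A minimal sketch of the optimal assignment step itself: score atom-atom pairs of two molecules with simple features and solve the assignment with the Hungarian algorithm. The local-flexibility term described above is omitted, and the atom features are simplistic placeholders.

```python
# Minimal optimal-assignment similarity: score all atom-atom pairs with simple
# features and find the best one-to-one mapping (Hungarian algorithm). The
# local-flexibility augmentation of the paper is not included here.
import numpy as np
from rdkit import Chem
from scipy.optimize import linear_sum_assignment

def atom_features(mol):
    return np.array([[a.GetAtomicNum(), a.GetDegree(), int(a.GetIsAromatic()),
                      a.GetTotalNumHs()] for a in mol.GetAtoms()], dtype=float)

def atom_similarity(f1, f2):
    # simple RBF similarity on the feature vectors of two atoms
    return np.exp(-np.sum((f1 - f2) ** 2) / 4.0)

def optimal_assignment_similarity(smi_a, smi_b):
    A = atom_features(Chem.MolFromSmiles(smi_a))
    B = atom_features(Chem.MolFromSmiles(smi_b))
    sim = np.array([[atom_similarity(a, b) for b in B] for a in A])
    rows, cols = linear_sum_assignment(-sim)          # maximise total similarity
    # normalise by the larger molecule so the score stays in [0, 1]
    return sim[rows, cols].sum() / max(len(A), len(B))

print(optimal_assignment_similarity("c1ccccc1O", "c1ccccc1N"))   # phenol vs aniline
print(optimal_assignment_similarity("c1ccccc1O", "CCCC"))        # phenol vs butane
```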


Molecular Structure; Models, Molecular; Quantitative Structure-Activity Relationship
18.
J Cheminform ; 1: 14, 2009 Aug 25.
Article En | MEDLINE | ID: mdl-20150995

BACKGROUND: Ligand-based virtual screening experiments are an important task in the early drug discovery stage. An ambitious aim in each experiment is to discover active structures based on new scaffolds. To perform this "scaffold-hopping" for individual problems and targets, a plethora of different similarity methods based on diverse techniques have been published in recent years. The optimal assignment approach on molecular graphs, a successful method in the field of quantitative structure-activity relationships, has not been tested as a ligand-based virtual screening method so far. RESULTS: We evaluated two already published and two new optimal assignment methods on various data sets. To emphasize the "scaffold-hopping" ability, we used information from chemotype clustering analyses in our evaluation metrics. Comparisons with literature results show an improved early recognition performance and comparable results over the complete data set. A new method based on two different assignment steps shows an increased "scaffold-hopping" behavior together with a good early recognition performance. CONCLUSION: The presented methods show a good combination of chemotype discovery and enrichment of active structures. Additionally, the optimal assignment on molecular graphs has the advantage that the mappings can be investigated and interpreted, allowing precise modifications of internal parameters of the similarity measure for specific targets. All methods have low computation times, which makes them applicable to screening large data sets.

...