ABSTRACT
ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL's content has changed dramatically in size and diversity of data types. Through the incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited sources than from data extracted from the literature. In collaboration with the EUbOPEN consortium, chemical probe data are now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity in several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ~270 000 bioactivity measurements.
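A minimal sketch of programmatic access to ChEMBL using the chembl_webresource_client Python package; the target ID (CHEMBL203, EGFR), filters, and selected fields are illustrative choices, not part of the abstract above:

```python
# Sketch: pulling IC50 activities for a target from ChEMBL with the
# chembl_webresource_client package (pip install chembl-webresource-client).
from itertools import islice
from chembl_webresource_client.new_client import new_client

activities = new_client.activity.filter(
    target_chembl_id="CHEMBL203",   # EGFR, used here only as an example
    standard_type="IC50",
    standard_units="nM",
).only(["molecule_chembl_id", "canonical_smiles", "standard_value"])

# print the first few records without fetching the full result set
for act in islice(activities, 5):
    print(act["molecule_chembl_id"], act["standard_value"], "nM")
```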
Subject(s)
Drug Discovery, Factual Databases, Time Factors
ABSTRACT
As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC50 data generated using different assays into a single data set. It is well-known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target. This approach shows that IC50 assays selected using minimal curation settings have poor agreement with each other: almost 65% of the points differ by more than 0.3 log units, 27% differ by more than one log unit, and the correlation between the assays, as measured by Kendall's τ, is only 0.51. Requiring that most of the assay metadata in ChEMBL matches ("maximal curation") in order to combine two assays improves the situation (48% of the points differ by more than 0.3 log units, 13% by more than one log unit, and Kendall's τ is 0.71) at the expense of having smaller data sets. Surprisingly, our analysis shows similar amounts of noise when combining data from different literature Ki assays. We suggest that good scientific practice requires careful curation when combining data sets from different assays and hope that our maximal curation strategy will help to improve the quality of the data that are being used to build and validate ML models for bioactivity prediction. To help achieve this, the code and ChEMBL queries that we used for the maximal curation approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays.
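A toy sketch of the agreement analysis described above, comparing pIC50 values for compounds measured in two different assays against the same target; the numbers are invented purely for illustration:

```python
# Count large disagreements between two assays and compute Kendall's tau,
# mirroring the statistics reported in the abstract (data are made up).
import numpy as np
from scipy.stats import kendalltau

pic50_assay1 = np.array([6.2, 7.1, 5.8, 8.0, 6.9, 7.4])
pic50_assay2 = np.array([6.5, 6.4, 5.9, 7.1, 7.0, 7.9])

diff = np.abs(pic50_assay1 - pic50_assay2)
print(f"> 0.3 log units: {np.mean(diff > 0.3):.0%}")
print(f"> 1.0 log units: {np.mean(diff > 1.0):.0%}")
tau, _ = kendalltau(pic50_assay1, pic50_assay2)
print(f"Kendall's tau: {tau:.2f}")
```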
Subject(s)
Machine Learning, Software, Bioassay
ABSTRACT
Molecular flexibility is a commonly used but not easily quantified term. It is at the core of understanding the composition and size of a conformational ensemble and contributes to many molecular properties. For many computational workflows, it is necessary to reduce a conformational ensemble to meaningful representatives; however, defining these representatives and guaranteeing the ensemble's completeness is difficult. We introduce torsion angular bin strings (TABS) as a discrete vector representation of a conformer's dihedral angles and the number of possible TABS (nTABS) as an estimate of a molecule's ensemble size. Here, we show that nTABS corresponds to an upper limit for the size of the conformational space of small molecules and compare the classification of conformer ensembles by TABS with classifications by RMSD. Overcoming known drawbacks of the RMSD measure, such as its dependence on molecular size and the need to pick a threshold, TABS is shown to meaningfully discretize the conformational space and hence allows, e.g., fast checks of the coverage of the conformational space. The current proof-of-concept implementation is based on the ETKDGv3 conformer generator as implemented in the RDKit and on known torsion preferences extracted from small-molecule crystallographic data.
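A minimal sketch of a TABS-like discretization, assuming a simple 30-degree binning of the dihedral around each rotatable bond; this illustrates the idea, not the authors' implementation:

```python
# Embed a conformer with ETKDGv3, measure the dihedral around each rotatable
# bond, and discretize it into bins (30-degree bins are an arbitrary choice).
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms

mol = Chem.AddHs(Chem.MolFromSmiles("CCOC(=O)c1ccccc1"))
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
conf = mol.GetConformer()

rot_bond = Chem.MolFromSmarts("[!$(*#*)&!D1]-&!@[!$(*#*)&!D1]")  # rotatable bonds
bins = []
for j, k in mol.GetSubstructMatches(rot_bond):
    # pick one neighbour on each side of the bond to define a dihedral
    i = next(a.GetIdx() for a in mol.GetAtomWithIdx(j).GetNeighbors() if a.GetIdx() != k)
    l = next(a.GetIdx() for a in mol.GetAtomWithIdx(k).GetNeighbors() if a.GetIdx() != j)
    angle = rdMolTransforms.GetDihedralDeg(conf, i, j, k, l)
    bins.append(int((angle % 360) // 30))        # 12 bins of 30 degrees each
print(bins)  # a discrete torsion-angle bin string for this conformer
```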
Subject(s)
Molecular Conformation, Molecular Models
ABSTRACT
Here, we present lwreg, a lightweight yet flexible chemical registration system supporting the capture of both two-dimensional molecular structures (topologies) and three-dimensional conformers. lwreg is open source, has a simple Python API, and is designed to be easily integrated into computational workflows. In addition to lwreg itself, we also introduce a straightforward schema for storing experimental data and metadata in the registration database. This direct connection between compound structural information and the data generated using those structures creates a powerful tool for data analysis and experimental reproducibility. The software is available at, and can be installed directly from, https://github.com/rinikerlab/lightweight-registration.
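A hypothetical usage sketch for lwreg; the function and argument names below (defaultConfig, initdb, register, query) are assumptions based on the project documentation, so the repository should be consulted for the current API:

```python
# Hypothetical lwreg usage; names are assumptions, not a verified API reference.
from lwreg import utils

config = utils.defaultConfig()
config["dbname"] = "demo_registration.sqlt"   # hypothetical SQLite backend file

utils.initdb(config=config, confirm=True)     # create the registration tables
newid = utils.register(config=config, smiles="CC(=O)Oc1ccccc1C(=O)O")
matches = utils.query(config=config, smiles="CC(=O)Oc1ccccc1C(=O)O")
print(newid, matches)
```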
Subject(s)
Information Storage and Retrieval, Software, Chemical Compound Databases, Molecular Conformation
ABSTRACT
Recently, we presented a method to assign atomic partial charges based on the DASH (dynamic attention-based substructure hierarchy) tree with high efficiency and quantum mechanical (QM)-like accuracy. In addition, the approach can be considered "rule based", where the rules are derived from the attention values of a graph neural network, and thus each assignment is fully explainable by visualizing the underlying molecular substructures. In this work, we demonstrate that these hierarchically sorted substructures capture the key features of the local environment of an atom and allow us to predict different atomic properties with high accuracy without building a new DASH tree for each property. The fast prediction of atomic properties in molecules with the DASH tree can, for example, be used as an efficient way to generate feature vectors for machine learning without the need for expensive QM calculations. The final DASH tree with the different atomic properties, as well as the complete dataset with wave functions, is made freely available.
ABSTRACT
We present a robust and computationally efficient approach for assigning partial charges of atoms in molecules. The method is based on a hierarchical tree constructed from attention values extracted from a graph neural network (GNN), which was trained to predict atomic partial charges from accurate quantum-mechanical (QM) calculations. The resulting dynamic attention-based substructure hierarchy (DASH) approach provides fast assignment of partial charges with the same accuracy as the GNN itself, is software-independent, and can easily be integrated into existing parametrization pipelines, as shown for the Open Force Field (OpenFF). The implementation of the DASH workflow, the final DASH tree, and the training set are available as open source/open data from public repositories.
ABSTRACT
Nuclear magnetic resonance (NMR) data from NOESY (nuclear Overhauser enhancement spectroscopy) and ROESY (rotating-frame Overhauser enhancement spectroscopy) experiments can easily be combined with distance geometry (DG) based conformer generators by modifying the molecular distance bounds matrix. In this work, we extend the modern DG-based conformer generator ETKDG, which has been shown to reproduce experimental crystal structures well for systems ranging from small molecules to large macrocycles, to include NOE-derived interproton distances. In noeETKDG, the experimentally derived interproton distances are incorporated into the distance bounds matrix as loose upper (or lower) bounds to generate large conformer sets. Various subselection techniques can subsequently be applied to yield a conformer bundle that best reproduces the NOE data. The approach is benchmarked using a set of 24 (mostly) cyclic peptides for which NOE-derived distances as well as reference solution structures obtained with other software are available. Compared with other packages currently available, the advantages of noeETKDG are its speed and the fact that no prior force-field parametrization is required, which is especially useful for peptides containing unnatural amino acids. The resulting conformer bundles can be further processed with structural refinement techniques to improve the modeling of the intramolecular nonbonded interactions. The noeETKDG code is released as a fully open-source software package available at www.github.com/rinikerlab/customETKDG.
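A sketch of the general mechanism described above (modifying the distance bounds matrix before ETKDG embedding), not the customETKDG package itself; the atom indices and distance bounds are hypothetical:

```python
# Tighten the distance bounds between a pair of protons to mimic an
# NOE-derived restraint, then embed conformers with ETKDGv3.
from rdkit import Chem, DistanceGeometry
from rdkit.Chem import rdDistGeom

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)NC(C)C(=O)NC"))
bm = rdDistGeom.GetMoleculeBoundsMatrix(mol)

i, j = 11, 18            # hypothetical proton indices; from NOE assignments in practice
lower, upper = 1.8, 4.0  # Angstrom; a loose upper bound, as described above
bm[min(i, j)][max(i, j)] = upper   # upper bounds sit above the diagonal
bm[max(i, j)][min(i, j)] = lower   # lower bounds below the diagonal
DistanceGeometry.DoTriangleSmoothing(bm)

params = rdDistGeom.ETKDGv3()
params.SetBoundsMat(bm)
params.randomSeed = 42
cids = rdDistGeom.EmbedMultipleConfs(mol, 20, params)
print(f"generated {len(cids)} conformers")
```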
Subject(s)
Cyclic Peptides, Peptides, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy/methods, Molecular Models, Protein Conformation
ABSTRACT
Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary classification, the decision threshold is set by default to 0.5, which is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy for dealing with the class imbalance problem. In this work, we present two different automated procedures for selecting the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can potentially be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
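A hand-rolled sketch of the thresholding idea on synthetic imbalanced data, scanning decision thresholds against out-of-bag predictions and keeping the one that maximizes Cohen's kappa; this illustrates the concept rather than reproducing the authors' released implementation:

```python
# Pick the decision threshold that maximizes Cohen's kappa on the out-of-bag
# predictions instead of using the default 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

oob_probs = clf.oob_decision_function_[:, 1]        # out-of-bag probabilities
thresholds = np.arange(0.05, 0.55, 0.05)
best = max(thresholds,
           key=lambda t: cohen_kappa_score(y, (oob_probs >= t).astype(int)))
print(f"optimized decision threshold: {best:.2f}")
```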
Subject(s)
Algorithms, Machine Learning
ABSTRACT
We present an implementation of the scaffold network in the open-source cheminformatics toolkit RDKit. Scaffold networks have been introduced in the literature as a powerful method to navigate and analyze large screening data sets in medicinal chemistry. Such a network is created by iteratively applying predefined fragmentation rules to the investigated set of small molecules and by linking the resulting fragments according to their parent-child relationships. This procedure results in a network graph, where the nodes correspond to the fragments and the edges correspond to the operations producing one fragment from another. Extending the scaffold network implementations suggested in the literature, the presented RDKit implementation allows enhanced flexibility in customizing the fragmentation rules and enables the inclusion of atom- and bond-generic scaffolds in the network. The output, providing node and edge information on the network, enables simple and elegant navigation through the network, laying the basis to organize and better understand the data set being investigated.
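A brief usage sketch of the RDKit scaffold network module described above; the input molecules and parameter choices are illustrative:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import rdScaffoldNetwork

# two small example molecules; parameter choices are illustrative
mols = [Chem.MolFromSmiles(s) for s in ("O=C(Nc1ccccc1)c1ccccc1", "NCCc1ccccc1")]
params = rdScaffoldNetwork.ScaffoldNetworkParams()
params.includeGenericScaffolds = True            # add atom-/bond-generic scaffolds
params.includeScaffoldsWithoutAttachments = True

net = rdScaffoldNetwork.CreateScaffoldNetwork(mols, params)
print(len(net.nodes), "nodes,", len(net.edges), "edges")
for node in list(net.nodes)[:5]:                 # nodes are scaffold SMILES
    print(node)
```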
Subject(s)
Cheminformatics, Software, Pharmaceutical Chemistry
ABSTRACT
The conformer generator ETKDG is a stochastic search method that utilizes distance geometry together with knowledge derived from experimental crystal structures. It has been shown to generate good conformers for acyclic, flexible molecules. This work builds on ETKDG to improve conformer generation of molecules containing small or large aliphatic (i.e., non-aromatic) rings. For one, we devise additional torsional-angle potentials to describe small aliphatic rings and adapt the previously developed potentials for acyclic bonds to facilitate the sampling of macrocycles. However, due to the larger number of degrees of freedom of macrocycles, the conformational space to sample is much broader than for small molecules, creating a challenge for conformer generators. We therefore introduce different heuristics to restrict the search space of macrocycles and bias the sampling toward more experimentally relevant structures. Specifically, we show the usage of elliptical geometry and customizable Coulombic interactions as heuristics. The performance of the improved ETKDG is demonstrated on test sets of diverse macrocycles and cyclic peptides. The code developed here will be incorporated into the 2020.03 release of the open-source cheminformatics library RDKit.
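A short usage sketch showing macrocycle conformer generation with the ETKDGv3 parameter set shipped with the RDKit; the macrolactone used here is an illustrative example:

```python
from rdkit import Chem
from rdkit.Chem import rdDistGeom

# a simple 12-membered macrolactone as an illustrative macrocycle
macro = Chem.AddHs(Chem.MolFromSmiles("O=C1CCCCCCCCCCO1"))

params = rdDistGeom.ETKDGv3()          # includes the macrocycle torsion terms
params.useSmallRingTorsions = True     # extra torsion terms for small aliphatic rings
params.randomSeed = 0xf00d
cids = rdDistGeom.EmbedMultipleConfs(macro, 50, params)
print(f"{len(cids)} macrocycle conformers generated")
```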
Subject(s)
Heuristics, Cyclic Peptides, Molecular Models, Molecular Conformation
ABSTRACT
Open-source workflows have become an increasingly integral part of computer-aided drug design (CADD) projects, since they allow reproducible and shareable research that can easily be transferred to other projects. Setting up, understanding, and applying such workflows involves either coding or using workflow managers that offer a graphical user interface. We previously reported the TeachOpenCADD teaching platform, which provides interactive Jupyter Notebooks (talktorials) on central CADD topics using open-source data and Python packages. Here we present the conversion of these talktorials to KNIME workflows that allow users to explore our teaching material without writing a single line of code. TeachOpenCADD KNIME workflows are freely available on the KNIME Hub: https://hub.knime.com/volkamerlab/space/TeachOpenCADD.
Subject(s)
Drug Design, Chemical Models, Software, Workflow, Computer Simulation
ABSTRACT
Big data is one of the key transformative factors that increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling such huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate the novel method in different experiments and discuss both its advantages and disadvantages. We show very promising results in reproducing human-assigned concepts when using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo), which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
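A toy sketch of the chemical-topics idea using Morgan-fragment counts and scikit-learn's LDA implementation; this is not the CheTo code itself, and the tiny molecule set is purely illustrative:

```python
# Represent each molecule by its hashed Morgan-fragment counts and fit an
# LDA topic model over the molecule x fragment count matrix.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import LatentDirichletAllocation

smiles = ["CCOC(=O)c1ccccc1", "COC(=O)c1ccccc1N", "CCN(CC)CCNC(=O)c1ccccc1",
          "C1CCNCC1", "C1CCN(CC1)C(=O)C"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

fps = [AllChem.GetMorganFingerprint(m, 2) for m in mols]
frag_ids = sorted({bit for fp in fps for bit in fp.GetNonzeroElements()})
col = {b: i for i, b in enumerate(frag_ids)}
X = np.zeros((len(mols), len(frag_ids)), dtype=int)
for row, fp in enumerate(fps):
    for bit, count in fp.GetNonzeroElements().items():
        X[row, col[bit]] = count

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))   # per-molecule topic probabilities
```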
Subject(s)
Data Mining/methods, Chemical Compound Databases, Algorithms
ABSTRACT
When analyzing chemical reactions, it is essential to know which molecules are actively involved in the reaction and which educts will form the product molecules. Assigning reaction roles, like reactant, reagent, or product, to the molecules of a chemical reaction may be a trivial problem for hand-curated reaction schemes, but it is more difficult to automate, an essential step when handling large amounts of reaction data. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles that is also applicable to rather unbalanced and noisy reaction schemes. Given the set of molecules involved and knowing the product(s) of a reaction, we assign the most probable reactants and sort out the remaining reagents. Our approach was validated using two different data sets: one hand-curated data set comprising about 680 diverse reactions extracted from patents, which span more than 200 different reaction types and include up to 18 different reactants; and a second set of 50 000 randomly picked reactions from US patents. The results for the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets our method assigns the reaction roles correctly for the vast majority of the reactions, achieving accuracies of 88% and 97%, respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit, and the data sets and Jupyter notebooks used for evaluation of the new method are available in the Supporting Information of this publication.
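A toy illustration of the intuition behind fingerprint-based role assignment (not the published algorithm): candidate molecules whose substructure environments barely appear in the product are likely reagents rather than reactants:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# candidate molecules of a hypothetical amide coupling; names are illustrative
candidates = {"acetic acid": "CC(=O)O", "benzylamine": "NCc1ccccc1",
              "triethylamine (base)": "CCN(CC)CC"}
product = Chem.MolFromSmiles("CC(=O)NCc1ccccc1")

prod_bits = set(AllChem.GetMorganFingerprint(product, 2).GetNonzeroElements())
for name, smi in candidates.items():
    mol_bits = set(AllChem.GetMorganFingerprint(Chem.MolFromSmiles(smi), 2).GetNonzeroElements())
    overlap = len(mol_bits & prod_bits) / len(mol_bits)
    print(f"{name}: {overlap:.2f} of its environments appear in the product")
```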
Subject(s)
Drug Discovery, Chemical Models, Software, Algorithms, Chemical Compound Databases, Drug Discovery/methods, Indicators and Reagents/chemistry, Patents as Topic
ABSTRACT
Small organic molecules are often flexible, i.e., they can adopt a variety of low-energy conformations in solution that exist in equilibrium with each other. Two main search strategies are used to generate representative conformational ensembles for molecules: systematic and stochastic. In the first approach, each rotatable bond is sampled systematically in discrete intervals, limiting its use to molecules with a small number of rotatable bonds. Stochastic methods, on the other hand, sample the conformational space of a molecule randomly and can thus be applied to more flexible molecules. Different methods employ different degrees of experimental data for conformer generation. So-called knowledge-based methods use predefined libraries of torsional angles and ring conformations. In the distance geometry approach, on the other hand, a smaller amount of empirical information is used, i.e., ideal bond lengths, ideal bond angles, and a few ideal torsional angles. Distance geometry is a computationally fast method to generate conformers, but it has the downside that purely distance-based constraints tend to lead to distorted aromatic rings and sp2 centers. To correct this, the resulting conformations are often minimized with a force field, adding computational complexity and run time. Here we present an alternative strategy that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data. The torsional angles are described by a previously developed set of hierarchically structured SMARTS patterns. The new approach is implemented in the open-source cheminformatics library RDKit, and its performance is assessed by comparing the diversity of the generated ensemble and the ability to reproduce crystal conformations taken from the crystal structures of small molecules and protein-ligand complexes.
Subject(s)
Algorithms, Molecular Models, Stochastic Processes, Molecular Conformation, Organic Compounds/chemistry
ABSTRACT
Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.
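A short sketch showing canonical atom ranking and canonical SMILES generation with the RDKit, where the algorithm described above is implemented:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("OCC1OC(O)C(O)C(O)C1O")   # a hexose, written arbitrarily
ranks = list(Chem.CanonicalRankAtoms(mol))          # canonical rank per atom index
print(ranks)

# the canonical ranks are what make the output SMILES independent of input order
print(Chem.MolToSmiles(mol))
renumbered = Chem.RenumberAtoms(mol, list(reversed(range(mol.GetNumAtoms()))))
print(Chem.MolToSmiles(renumbered))                  # identical canonical SMILES
```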
Subject(s)
Algorithms, Molecular Models, Small Molecule Libraries/chemistry, Software, Stereoisomerism
ABSTRACT
Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 of the 50 reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all Python scripts required to reproduce the analysis are provided in the Supporting Information.
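A minimal sketch using the RDKit's difference fingerprint for reactions, which follows the products-minus-reactants construction described above; the example reaction is illustrative:

```python
from rdkit.Chem import rdChemReactions

rxn = rdChemReactions.ReactionFromSmarts(
    "CC(=O)O.NCc1ccccc1>>CC(=O)NCc1ccccc1", useSmiles=True)
fp = rdChemReactions.CreateDifferenceFingerprintForReaction(rxn)
print(fp.GetNonzeroElements())   # sparse {bit: count} view of the reaction fingerprint
```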
Subject(s)
Artificial Intelligence, Chemical Compound Databases, Chemical Models, Cluster Analysis, Organic Chemistry Phenomena, Patents as Topic, Reproducibility of Results
ABSTRACT
Modern high-throughput screening (HTS) is a well-established approach for hit finding in drug discovery that is routinely employed in the pharmaceutical industry to screen more than a million compounds within a few weeks. However, as the industry shifts to more disease-relevant but more complex phenotypic screens, the focus has moved to piloting smaller but smarter chemically/biologically diverse subsets followed by an expansion around hit compounds. One standard method for doing this is to train a machine-learning (ML) model with the chemical fingerprints of the tested subset of molecules and then select the next compounds based on the predictions of this model. An alternative approach would be to take advantage of the wealth of bioactivity information contained in older (full-deck) screens using so-called HTS fingerprints, where each element of the fingerprint corresponds to the outcome of a particular assay, as input to machine-learning algorithms. We constructed HTS fingerprints using two collections of data: 93 in-house assays and 95 publicly available assays from PubChem. For each source, an additional set of 51 and 46 assays, respectively, was collected for testing. Three different ML methods, random forest (RF), logistic regression (LR), and naïve Bayes (NB), were investigated for both the HTS fingerprint and a chemical fingerprint, Morgan2. RF was found to be best suited for learning from HTS fingerprints, yielding area under the receiver operating characteristic curve (AUC) values >0.8 for 78% of the internal assays and enrichment factors at 5% (EF5%) >10 for 55% of the assays. The RF(HTS-fp) generally outperformed the LR trained with Morgan2, which was the best ML method for the chemical fingerprint, for the majority of assays. In addition, HTS fingerprints were found to retrieve more diverse chemotypes. Combining the two models through heterogeneous classifier fusion led to similar or better performance than the best individual model for all assays. Further validation using a pair of in-house assays and data from a confirmatory screen, including a prospective set of around 2000 compounds selected based on our approach, confirmed the good performance. Thus, the combination of machine learning with HTS fingerprints and chemical fingerprints utilizes information from both domains and presents a very promising approach for hit expansion, leading to more hits. The source code used with the public data is provided.
Subject(s)
High-Throughput Screening Assays/methods, Informatics/methods, Algorithms, Artificial Intelligence, Bayes Theorem, Logistic Models
ABSTRACT
The concept of data fusion, the combination of information from different sources describing the same object with the expectation of generating a more accurate representation, has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity-search performance. Machine-learning (ML) methods based on the fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature. The heterogeneous version of classifier fusion, fusing the predictions from different model types, has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods, RF, naïve Bayes (NB), and logistic regression (LR), with four 2D fingerprints: atom pairs, topological torsions, the RDKit fingerprint, and a circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints, which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets, together with the adapted source code for the ML methods, are provided in the Supporting Information.
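A minimal sketch of heterogeneous classifier fusion on synthetic data, using rank averaging as one possible fusion rule; the data and feature binarization below are illustrative stand-ins for fingerprints:

```python
# Fuse the predicted probabilities of three different model types by rank
# averaging and use the fused score to rank screening candidates.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=64, random_state=0)
X = (X > 0).astype(int)                      # crude stand-in for binary fingerprints
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = [RandomForestClassifier(random_state=0),
          LogisticRegression(max_iter=1000),
          BernoulliNB()]
probs = [m.fit(Xtr, ytr).predict_proba(Xte)[:, 1] for m in models]

fused_rank = np.mean([rankdata(p) for p in probs], axis=0)   # rank-average fusion
top10 = np.argsort(fused_rank)[::-1][:10]                    # top-ranked "actives"
print(top10)
```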
Subject(s)
Algorithms, Artificial Intelligence, Data Mining, High-Throughput Screening Assays/statistics & numerical data, Proteins/chemistry, User-Computer Interface, Bayes Theorem, Benchmarking, Chemical Compound Databases, Decision Making, Ligands, Logistic Models, Molecular Models, Proteins/agonists, Proteins/antagonists & inhibitors
ABSTRACT
Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits which more accurately reflect the differences in properties and machine-learning performance observed for temporal splits than other standard approaches like random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 99 public data sets which can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.
ABSTRACT
Several efficient correspondence graph-based algorithms for determining the maximum common substructure (MCS) of a pair of molecules have been published in the literature. The extension of the problem to three or more molecules is, however, nontrivial; heuristics used to increase the efficiency in the two-molecule case are either inapplicable to the many-molecule case or do not provide significant speedups. Our specific algorithmic contribution is twofold. First, we show how the correspondence graph approach for the two-molecule case can be generalized to obtain an algorithm that is guaranteed to find the optimum connected MCS of multiple molecules and that runs fast on most families of molecules using a new divide-and-conquer strategy that has hitherto not been reported in this context. Second, we provide a characterization of those compound families for which the algorithm might run slowly, along with a heuristic for speeding up computations on these families. We also extend the above algorithm to a heuristic algorithm to find the disconnected MCS of multiple molecules and to an algorithm for clustering molecules into groups, with each group sharing a substantial MCS. Our methods are flexible in that they provide exquisite control over the various matching criteria used to define a common substructure.
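For comparison, the RDKit's FMCS module also finds a maximum common substructure across multiple molecules; it is a separate implementation, not the correspondence-graph algorithm described above:

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

mols = [Chem.MolFromSmiles(s) for s in
        ("c1ccccc1C(=O)NC", "c1ccccc1C(=O)NCC", "Clc1ccccc1C(=O)NC(C)C")]
result = rdFMCS.FindMCS(mols, completeRingsOnly=True, timeout=10)
print(result.numAtoms, result.numBonds, result.smartsString)
```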