ABSTRACT
Visualization and analysis of large chemical reaction networks become challenging when conventional graph-based approaches are used. As an alternative, we propose the chemical cartography ("chemography") approach, which describes the data distribution on a 2-dimensional map. Here, the Generative Topographic Mapping (GTM) algorithm, an advanced chemography approach, has been applied to visualize the reaction path network of a simplified Wilkinson's catalyst-catalyzed hydrogenation containing some 105 structures generated with the Artificial Force Induced Reaction (AFIR) method, using either Density Functional Theory or a Neural Network Potential (NNP) for potential energy surface calculations. Using new atom-permutation-invariant 3D descriptors for structure encoding, we demonstrate that GTM is able to cluster structures that share the same 2D representation, to visualize the potential energy surface, to provide insight into reaction path exploration as a function of time, and to compare reaction path networks obtained with different methods of energy assessment.
ABSTRACT
The advent of high-performance virtual screening techniques nowadays allows drug designers to explore ultra-large sets of candidate compounds in search of molecules predicted to have desired properties. However, the success of such an endeavor heavily relies on the pertinence (drug-likeness and, foremost, chemical feasibility) of these candidates; otherwise, virtual screening will return valueless "hits", by the garbage in/garbage out principle. The huge popularity of the judiciously enumerated Enamine REAL Space is clear proof of the strength of this Big Data trend in drug discovery. Here we describe a new dataset of make-on-demand compounds called the Freedom Space. It follows the principles of Enamine REAL Space and contains highly feasible molecules (synthesis success rate over 75 percent). However, scaffold and chemography analysis revealed significant differences from both the REAL Space and the biologically annotated compounds of the ChEMBL database. The Freedom Space is a significant extension of the REAL Space and can be utilized for a more comprehensive exploration of the synthetically feasible chemical space in hit finding and hit-to-lead campaigns.
ABSTRACT
The Bone-Marrow derived Dendritic Cell (BMDC) test is a promising assay for identifying sensitizing chemicals based on the 3Rs (Replace, Reduce, Refine) principle. This study expanded the BMDC benchmarking to various in vitro, in chemico, and in silico assays targeting different key events (KE) in the skin sensitization pathway, using common substance datasets. Additionally, a Quantitative Structure-Activity Relationship (QSAR) model was developed to predict the BMDC test outcomes for sensitizing or non-sensitizing chemicals. The modeling workflow involved ISIDA (In Silico Design and Data Analysis) molecular fragment descriptors and the SVM (Support Vector Machine) machine-learning method. The BMDC model's performance was at least comparable to that of all ECVAM-validated models regardless of the KE considered. Compared with other tests targeting KE3, related to dendritic cell activation, the BMDC assay was shown to have higher balanced accuracy and sensitivity with respect to both the Local Lymph Node Assay (LLNA) and human labels, providing additional evidence for its reliability. The consensus QSAR model exhibits promising results, correlating well with observed sensitization potential. Integrated into a publicly available web service, the BMDC-based QSAR model may serve as a cost-effective and rapid alternative to lab experiments, providing preliminary screening for sensitization potential, compound prioritization, optimization and risk assessment.
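The two benchmarking metrics named in this abstract, sensitivity and balanced accuracy, can be illustrated with a minimal Python sketch. The label vectors below are hypothetical and are not taken from the study; 1 stands for "sensitizer" and 0 for "non-sensitizer".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = sensitizer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    """Fraction of reference sensitizers that the assay flags as sensitizers."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Fraction of reference non-sensitizers that the assay flags as non-sensitizers."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    return 0.5 * (sensitivity(y_true, y_pred) + specificity(y_true, y_pred))


# Hypothetical reference labels (e.g. LLNA) and assay calls for eight substances.
reference = [1, 1, 1, 1, 0, 0, 0, 0]
assay = [1, 1, 1, 1, 0, 0, 1, 0]

print(sensitivity(reference, assay))        # 1.0
print(balanced_accuracy(reference, assay))  # 0.875
```

Balanced accuracy is the usual choice when sensitizers and non-sensitizers are unevenly represented, since plain accuracy would reward always predicting the majority class.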
Subjects
Benchmarking , Dendritic Cells , Quantitative Structure-Activity Relationship , Dendritic Cells/drug effects , Humans , Animals , Support Vector Machine , Computer Simulation , Allergic Contact Dermatitis , Allergens/toxicity , Animal Testing Alternatives/methods , Bone Marrow Cells/drug effects , Local Lymph Node Assay , Mice
ABSTRACT
The cutaneous absorption parameters of xenobiotics are crucial for the development of drugs and cosmetics, as well as for assessing environmental and occupational chemical risks. Despite the great variability in the design of experimental conditions due to uncertain international guidelines, datasets like HuskinDB have been created to report skin absorption endpoints. This review updates available skin permeability data by rigorously compiling research published between 2012 and 2021. Inclusion and exclusion criteria have been selected to build the most harmonized and reusable dataset possible. The Generative Topographic Mapping method was applied to the present dataset and compared to HuskinDB to monitor the progress in skin permeability research and locate chemotypes of particular concern. The open-source dataset (SkinPiX) includes steady-state flux, maximum flux, lag time and permeability coefficient results for the substances tested, as well as relevant information on experimental parameters that can impact the data. It can be used to extract subsets of data for comparisons and to build predictive models.
Subjects
Skin Absorption , Skin , Xenobiotics , Permeability , Skin/metabolism , Xenobiotics/metabolism , Datasets as Topic , Humans
ABSTRACT
Increasing antimicrobial resistance (AMR) represents a global healthcare threat. To decrease the spread of AMR and associated mortality, methods for rapid selection of optimal antibiotic treatment are urgently needed. Machine learning (ML) models based on genomic data to predict resistant phenotypes can serve as a fast screening tool prior to phenotypic testing. Nonetheless, many existing ML methods lack interpretability. Therefore, we present a methodology for visualization of sequence space and AMR prediction based on the non-linear dimensionality reduction method - generative topographic mapping (GTM). This approach, applied to AMR data of >5000 S. aureus isolates retrieved from the PATRIC database, yielded GTM models with reasonable accuracy for all drugs (balanced accuracy values ≥0.75). The Generative Topographic Maps (GTMs) represent data in the form of illustrative maps of the genomic space and allow for antibiotic-wise comparison of resistant phenotypes. The maps were also found to be useful for the analysis of genetic determinants responsible for drug resistance. Overall, the GTM-based methodology is a useful tool for both the illustrative exploration of the genomic sequence space and AMR prediction.
Subjects
Anti-Bacterial Agents , Bacterial Drug Resistance , Machine Learning , Staphylococcus aureus , Staphylococcus aureus/drug effects , Staphylococcus aureus/genetics , Anti-Bacterial Agents/pharmacology , Bacterial Drug Resistance/genetics , Bacterial Drug Resistance/drug effects , Bacterial Genome , Genomics/methods , Humans
ABSTRACT
Quantitative structure-activity relationship (QSAR) modelling, an approach that was introduced 60 years ago, is widely used in computer-aided drug design. In recent years, progress in artificial intelligence techniques, such as deep learning, the rapid growth of databases of molecules for virtual screening and dramatic improvements in computational power have supported the emergence of a new field of QSAR applications that we term 'deep QSAR'. Marking a decade from the pioneering applications of deep QSAR to tasks involved in small-molecule drug discovery, we herein describe key advances in the field, including deep generative and reinforcement learning approaches in molecular design, deep learning models for synthetic planning and the application of deep QSAR models in structure-based virtual screening. We also reflect on the emergence of quantum computing, which promises to further accelerate deep QSAR applications and the need for open-source and democratized resources to support computer-aided drug design.
Subjects
Deep Learning , Quantitative Structure-Activity Relationship , Humans , Artificial Intelligence , Computing Methodologies , Quantum Theory , Drug Discovery/methods , Drug Design
ABSTRACT
Kinetic aqueous or buffer solubility is an important parameter for measuring the suitability of compounds for high-throughput assays in early drug discovery, while thermodynamic solubility is reserved for later stages of drug discovery and development. Kinetic solubility is also considered to have low inter-laboratory reproducibility because of its sensitivity to protocol parameters [1]. Presumably, this is why little effort has been put into building QSPR models for kinetic, as compared to thermodynamic, aqueous solubility. Here, we investigate the reproducibility and modelability of kinetic solubility assays. We first analyzed the relationship between kinetic and thermodynamic solubility data, and then examined the consistency of data from different kinetic assays. In this contribution, we report differences between kinetic and thermodynamic solubility data that are consistent with those reported by others [1, 2] and, in contrast to general expectations, good agreement between data from different kinetic solubility campaigns. The latter is confirmed by the high performance of QSPR models trained on merged kinetic solubility datasets. The poor performance of a QSPR model trained on thermodynamic solubility when applied to a kinetic solubility dataset reinforces the conclusion that kinetic and thermodynamic solubilities do not correlate: one cannot be used as an ersatz for the other. This encourages the building of predictive models for kinetic solubility. The kinetic solubility QSPR model developed in this study is freely accessible through the Predictor web service of the Laboratory of Chemoinformatics (https://chematlas.chimie.unistra.fr/cgi-bin/predictor2.cgi).
Subjects
Drug Discovery , High-Throughput Screening Assays , Solubility , Reproducibility of Results , Water , Machine Learning
ABSTRACT
Computational design of chiral organic catalysts for asymmetric synthesis is a promising technology that can significantly reduce the material and human resources required for the preparation of enantiopure compounds. Herein, for the modeling of catalysts' enantioselectivity, we propose to use the multi-instance learning approach, which accounts for multiple catalyst conformers and requires neither conformer selection nor their spatial alignment. A catalyst was represented by an ensemble of conformers, each encoded by three-dimensional (3D) pmapper descriptors. A catalyzed reactant transformation was converted into a single molecular graph, a condensed graph of reaction, encoded by 2D fragment descriptors. A whole chemical reaction was finally encoded by concatenated 3D catalyst and 2D transformation descriptors. The performance of the proposed method was demonstrated in the modeling of the enantioselectivity of homogeneous and phase-transfer reactions and compared with the state-of-the-art approaches.
Subjects
Catalysis
ABSTRACT
We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. One thousand one hundred eighty-one students and academics from seventy-nine countries registered for the meeting. As part of the meeting, advances in enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compounds database used in cheminformatics. The school also included a round table with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured .
ABSTRACT
In chemical library analysis, it may be useful to describe libraries as individual items rather than collections of compounds. This is particularly true for ultra-large noncherry-pickable compound mixtures, such as DNA-encoded libraries (DELs). In this sense, the chemical library space (CLS) is useful for the management of a portfolio of libraries, just like chemical space (CS) helps manage a portfolio of molecules. Several possible CLSs were previously defined using vectorial library representations obtained from generative topographic mapping (GTM). Given the steadily growing number of DEL designs, the CLS becomes "crowded" and requires analysis tools beyond pairwise library comparison. Therefore, herein, we investigate the cartography of CLS on meta-(µ)GTMs, "meta" serving as a reminder that these are maps of the CLS, itself based on responsibility vectors issued by regular CS GTMs. 2.5 K DELs and ChEMBL (reference) were projected on the µGTM, producing landscapes of library-specific properties. These describe both interlibrary similarity and intrinsic library characteristics in the same view, thereby facilitating the selection of the best project-specific libraries.
Subjects
Small Molecule Libraries , Gene Library
ABSTRACT
This study introduces a new de novo design algorithm called GENERA that combines the capabilities of a deep-learning algorithm for automated drug-like analogue design, called DeLA-Drug, with a genetic algorithm for generating molecules with desired target-oriented properties. Specifically, GENERA was applied to the angiotensin-converting enzyme 2 (ACE2) target, which is implicated in many pathological conditions, including COVID-19. The ability of GENERA to de novo design promising candidates for a specific target was assessed using two docking programs, PLANTS and GLIDE. A fitness function based on the Pareto dominance resulting from computed PLANTS and GLIDE scores was applied to demonstrate the algorithm's ability to perform multiobjective optimizations effectively. GENERA can quickly generate focused libraries that produce better scores compared to a starting set of known ACE-2 binders. This study is the first to utilize a DL-based algorithm designed for analogue generation as a mutational operator within a GA framework, representing an innovative approach to target-oriented de novo design.
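A fitness function based on Pareto dominance over two docking scores can be sketched as follows. This is only an illustration and not GENERA's actual implementation: it assumes that lower (more negative) scores are better for both programs, uses a simple non-dominated sorting rank as the fitness, and the molecule names and score values are hypothetical.

```python
def dominates(a, b):
    """True if score tuple a Pareto-dominates b (assumption: lower is better).

    a dominates b when it is no worse in every objective and strictly
    better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_ranks(scores):
    """Assign rank 0 to the non-dominated front, 1 to the next front, and so on."""
    remaining = dict(scores)
    ranks = {}
    rank = 0
    while remaining:
        # Current front: items not dominated by any other remaining item.
        front = [
            name for name, s in remaining.items()
            if not any(dominates(o, s) for other, o in remaining.items() if other != name)
        ]
        for name in front:
            ranks[name] = rank
            del remaining[name]
        rank += 1
    return ranks


# Hypothetical (PLANTS score, GLIDE score) pairs for four generated molecules.
docking_scores = {
    "mol_a": (-95.0, -8.1),
    "mol_b": (-90.0, -9.0),
    "mol_c": (-85.0, -7.5),
    "mol_d": (-80.0, -6.0),
}

print(pareto_ranks(docking_scores))
# {'mol_a': 0, 'mol_b': 0, 'mol_c': 1, 'mol_d': 2}
```

In a genetic algorithm, molecules with a lower rank would be preferred as parents for the next generation; mol_a and mol_b share rank 0 because each is better than the other in one of the two objectives.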
Subjects
COVID-19 , Deep Learning , Humans , Algorithms , Drug Design
ABSTRACT
Conjugated QSPR models for reactions integrate fundamental chemical laws expressed by mathematical equations with machine learning algorithms. Herein we present a methodology for building conjugated QSPR models integrated with the Arrhenius equation. Conjugated QSPR models were used to predict kinetic characteristics of cycloaddition reactions related by the Arrhenius equation: rate constant $\log k$, pre-exponential factor $\log A$, and activation energy $E_{\rm a}$. They were benchmarked against single-task (individual and equation-based models) and multi-task models. In individual models, all characteristics were modeled separately, while in multi-task models $\log k$, $\log A$ and $E_{\rm a}$ were treated cooperatively. An equation-based model assessed $\log k$ using the Arrhenius equation and $\log A$ and $E_{\rm a}$ values predicted by individual models. It has been demonstrated that the conjugated QSPR models can accurately predict the reaction rate constants at extreme temperatures, at which reaction rate constants hardly can be measured experimentally. Also, in the case of small training sets conjugated models are more robust than related single-task approaches.
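The equation-based step described in this abstract, obtaining the rate constant from a predicted pre-exponential factor and activation energy, reduces to the decimal-logarithm form of the Arrhenius equation. The Python sketch below assumes the activation energy is given in kJ/mol and the temperature in kelvin; the function name and the numerical inputs are hypothetical.

```python
import math

# Gas constant in kJ mol^-1 K^-1 (assumes activation energies in kJ/mol).
R_KJ = 8.314462618e-3


def log_k_arrhenius(log_a, ea_kj_mol, temperature_k):
    """Decimal log of the rate constant from the Arrhenius equation.

    log k = log A - Ea / (ln(10) * R * T)
    """
    return log_a - ea_kj_mol / (math.log(10) * R_KJ * temperature_k)


# Hypothetical predicted values for one reaction: log A = 10, Ea = 50 kJ/mol.
for temperature in (200.0, 298.15, 500.0):
    print(temperature, round(log_k_arrhenius(10.0, 50.0, temperature), 2))
```

Because temperature enters only through this closed-form term, the same pair of predicted values yields a rate constant at any temperature, including ones outside the experimentally accessible range.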
ABSTRACT
The development of DNA-encoded library (DEL) technology introduced new challenges for the analysis of chemical libraries. It is often useful to consider a chemical library as a stand-alone chemoinformatic object (represented both as a collection of independent molecules and yet as an individual entity), in particular when the libraries are inseparable mixtures, like DELs. Herein, we introduce the concept of chemical library space (CLS), in which resident items are individual chemical libraries. We define and compare four vectorial library representations obtained using generative topographic mapping. These allow for an effective comparison of libraries, with the ability to tune and chemically interpret the similarity relationships. In particular, property-tuned CLS encodings make it possible to simultaneously compare libraries with respect to both property and chemotype distributions. We apply the various CLS encodings to the problem of selecting DELs that optimally "match" a reference collection (here ChEMBL28), showing how the choice of the CLS descriptors may help to fine-tune the "matching" (overlap) criteria. Hence, the proposed CLS may represent a new efficient way for polyvalent analysis of thousands of chemical libraries. Selection of an easily accessible compound collection for drug discovery, as a substitute for a difficult-to-produce reference library, can be tuned for either primary or target-focused screening, also considering property distributions of compounds. Alternatively, selection of libraries covering novel regions of the chemical space with respect to a reference compound subspace may serve for library portfolio enrichment.
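One plausible way to turn a library into a single vector on a map is to average the per-compound node responsibilities, and then compare libraries by the overlap of those vectors. The Python sketch below is an assumption-laden illustration and is not claimed to match any of the four representations defined in the paper: the 3-node "map", the responsibility values, and the generalized Tanimoto overlap are all hypothetical choices.

```python
def library_vector(responsibilities):
    """Average the per-compound responsibility vectors into one library vector.

    Each inner list holds one compound's responsibilities over the map nodes
    and is assumed to sum to 1.
    """
    n_compounds = len(responsibilities)
    n_nodes = len(responsibilities[0])
    return [
        sum(r[j] for r in responsibilities) / n_compounds
        for j in range(n_nodes)
    ]


def overlap(u, v):
    """Generalized Tanimoto coefficient between two non-negative vectors."""
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


# Hypothetical responsibilities on a toy 3-node map.
reference = library_vector([[0.7, 0.3, 0.0], [0.5, 0.5, 0.0]])
library_a = library_vector([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])
library_b = library_vector([[0.0, 0.1, 0.9], [0.0, 0.3, 0.7]])

# Rank candidate libraries by how well they "match" the reference.
for name, vec in (("library_a", library_a), ("library_b", library_b)):
    print(name, round(overlap(vec, reference), 3))
```

With such vectors in hand, selecting the library that best covers a reference collection becomes a nearest-neighbor search in the library space, and a low overlap flags a library that populates different map regions.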
Subjects
DNA , Small Molecule Libraries , Small Molecule Libraries/chemistry , DNA/chemistry , Gene Library , Drug Discovery/methods
ABSTRACT
Ab initio kinetic studies are important to understand and design novel chemical reactions. While the Artificial Force Induced Reaction (AFIR) method provides a convenient and efficient framework for kinetic studies, accurate explorations of reaction path networks incur high computational costs. In this article, we are investigating the applicability of Neural Network Potentials (NNP) to accelerate such studies. For this purpose, we are reporting a novel theoretical study of ethylene hydrogenation with a transition metal complex inspired by Wilkinson's catalyst, using the AFIR method. The resulting reaction path network was analyzed by the Generative Topographic Mapping method. The network's geometries were then used to train a state-of-the-art NNP model, to replace expensive ab initio calculations with fast NNP predictions during the search. This procedure was applied to run the first NNP-powered reaction path network exploration using the AFIR method. We discovered that such explorations are particularly challenging for general purpose NNP models, and we identified the underlying limitations. In addition, we are proposing to overcome these challenges by complementing NNP models with fast semiempirical predictions. The proposed solution offers a generally applicable framework, laying the foundations to further accelerate ab initio kinetic studies with Machine Learning Force Fields, and ultimately explore larger systems that are currently inaccessible.
Subjects
Computer Neural Networks , Kinetics , Hydrogenation
ABSTRACT
Catalyst optimization processes typically rely on inductive and qualitative assumptions of chemists based on screening data. While machine learning models using molecular properties or calculated 3D structures enable quantitative data evaluation, costly quantum chemical calculations are often required. In contrast, readily available binary fingerprint descriptors are time- and cost-efficient, but their predictive performance remains insufficient. Here, we describe a machine learning model based on fragment descriptors, which are fine-tuned for asymmetric catalysis and represent cyclic or polyaromatic hydrocarbons, enabling robust and efficient virtual screening. Using training data with only moderate selectivities, we designed theoretically and validated experimentally new catalysts showing higher selectivities in a challenging asymmetric tetrahydropyran synthesis.
ABSTRACT
In order to analyze the Chimiothèque Nationale (CN), the French National Compound Library, in the context of screening and biologically relevant compounds, the library was compared with the ZINC in-stock collection and ChEMBL. This includes the study of chemical space coverage, physicochemical properties and Bemis-Murcko (BM) scaffold populations. More than 5 K CN-unique scaffolds (relative to the ZINC and ChEMBL collections) were identified. Generative Topographic Maps (GTMs) accommodating those libraries were generated and used to compare the compound populations. Hierarchical GTM ("zooming") was applied to generate an ensemble of maps at various resolution levels, from global overview to precise mapping of individual structures. The respective maps were added to the ChemSpace Atlas website. The analysis of synthetic accessibility in the context of combinatorial chemistry showed that only 29.7% of CN compounds can be fully synthesized using commercially available building blocks.
Subjects
Chemical Databases
ABSTRACT
In order to better formalize it, the notorious inverse-QSAR problem (finding structures of given QSAR-predicted properties) is considered in this paper as a two-step process including (i) finding "seed" descriptor vectors corresponding to user-constrained QSAR model output values and (ii) identifying the chemical structures best matching the "seed" vectors. The main development effort here was focused on the latter stage, proposing a new conditional variational autoencoder neural-network architecture based on recent developments in attention-based methods. The obtained results show that this workflow was capable of generating compounds predicted to display the desired activity while being completely novel compared to the training database (ChEMBL). Moreover, the generated compounds show acceptable druglikeness and synthetic accessibility. Both pharmacophore and docking studies were carried out as "orthogonal" in silico validation methods, showing that some of the de novo structures are, beyond being predicted active by 2D-QSAR models, clearly able to match binding 3D pharmacophores and bind the protein pocket.
Subjects
Quantitative Structure-Activity Relationship , Molecular Docking Simulation
ABSTRACT
We report a novel approach for grading chemical structure drawings for remote teaching, integrated into the Moodle platform. Typically, existing online platforms use a binary grading system, which often fails to give a nuanced evaluation of the answers given by the students. Therefore, such platforms are unevenly adapted to different disciplines. This is particularly true in the case of chemical structures, where most questions simply cannot be evaluated on a true/false basis. Specifically, a strict comparison of candidate and expected chemical structures is not sufficient when some tolerance is deemed acceptable. To overcome this limitation, we have developed a grading workflow based on the pairwise similarity score of two considered chemical structures. This workflow is implemented as a Moodle plugin, using the Chemdoodle engine for drawing structures and communicating with a REST server to compute the similarity score using molecular descriptors. The plugin ( https://github.com/Laboratoire-de-Chemoinformatique/moodle-qtype_molsimilarity ) is easily adaptable to any academic user; both embedding and similarity measures can be configured.
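The idea of grading by pairwise similarity rather than by strict identity can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's code: structures are represented here as hypothetical sets of fragment labels, the Tanimoto coefficient is one common similarity measure among several, and the threshold and scoring rule are assumptions.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def grade(student, expected, max_points=1.0, threshold=0.5):
    """Award partial credit proportional to similarity above a threshold.

    Both max_points and threshold are hypothetical, teacher-tunable settings.
    """
    similarity = tanimoto(student, expected)
    return max_points * similarity if similarity >= threshold else 0.0


# Hypothetical fragment sets standing in for molecular descriptors.
expected_answer = {"C-C", "C-O", "C=O", "O-H"}
close_answer = {"C-C", "C-O", "C=O"}
wrong_answer = {"C-C", "C-N"}

print(grade(close_answer, expected_answer))  # 0.75
print(grade(wrong_answer, expected_answer))  # 0.0
```

A binary grader would score both answers as zero; the similarity-based rule separates a near miss from an unrelated structure, which is the nuance the abstract argues for.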
ABSTRACT
New models for ACE2 receptor binding, based on QSAR and docking algorithms, were developed using XRD structural data and ChEMBL 26 database hits as training sets. The selectivity of the potential ACE2-binding ligands towards Neprilysin (NEP) and ACE was evaluated. The Enamine screening collection (3.2 million compounds) was virtually screened according to the above models, in order to find possible ACE2 chemical probes, useful for the study of SARS-CoV-2-induced neurological disorders. An enzymology inhibition assay for ACE2 was optimized, and the combined diversified set of predicted selective ACE2-binding molecules from QSAR modeling, docking, and ultrafast docking was screened in vitro. The in vitro hits included two novel chemotypes suitable for further optimization.
Subjects
Angiotensin-Converting Enzyme 2 , COVID-19 , Humans , Molecular Docking Simulation , Peptidyl-Dipeptidase A/metabolism , Viral RNA , SARS-CoV-2
ABSTRACT
Nowadays, drug discovery is inevitably intertwined with the usage of large compound collections. Understanding their chemotype composition and physicochemical property profiles is of the highest importance for successful hit identification. Efficient polyfunctional tools allowing multifaceted analysis of constantly growing chemical libraries must be Big Data-compatible. Here, we present the freely accessible ChemSpace Atlas (https://chematlas.chimie.unistra.fr), which includes almost 40K hierarchically organized Generative Topographic Maps (GTMs) accommodating up to 500 M compounds covering fragment-like, lead-like, drug-like, PPI-like, and NP-like chemical subspaces. They allow users to navigate and analyze ZINC, ChEMBL, and COCONUT from multiple perspectives on different scales: from a bird's eye view of the entire library to structural pattern detection in small clusters. Around 20 physicochemical properties and almost 750 biological activities can be visualized (associated with map zones), supporting activity profiling and analogue search. Moreover, ChemSpace Atlas will be extended toward new chemical subspaces (e.g., DNA-encoded libraries and synthons) and functionalities (ADMETox profiling and property-guided de novo compound generation).