ABSTRACT
The recent advancements in machine learning and the new availability of large chemical datasets have made the development of tools and protocols for computational chemistry a topic of high interest. In this chapter, a standard procedure to develop Quantitative Structure-Activity Relationship (QSAR) models is presented and implemented in two freely available and easy-to-use workflows. The first workflow helps the user retrieve chemical data (SMILES) from the web, check their correctness, and curate them to produce consistent, ready-to-use datasets for cheminformatics. The second workflow implements six machine learning methods to develop classification QSAR models. The models can additionally be used to predict external chemicals. Calculation and selection of chemical descriptors, tuning of the models' hyperparameters, and methods to handle data imbalance are also incorporated in the workflow. Both workflows are implemented in KNIME and represent a useful tool for computational scientists, as well as an intuitive and straightforward introduction to QSAR.
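The abstract does not say which method the workflow uses to handle class imbalance. As an illustration only, the sketch below shows random oversampling, one simple option for a classification QSAR training set. The function name `random_oversample` and the toy SMILES/label records are assumptions, not part of the KNIME workflows.

```python
import random
from collections import Counter


def random_oversample(records, seed=0):
    """Duplicate minority-class records at random until all classes are equal in size.

    records: list of (smiles, label) tuples. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[1], []).append(rec)
    # Every class is grown to the size of the largest class.
    target = max(len(recs) for recs in by_label.values())
    balanced = []
    for recs in by_label.values():
        balanced.extend(recs)
        balanced.extend(rng.choice(recs) for _ in range(target - len(recs)))
    rng.shuffle(balanced)
    return balanced


# Toy training set: four inactive (0) compounds and one active (1) compound.
training = [("CCO", 0), ("CCN", 0), ("c1ccccc1", 0), ("CC(=O)O", 0), ("CCCl", 1)]
balanced = random_oversample(training)
counts = Counter(label for _, label in balanced)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated compounds do not leak into the test set.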
Subjects
Data Curation , Machine Learning , Quantitative Structure-Activity Relationship , Workflow , Data Curation/methods , Software , Cheminformatics/methods , Computational Biology/methods
ABSTRACT
The Asclepios suite of KNIME nodes represents an innovative solution for conducting cheminformatics and computational chemistry tasks, specifically tailored for applications in drug discovery and computational toxicology. This suite has been developed using open-source and publicly accessible software. In this chapter, we introduce and explore the Asclepios suite through the lens of a case study. This case study revolves around investigating the interactions between per- and polyfluorinated alkyl substances (PFAS) and biomolecules, such as nuclear receptors. The objective is to characterize the potential toxicity of PFAS and gain insights into their chemical mode of action at the molecular level. The Asclepios KNIME nodes have been designed as versatile tools capable of addressing a wide range of computational toxicology challenges. Furthermore, they can be adapted and customized to accommodate the specific needs of individual users, spanning various domains such as nanoinformatics, biomedical research, and other related applications. This chapter provides an in-depth examination of the technical underpinnings and foundations of these tools. It is accompanied by a practical case study that demonstrates the utilization of Asclepios nodes in a computational toxicology investigation. This showcases the extendable functionalities that can be applied in diverse computational chemistry contexts. By the end of this chapter, we aim for readers to have a comprehensive understanding of the effectiveness of the Asclepios node functions. These functions hold significant potential for enhancing a wide spectrum of cheminformatics applications.
Subjects
Drug Discovery , Software , Workflow , Drug Discovery/methods , Humans , Toxicology/methods , Cheminformatics/methods , Computational Biology/methods , Fluorocarbons/chemistry , Fluorocarbons/toxicity
ABSTRACT
The fraction unbound in plasma (fu,p) of drugs is a significant factor for drug delivery and other biological processes related to the pharmacokinetic behaviour of drugs. Exploring the molecular fragments that govern the fu,p of small molecules can facilitate the identification of suitable candidates in the preliminary stage of drug discovery. Several researchers have built prediction models for the fu,p of different drugs. However, these studies did not focus on identifying the molecular fragments responsible for determining the fraction unbound in plasma. In the current work, we focused on developing robust classification-based QSAR models and evaluated these models with multiple statistical metrics to identify essential molecular fragments/structural attributes for the fraction unbound in plasma. The study suggests that various N-containing aromatic rings and aliphatic groups have positive influences, and sulphur-containing thiadiazole rings have negative influences, on the fu,p values. The molecular fragments may help to assess the fu,p values of different small molecules/drugs more quickly than experiment-based in vivo and in vitro studies.
Subjects
Cheminformatics , Quantitative Structure-Activity Relationship , Humans , Cheminformatics/methods , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/blood , Drug Discovery/methods , Plasma/chemistry
ABSTRACT
Organizing and partitioning sets of chemical structures is of considerable practical significance, e.g., in compound library analysis and the postprocessing of screening hit lists. Approaches such as unsupervised clustering are computationally demanding and dataset-dependent; on the other hand, rule-based methods, such as those based on Murcko scaffolds, have linear time complexity but are often too fine-grained, leading to a large number of singletons or sparsely populated classes. An alternative rule-based method that seeks to achieve an optimal balance when grouping compounds into sets is the 'Scaffold Identification and Naming System' (SCINS). To facilitate public use of this previously published method, here, we provide an open-source Python implementation of SCINS, dependent only on RDKit. We show that SCINS can be useful in identifying sparsely and densely populated regions in chemical space in large databases, here exemplified with Enamine REAL Diverse and ChEMBL. We find that Enamine REAL Diverse covers a much smaller SCINS space relative to ChEMBL, whereas the opposite is true when Murcko and generic Murcko scaffolds are considered. Additionally, we show that SCINS can result in chemically intuitive grouping of medium-sized sets of bioactive compounds, which can be useful in compound selection from virtual screening campaigns as well as postprocessing of experimental hit lists. Hence, in this work, we provide both an open-source implementation of SCINS and its characterization with relevant use cases.
Subjects
Databases, Chemical , Cheminformatics/methods , Small Molecule Libraries/chemistry , Software
ABSTRACT
Ultralarge virtual chemical spaces have emerged as a valuable resource for drug discovery, providing access to billions of make-on-demand compounds with high synthetic success rates. Chemical language models can potentially accelerate the exploration of these vast spaces through direct compound generation. However, existing models are not designed to navigate specific virtual chemical spaces and often overlook synthetic accessibility. To address this gap, we introduce product-of-experts (PoE) chemical language models, a modular and scalable approach to navigating ultralarge virtual chemical spaces. This method allows for controlled compound generation within a desired chemical space by combining a prior model pretrained on the target space with expert and anti-expert models fine-tuned using external property-specific data sets. We demonstrate that the PoE chemical language model can generate compounds with desirable properties, such as those that favorably dock to dopamine receptor D2 (DRD2) and are predicted to cross the blood-brain barrier (BBB), while ensuring that the majority of generated compounds are present within the target chemical space. Our results highlight the potential of chemical language models for navigating ultralarge virtual chemical spaces, and we anticipate that this study will motivate further research in this direction. The source code and data are freely available at https://github.com/shuyana/poeclm.
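The abstract does not give the exact rule used to combine the prior, expert, and anti-expert models. A minimal sketch of one common product-of-experts formulation for next-token sampling is shown below, assuming the combined log-probability is the prior log-probability plus a weighted difference between expert and anti-expert log-probabilities. The function names, the weight `alpha`, and the toy four-token vocabulary are illustrative assumptions and are not taken from the poeclm code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_next_token_probs(prior_logits, expert_logits, anti_logits, alpha=1.0):
    """Assumed PoE rule: log p(t) = log p_prior(t) + alpha * (log p_exp(t) - log p_anti(t)) + const."""
    lp = log_softmax(prior_logits)
    le = log_softmax(expert_logits)
    la = log_softmax(anti_logits)
    combined = [p + alpha * (e - a) for p, e, a in zip(lp, le, la)]
    # Renormalize so the combined scores form a probability distribution.
    return [math.exp(x) for x in log_softmax(combined)]


# Toy vocabulary of SMILES tokens with made-up logits.
vocab = ["C", "N", "O", "F"]
prior = [2.0, 1.0, 1.0, 0.0]    # model pretrained on the target chemical space
expert = [1.0, 2.0, 1.0, 0.0]   # fine-tuned on compounds with the desired property
anti = [1.0, 0.0, 1.0, 2.0]     # fine-tuned on compounds without it

prior_probs = [math.exp(x) for x in log_softmax(prior)]
poe_probs = poe_next_token_probs(prior, expert, anti, alpha=1.0)
```

In this toy case the token favored by the expert ("N") gains probability relative to the prior and the token favored by the anti-expert ("F") loses it, while setting `alpha=0` recovers the prior distribution unchanged.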
Subjects
Drug Discovery , Drug Discovery/methods , Models, Chemical , Cheminformatics/methods , Blood-Brain Barrier/metabolism , Molecular Docking Simulation , Receptors, Dopamine D2/metabolism , Receptors, Dopamine D2/chemistry , Humans
ABSTRACT
Ubiquitin-specific peptidase 7 (USP7) is a deubiquitinating enzyme that mediates the stability and activity of numerous proteins. At basal expression levels, USP7 stabilizes p53 protein, even in the presence of excess MDM2. However, its overexpression leads to the deubiquitination of MDM2 at a rate faster than p53, leading to p53 degradation and pro-tumorigenic roles. Consequently, it is an attractive target for anticancer drug discovery via the modulation of its allosteric site from which the protein is activated. In this study, molecular modeling techniques and cheminformatics approaches were employed to unravel the potential of eighty compounds to serve as its allosteric site modulators. The compounds were initially subjected to virtual screening. Subsequently, the binding free energies of the top four compounds with the highest binding affinities were calculated, and their drug-likeness, pharmacokinetic, and toxicity profiles were evaluated. Ultimately, the complexes of the protein and hit compounds were subjected to a 100-nanosecond (ns) molecular dynamics simulation. The results of the study revealed eight compounds from the compound library with docking scores ranging from -7.491 to -11.43 kcal/mol, compared to P217564, which exhibited a docking score of -5.671 kcal/mol. The top four compounds with the highest affinities possessed drug-like properties and good pharmacokinetic and toxicity profiles, and their predicted inhibitory potentials showed they will be effective at minimal concentration. Also, molecular dynamics simulation confirmed the stability of the protein-ligand complexes. In conclusion, the compounds identified in this study are worthy of further evaluation for the development of allosteric site modulators of USP7.
Subjects
Allosteric Site , Molecular Docking Simulation , Molecular Dynamics Simulation , Ubiquitin-Specific Peptidase 7 , Ubiquitin-Specific Peptidase 7/metabolism , Ubiquitin-Specific Peptidase 7/antagonists & inhibitors , Ubiquitin-Specific Peptidase 7/chemistry , Humans , Cheminformatics/methods , Drug Discovery , Protein Binding , Ligands , Proto-Oncogene Proteins c-mdm2/metabolism , Proto-Oncogene Proteins c-mdm2/chemistry
ABSTRACT
A knowledge graph (KG) is a technique for modeling entities and their interrelations. Knowledge graph embedding (KGE) translates these entities and relationships into a continuous vector space to facilitate dense and efficient representations. In the domain of chemistry, applying KG and KGE techniques integrates heterogeneous chemical information into a coherent and user-friendly framework, enhances the representation of chemical data features, and is beneficial for downstream tasks, such as chemical property prediction. This paper begins with a comprehensive review of classical and contemporary KGE methodologies, including distance-based models, semantic matching models, and neural network-based approaches. We then catalogue the primary databases employed in chemistry and biochemistry that furnish the KGs with essential chemical data. Subsequently, we explore the latest applications of KG and KGE in chemistry, focusing on risk assessment, property prediction, and drug discovery. Finally, we discuss the current challenges to KG and KGE techniques and provide a perspective on their potential future developments.
Subjects
Neural Networks, Computer , Drug Discovery/methods , Cheminformatics/methods , Databases, Chemical , Humans
ABSTRACT
Natural products (NPs) are secondary metabolites of natural origin with broad applications across various human activities, particularly the discovery of bioactive compounds. Structural elucidation of new NPs entails significant cost and effort. On the other hand, the dereplication of known compounds is crucial for the early exclusion of irrelevant compounds in contemporary pharmaceutical research. NAPROC-13 stands out as a publicly accessible database, providing structural and 13C NMR spectroscopic information for over 25 000 compounds, rendering it a pivotal resource in natural product (NP) research, favoring open science. This study seeks to quantitatively analyze the chemical content, structural diversity, and chemical space coverage of NPs within NAPROC-13, compared to FDA-approved drugs and a very diverse subset of NPs, UNPD-A. Findings indicated that NPs in NAPROC-13 exhibit properties comparable to those in UNPD-A, albeit showcasing a notably diverse array of structural content, scaffolds, ring systems of pharmaceutical interest, and molecular fragments. NAPROC-13 covers a specific region of the chemical multiverse (a generalization of the chemical space from different chemical representations) regarding physicochemical properties and a region as broad as UNPD-A in terms of the structural features represented by fingerprints.
Subjects
Biological Products , Biological Products/chemistry , Molecular Structure , Cheminformatics/methods , Carbon-13 Magnetic Resonance Spectroscopy
ABSTRACT
Analyzing machine learning models, especially nonlinear ones, poses significant challenges. In this context, centered kernel alignment (CKA) has emerged as a promising model analysis tool that assesses the similarity between two embeddings. CKA's efficacy depends on selecting a kernel that adequately captures the underlying properties of the compared models. The model analysis tool was designed for neural networks (NNs) with their invariance to data rotation in mind and has been successfully employed in various scientific domains. However, CKA has rarely been adopted in cheminformatics, partly because of the popularity of the random forest (RF) machine learning algorithm, which is not rotationally invariant. In this work, we present the adaptation of CKA that builds on the RF kernel to match the properties of RF. As part of the method validation, we show that the model analysis method is well-correlated with the prediction similarity of RF models. Furthermore, we demonstrate how CKA with the RF kernel can be utilized to analyze and explain the behavior of RF models derived from molecular and rooted fingerprints.
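To make the idea concrete, the sketch below computes kernel CKA between two random forests from their leaf assignments. It assumes the RF kernel is the usual proximity measure (the fraction of trees in which two samples fall into the same leaf); the kernel actually used in the paper may differ. The function names and the toy leaf indices are illustrative.

```python
import math


def rf_kernel(leaves):
    """leaves[i][t] is the leaf index that sample i reaches in tree t.

    K[i][j] = fraction of trees in which samples i and j share a leaf (assumed kernel).
    """
    n, n_trees = len(leaves), len(leaves[0])
    return [[sum(a == b for a, b in zip(leaves[i], leaves[j])) / n_trees
             for j in range(n)] for i in range(n)]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka(K, L):
    """CKA(K, L) = <Kc, Lc> / sqrt(<Kc, Kc> <Lc, Lc>) on centered Gram matrices."""
    Kc, Lc = center(K), center(L)
    return frobenius(Kc, Lc) / math.sqrt(frobenius(Kc, Kc) * frobenius(Lc, Lc))


# Toy leaf assignments: 4 molecules, 3 trees per forest (made-up values).
leaves_a = [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]
leaves_b = [[0, 1, 0], [0, 1, 1], [1, 0, 1], [1, 0, 0]]
K_a, K_b = rf_kernel(leaves_a), rf_kernel(leaves_b)
similarity = cka(K_a, K_b)
```

A forest compared with itself gives a CKA of 1, and lower values indicate that the two models group the same molecules differently.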
Subjects
Machine Learning , Neural Networks, Computer , Algorithms , Cheminformatics/methods , Models, Molecular
ABSTRACT
With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. 
Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.
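As a rough illustration of similarity-based descriptors computed from a query compound's close source neighbors, the sketch below uses Tanimoto similarity on sets of fingerprint "on" bits. The descriptor names (`mean_sim`, `max_sim_pos`, `max_sim_neg`, `weighted_pos_frac`) and the toy fingerprints are assumptions for illustration; they are not the authors' actual RASAR descriptor set.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rasar_like_descriptors(query, sources, k=3):
    """Summarize the k most similar source compounds of a query compound.

    sources: list of (bit_set, label) with label 1 = hepatotoxic, 0 = non-hepatotoxic.
    """
    ranked = sorted(((tanimoto(query, fp), label) for fp, label in sources),
                    key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in ranked)
    pos_total = sum(sim for sim, label in ranked if label == 1)
    return {
        "mean_sim": total / len(ranked),
        "max_sim_pos": max((sim for sim, label in ranked if label == 1), default=0.0),
        "max_sim_neg": max((sim for sim, label in ranked if label == 0), default=0.0),
        # Similarity-weighted share of positive neighbors among the k neighbors.
        "weighted_pos_frac": pos_total / total if total else 0.0,
    }


# Toy source set: fingerprints as sets of bit positions, with class labels.
sources = [({1, 2, 3, 4}, 1), ({1, 2, 3, 5}, 1), ({7, 8, 9}, 0), ({1, 9, 10}, 0)]
query = {1, 2, 3}
descriptors = rasar_like_descriptors(query, sources, k=3)
```

Descriptors of this kind can then be fed, alongside or instead of conventional QSAR descriptors, into a classifier such as the LDA model mentioned above. A query whose nearest neighbors are highly similar but carry mixed labels is a candidate activity cliff.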
Subjects
Chemical and Drug Induced Liver Injury , Quantitative Structure-Activity Relationship , Chemical and Drug Induced Liver Injury/etiology , Algorithms , Machine Learning , Humans , Cheminformatics/methods
ABSTRACT
Chemical space is a multidimensional descriptor space that encloses all possible molecules, and at least 1 × 10⁶⁰ organic substances with a molecular weight below 500 Da are thought to be potentially relevant for drug discovery. Natural products have been the primary source of the new pharmacological entities marketed during the past forty years and continue to be one of the most productive sources for the creation of innovative medications. Chemoinformatics-based computational tools accelerate the drug development process for natural products. Methods including estimating bioactivities, safety profiles, ADME, and natural product likeness measurement have been used. Here, we go over recent developments in chemoinformatic tools designed to visualize, characterize, and expand the chemical space of natural compound data sets using various molecular representations, create visual representations of such spaces, and investigate structure-property relationships within chemical spaces. With an emphasis on drug discovery applications, we evaluate the open-source databases BIOFACQUIM and PeruNPDB as proof of concept.
Subjects
Biological Products , Drug Discovery , Biological Products/chemistry , Drug Discovery/methods , Cheminformatics/methods , Databases, Chemical
ABSTRACT
This article aims to provide a comprehensive critical, yet readable, review of general interest to the chemistry community on molecular similarity as applied to chemical informatics and predictive modeling with a special focus on read-across (RA) and read-across structure-activity relationships (RASAR). Molecular similarity-based computational tools, such as quantitative structure-activity relationships (QSARs) and RA, are routinely used to fill the data gaps for a wide range of properties including toxicity endpoints for regulatory purposes. This review will explore the background of RA starting from how structural information has been used through to how other similarity contexts such as physicochemical, absorption, distribution, metabolism, and elimination (ADME) properties, and biological aspects are being characterized. More recent developments of RA's integration with QSAR have resulted in the emergence of novel models such as ToxRead, generalized read-across (GenRA), and quantitative RASAR (q-RASAR). Conventional QSAR techniques have been excluded from this review except where necessary for context.
Subjects
Machine Learning , Quantitative Structure-Activity Relationship , Humans , Cheminformatics/methods , Structure-Activity Relationship , Animals
ABSTRACT
The highly pathogenic Marburg virus (MARV), a member of the Filoviridae family, is a non-segmented negative-strand RNA virus. This article presents a computer-aided drug design (CADD) approach for identifying drug-like compounds that may prevent MARV disease by inhibiting the nucleoprotein, which is responsible for viral replication. This study used a wide range of in silico drug design techniques to identify potential drugs. Out of 368 natural compounds, 202 compounds passed ADMET, and molecular docking identified the top two molecules (CID: 1804018 and 5280520) with high binding affinities of -6.77 and -6.672 kcal/mol, respectively. Both compounds showed interactions with the common amino acid residues SER_216, ARG_215, TYR_135, CYS_195, and ILE_108, which indicates that the lead compounds and control ligands interact in the common active site/catalytic site of the protein. The binding free energies of CID: 1804018 and 5280520 were -66.01 and -31.29 kcal/mol, respectively. The two lead compounds were re-evaluated using MD modeling techniques, which confirmed CID: 1804018 as the most stable when complexed with the target protein. PC3 of (Z)-2-(2,5-dimethoxybenzylidene)-6-(2-(4-methoxyphenyl)-2-oxoethoxy) benzofuran-3(2H)-one (CID: 1804018) was 8.74 %, whereas PC3 of 2'-Hydroxydaidzein (CID: 5280520) was 11.25 %. Across the ADMET, molecular docking, MM-GBSA, and MD simulation analyses, (Z)-2-(2,5-dimethoxybenzylidene)-6-(2-(4-methoxyphenyl)-2-oxoethoxy) benzofuran-3(2H)-one (CID: 1804018), which is found in Angelica archangelica, showed significant stability at the protein's binding site and a strongly negative binding free energy, identifying it as the best drug candidate, one that may potentially inhibit MARV replication by targeting the nucleoprotein.
Subjects
Antiviral Agents , Benzofurans , Marburgvirus , Molecular Docking Simulation , Virus Replication , Antiviral Agents/pharmacology , Antiviral Agents/chemistry , Antiviral Agents/metabolism , Marburgvirus/drug effects , Marburgvirus/metabolism , Benzofurans/pharmacology , Benzofurans/chemistry , Benzofurans/metabolism , Virus Replication/drug effects , Cheminformatics/methods , Drug Design , Protein Binding , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/chemistry , Binding Sites , Ligands
ABSTRACT
Marine natural products (MNPs) continue to be tested primarily in cellular toxicity assays, both mammalian and microbial, despite most being inactive at concentrations relevant to drug discovery. These MNPs become missed opportunities and represent a wasteful use of precious bioresources. The use of cheminformatics aligned with published bioactivity data can provide insights to direct the choice of bioassays for the evaluation of new MNPs. Cheminformatics analysis of MNPs found in MarinLit (n = 39,730) up to the end of 2023 highlighted indol-3-yl-glyoxylamides (IGAs, n = 24) as a group of MNPs with no reported bioactivities. However, a recent review of synthetic IGAs highlighted these scaffolds as privileged structures with several compounds under clinical evaluation. Herein, we report the synthesis of a library of 32 MNP-inspired brominated IGAs (25-56) using a simple one-pot, multistep method affording access to these diverse chemical scaffolds. Directed by a meta-analysis of the biological activities reported for marine indole alkaloids (MIAs) and synthetic IGAs, the brominated IGAs 25-56 were examined for their potential bioactivities against the Parkinson's Disease amyloid protein alpha synuclein (α-syn), antiplasmodial activities against chloroquine-resistant (3D7) and sensitive (Dd2) parasite strains of Plasmodium falciparum, and inhibition of mammalian (chymotrypsin and elastase) and viral (SARS-CoV-2 3CLpro) proteases. All of the synthetic IGAs tested exhibited binding affinity to the amyloid protein α-syn, while some showed inhibitory activities against P. falciparum, and the proteases, SARS-CoV-2 3CLpro, and chymotrypsin. The cellular safety of the IGAs was examined against cancerous and non-cancerous human cell lines, with all of the compounds tested inactive, thereby validating cheminformatics and meta-analyses results. 
The findings presented herein expand our knowledge of marine IGA bioactive chemical space and advocate expanding the scope of biological assays routinely used to investigate NP bioactivities, specifically those more suitable for non-toxic compounds. By integrating cheminformatics tools and functional assays into NP biological testing workflows, we can aim to enhance the potential of NPs and their scaffolds for future drug discovery and development.
Subjects
Biological Products , Cheminformatics , Drug Discovery , Biological Products/chemistry , Biological Products/pharmacology , Humans , Cheminformatics/methods , SARS-CoV-2/drug effects , Aquatic Organisms/chemistry , Indoles/chemistry , Indoles/pharmacology , Plasmodium falciparum/drug effects , Indole Alkaloids/pharmacology , Indole Alkaloids/chemistry , Animals
ABSTRACT
Indoleamine 2,3-dioxygenase (IDO) and tryptophan 2,3-dioxygenase (TDO) are attractive drug targets for cancer immunotherapy. After the disappointing results of epacadostat, a selective IDO inhibitor, in phase III clinical trials, there is much interest in the development of TDO-selective inhibitors. In the current study, several data analysis methods and machine learning approaches, including logistic regression, Random Forest, XGBoost, and Support Vector Machines, were used to model a data set of compounds retrieved from ChEMBL. Models based on Morgan fingerprints revealed notable fragments for the selective inhibition of IDO, TDO, or both. Multiple fragment docking was performed to find the best set of bound fragments and their orientation in space for efficient linking. Linking of the fragments and optimization of the final molecules were accomplished by means of an artificial intelligence generative framework. Finally, the selectivity of the optimized molecules was assessed and the top 4 lead molecules were filtered through PAINS, Brenk, and NIH filters. Results indicated that phenyloxalamide, fluoroquinoline, and 3-bromo-4-fluoroaniline confer selectivity towards IDO inhibition. Correspondingly, 1-benzyl-1H-naphtho[2,3-d][1,2,3]triazole-4,9-dione was found to be an integral fragment for the selective inhibition of TDO by forming a coordination bond with the Fe atom of heme. In addition, furo[2,3-c]pyridine-2,3-diamine was found to be a common fragment for inhibition of both targets and can be used in the design of dual-target inhibitors of IDO and TDO. The new fragments introduced here can be useful building blocks for incorporation into selective TDO or dual IDO/TDO inhibitors.
Subjects
Cheminformatics , Enzyme Inhibitors , Indoleamine-Pyrrole 2,3,-Dioxygenase , Machine Learning , Tryptophan Oxygenase , Indoleamine-Pyrrole 2,3,-Dioxygenase/antagonists & inhibitors , Indoleamine-Pyrrole 2,3,-Dioxygenase/chemistry , Indoleamine-Pyrrole 2,3,-Dioxygenase/metabolism , Tryptophan Oxygenase/antagonists & inhibitors , Tryptophan Oxygenase/metabolism , Tryptophan Oxygenase/chemistry , Humans , Cheminformatics/methods , Enzyme Inhibitors/chemistry , Molecular Docking Simulation
ABSTRACT
The development of new treatments for neglected tropical diseases (NTDs) remains a major challenge in the 21st century. In most cases, the available drugs are obsolete and have limitations in terms of efficacy and safety. The situation becomes even more complex when considering the low number of new chemical entities (NCEs) currently in use in advanced clinical trials for most of these diseases. Natural products (NPs) are valuable sources of hits and lead compounds with privileged scaffolds for the discovery of new bioactive molecules. Considering the relevance of biodiversity for drug discovery, a chemoinformatics analysis was conducted on a compound dataset of NPs with anti-trypanosomatid activity reported in 497 research articles from 2019 to 2024. Structures corresponding to different metabolic classes were identified, including terpenoids, benzoic acids, benzenoids, steroids, alkaloids, phenylpropanoids, peptides, flavonoids, polyketides, lignans, cytochalasins, and naphthoquinones. This unique collection of NPs occupies regions of the chemical space with drug-like properties that are relevant to anti-trypanosomatid drug discovery. The gathered information greatly enhanced our understanding of biologically relevant chemical classes, structural features, and physicochemical properties. These results can be useful in guiding future medicinal chemistry efforts for the development of NP-inspired NCEs to treat NTDs caused by trypanosomatid parasites.
Subjects
Biodiversity , Biological Products , Cheminformatics , Drug Discovery , Neglected Diseases , Animals , Humans , Biological Products/chemistry , Biological Products/pharmacology , Biological Products/therapeutic use , Cheminformatics/methods , Drug Discovery/methods , Neglected Diseases/drug therapy , Trypanocidal Agents/chemistry , Trypanocidal Agents/pharmacology , Trypanocidal Agents/therapeutic use , Trypanosoma/drug effects
ABSTRACT
This study addresses the challenge of accurately identifying stereoisomers in cheminformatics, which originates from our objective to apply machine learning to predict the association constant between cyclodextrin and a guest. Identifying stereoisomers is indeed crucial for machine learning applications. Current tools offer various molecular descriptors, including their textual representation as Isomeric SMILES that can distinguish stereoisomers. However, such representation is text-based and does not have a fixed size, so a conversion is needed to make it usable to machine learning approaches. Word embedding techniques can be used to solve this problem. Mol2vec, a word embedding approach for molecules, offers such a conversion. Unfortunately, it cannot distinguish between stereoisomers due to its inability to capture the spatial configuration of molecular structures. This study proposes several approaches that use word embedding techniques to handle molecular discrimination using stereochemical information on molecules or considering Isomeric SMILES notation as a text in Natural Language Processing. Our aim is to generate a distinct vector for each unique molecule, correctly identifying stereoisomer information in cheminformatics. The proposed approaches are then compared to our original machine learning task: predicting the association constant between cyclodextrin and a guest molecule.
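Treating Isomeric SMILES as text starts with tokenization. The sketch below uses a regular-expression tokenizer of the kind commonly applied to SMILES before word embedding; the pattern and function name are illustrative assumptions, not the authors' implementation. It shows that two stereoisomers, which share the same atoms and bonds, produce different token sequences once the chirality marks (`@`, `@@`) and bond-direction marks (`/`, `\`) are kept.

```python
import re

# Bracket atoms (which carry chirality marks such as @ or @@) are kept as single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|@|[BCNOSPFIbcnosp]|[=#$/\\\-+().:~*]|%\d{2}|\d)"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, preserving stereochemical marks."""
    return SMILES_TOKEN.findall(smiles)


# L- and D-alanine differ only in the chirality mark on the alpha carbon.
l_alanine = "N[C@@H](C)C(=O)O"
d_alanine = "N[C@H](C)C(=O)O"
tokens_l = tokenize(l_alanine)
tokens_d = tokenize(d_alanine)

# Two 1,2-difluoroethene isomers differ only in a bond-direction mark.
isomer_a = tokenize("F/C=C/F")
isomer_b = tokenize("F/C=C\\F")
```

Because each token sequence is distinct, an embedding trained on these tokens can in principle assign different vectors to the two stereoisomers, which is the property the study is after.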
Subjects
Machine Learning , Stereoisomerism , Cheminformatics/methods , Cyclodextrins/chemistry , Natural Language Processing
ABSTRACT
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
Subjects
Machine Learning, Data Mining/methods, Chemical Databases, Algorithms, Cheminformatics/methods
ABSTRACT
One of the most challenging tasks in modern medicine is to find novel efficient cancer therapeutic methods with minimal side effects. The recent discovery of several classes of organic molecules known as "molecular jackhammers" is a promising development in this direction. It is known that these molecules can directly target and eliminate cancer cells with no impact on healthy tissues. However, the underlying microscopic picture remains poorly understood. We present a study that utilizes theoretical analysis together with experimental measurements to clarify the microscopic aspects of jackhammers' anticancer activities. Our physical-chemical approach combines statistical analysis with chemoinformatics methods to design and optimize molecular jackhammers. By correlating specific physical-chemical properties of these molecules with their abilities to kill cancer cells, several important structural features are identified and discussed. Although our theoretical analysis enhances understanding of the molecular interactions of jackhammers, it also highlights the need for further research to comprehensively elucidate their mechanisms and to develop a robust physical-chemical framework for the rational design of targeted anticancer drugs.
Subjects
Antineoplastic Agents, Cheminformatics, Humans, Antineoplastic Agents/pharmacology, Antineoplastic Agents/chemistry, Cheminformatics/methods, Neoplasms/drug therapy, Neoplasms/pathology, Tumor Cell Line, Molecular Models
ABSTRACT
Chemical information has become increasingly ubiquitous and has outstripped the pace of analysis and interpretation. We have developed an R package, uafR, that automates a grueling retrieval process for gas chromatography coupled mass spectrometry (GC-MS) data and allows anyone interested in chemical comparisons to quickly perform advanced structural similarity matches. Our streamlined cheminformatics workflows allow anyone with basic experience in R to pull out component areas for tentative compound identifications using the best published understanding of molecules across samples (pubchem.gov). Interpretations can now be done at a fraction of the time, cost, and effort it would typically take using a standard chemical ecology data analysis pipeline. The package was tested in two experimental contexts: (1) a dataset of purified internal standards, which showed that our algorithms correctly identified the known compounds with R² values ranging from 0.827 to 0.999 along concentrations ranging from 1 × 10⁻⁵ to 1 × 10³ ng/µl; (2) a large, previously published dataset, where the number and types of compounds identified were comparable (or identical) to those identified with the traditional manual peak annotation process, and NMDS analysis of the compounds produced the same pattern of significance as in the original study. Both the speed and accuracy of GC-MS data processing are drastically improved with uafR because it allows users to fluidly interact with their experiment following tentative library identifications [i.e. after the m/z spectra have been matched against an installed chemical fragmentation database (e.g. NIST)]. Use of uafR will allow larger datasets to be collected and systematically interpreted quickly. Furthermore, the functions of uafR could allow backlogs of previously collected and annotated data to be processed by new personnel or students as they are being trained. This is critical as we enter the era of exposomics, metabolomics, volatilomes, and landscape-level, high-throughput chemotyping. This package was developed to advance collective understanding of chemical data and is applicable to any research that benefits from GC-MS analysis. It can be downloaded for free along with sample datasets from GitHub at github.org/castratton/uafR or installed directly from R or RStudio using the developer tools: 'devtools::install_github("castratton/uafR")'.