Results 1 - 20 of 371
1.
Methods Mol Biol ; 2834: 115-130, 2025.
Article in English | MEDLINE | ID: mdl-39312162

ABSTRACT

The recent advancements in machine learning and the new availability of large chemical datasets have made the development of tools and protocols for computational chemistry a topic of high interest. In this chapter, a standard procedure to develop Quantitative Structure-Activity Relationship (QSAR) models is presented and implemented in two freely available and easy-to-use workflows. The first workflow helps the user retrieve chemical data (SMILES) from the web, check their correctness, and curate them to produce consistent and ready-to-use datasets for cheminformatics. The second workflow implements six machine learning methods to develop classification QSAR models. Models can additionally be used to predict external chemicals. Calculation and selection of chemical descriptors, tuning of models' hyperparameters, and methods to handle data imbalance are also incorporated in the workflow. Both workflows are implemented in KNIME and represent a useful tool for computational scientists, as well as an intuitive and straightforward introduction to QSAR.


Subject(s)
Data Curation , Machine Learning , Quantitative Structure-Activity Relationship , Workflow , Data Curation/methods , Software , Cheminformatics/methods , Computational Biology/methods
2.
Methods Mol Biol ; 2834: 393-441, 2025.
Article in English | MEDLINE | ID: mdl-39312176

ABSTRACT

The Asclepios suite of KNIME nodes represents an innovative solution for conducting cheminformatics and computational chemistry tasks, specifically tailored for applications in drug discovery and computational toxicology. This suite has been developed using open-source and publicly accessible software. In this chapter, we introduce and explore the Asclepios suite through the lens of a case study. This case study revolves around investigating the interactions between per- and polyfluorinated alkyl substances (PFAS) and biomolecules, such as nuclear receptors. The objective is to characterize the potential toxicity of PFAS and gain insights into their chemical mode of action at the molecular level. The Asclepios KNIME nodes have been designed as versatile tools capable of addressing a wide range of computational toxicology challenges. Furthermore, they can be adapted and customized to accommodate the specific needs of individual users, spanning various domains such as nanoinformatics, biomedical research, and other related applications. This chapter provides an in-depth examination of the technical underpinnings and foundations of these tools. It is accompanied by a practical case study that demonstrates the utilization of Asclepios nodes in a computational toxicology investigation. This showcases the extendable functionalities that can be applied in diverse computational chemistry contexts. By the end of this chapter, we aim for readers to have a comprehensive understanding of the effectiveness of the Asclepios node functions. These functions hold significant potential for enhancing a wide spectrum of cheminformatics applications.


Subject(s)
Drug Discovery , Software , Workflow , Drug Discovery/methods , Humans , Toxicology/methods , Cheminformatics/methods , Computational Biology/methods , Fluorocarbons/chemistry , Fluorocarbons/toxicity
3.
Crit Rev Toxicol ; 54(9): 659-684, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39225123

ABSTRACT

This article aims to provide a comprehensive critical, yet readable, review of general interest to the chemistry community on molecular similarity as applied to chemical informatics and predictive modeling with a special focus on read-across (RA) and read-across structure-activity relationships (RASAR). Molecular similarity-based computational tools, such as quantitative structure-activity relationships (QSARs) and RA, are routinely used to fill the data gaps for a wide range of properties including toxicity endpoints for regulatory purposes. This review will explore the background of RA starting from how structural information has been used through to how other similarity contexts such as physicochemical, absorption, distribution, metabolism, and elimination (ADME) properties, and biological aspects are being characterized. More recent developments of RA's integration with QSAR have resulted in the emergence of novel models such as ToxRead, generalized read-across (GenRA), and quantitative RASAR (q-RASAR). Conventional QSAR techniques have been excluded from this review except where necessary for context.


Subject(s)
Machine Learning , Quantitative Structure-Activity Relationship , Humans , Cheminformatics/methods , Structure-Activity Relationship , Animals
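
The similarity-based read-across idea reviewed in the abstract above can be illustrated with a minimal sketch. It assumes compounds are already encoded as sets of fingerprint bits and predicts a query value as the similarity-weighted mean of its nearest source compounds. The function names, the choice of Tanimoto similarity, and the toy data are assumptions made for illustration; this is not the ToxRead, GenRA, or q-RASAR implementation.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def read_across(query: set, sources: list, k: int = 2) -> float:
    """Similarity-weighted mean over the k most similar source compounds.

    `sources` is a list of (fingerprint_bits, measured_value) pairs.
    """
    scored = sorted(
        ((tanimoto(query, fp), value) for fp, value in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        raise ValueError("no source compound shares a feature with the query")
    return sum(sim * value for sim, value in scored) / total


# Toy data: bit sets and endpoint values are invented for illustration only.
sources = [
    ({1, 2, 3, 4}, 1.0),
    ({1, 2, 3, 5}, 0.0),
    ({7, 8, 9}, 1.0),
]
query = {1, 2, 3, 4, 5}
prediction = read_across(query, sources, k=2)
print(prediction)
```

With these toy inputs the two nearest neighbours are equally similar to the query and carry opposite labels, so the prediction falls halfway between them; in practice such a tie would flag the query as uncertain.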
4.
J Nat Prod ; 87(9): 2216-2229, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39269718

ABSTRACT

Natural products (NPs) are secondary metabolites of natural origin with broad applications across various human activities, particularly the discovery of bioactive compounds. Structural elucidation of new NPs entails significant cost and effort. On the other hand, the dereplication of known compounds is crucial for the early exclusion of irrelevant compounds in contemporary pharmaceutical research. NAPROC-13 stands out as a publicly accessible database, providing structural and 13C NMR spectroscopic information for over 25 000 compounds, rendering it a pivotal resource in natural product (NP) research, favoring open science. This study seeks to quantitatively analyze the chemical content, structural diversity, and chemical space coverage of NPs within NAPROC-13, compared to FDA-approved drugs and a very diverse subset of NPs, UNPD-A. Findings indicated that NPs in NAPROC-13 exhibit properties comparable to those in UNPD-A, albeit showcasing a notably diverse array of structural content, scaffolds, ring systems of pharmaceutical interest, and molecular fragments. NAPROC-13 covers a specific region of the chemical multiverse (a generalization of the chemical space from different chemical representations) regarding physicochemical properties and a region as broad as UNPD-A in terms of the structural features represented by fingerprints.


Subject(s)
Biological Products , Biological Products/chemistry , Molecular Structure , Cheminformatics/methods , Carbon-13 Magnetic Resonance Spectroscopy
5.
J Chem Inf Model ; 64(19): 7189-7213, 2024 Oct 14.
Article in English | MEDLINE | ID: mdl-39302256

ABSTRACT

A knowledge graph (KG) is a technique for modeling entities and their interrelations. Knowledge graph embedding (KGE) translates these entities and relationships into a continuous vector space to facilitate dense and efficient representations. In the domain of chemistry, applying KG and KGE techniques integrates heterogeneous chemical information into a coherent and user-friendly framework, enhances the representation of chemical data features, and is beneficial for downstream tasks, such as chemical property prediction. This paper begins with a comprehensive review of classical and contemporary KGE methodologies, including distance-based models, semantic matching models, and neural network-based approaches. We then catalogue the primary databases employed in chemistry and biochemistry that furnish the KGs with essential chemical data. Subsequently, we explore the latest applications of KG and KGE in chemistry, focusing on risk assessment, property prediction, and drug discovery. Finally, we discuss the current challenges to KG and KGE techniques and provide a perspective on their potential future developments.


Subject(s)
Neural Networks, Computer , Drug Discovery/methods , Cheminformatics/methods , Databases, Chemical , Humans
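
As a concrete illustration of the distance-based family of KGE models mentioned in the abstract above, the sketch below scores a (head, relation, tail) triple in the style of TransE, where a plausible triple satisfies head + relation ≈ tail. TransE is chosen here as an example and is not named in the abstract; the entity names and two-dimensional vectors are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """Negative Euclidean distance between (head + relation) and tail.

    Scores closer to zero indicate a more plausible triple.
    """
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Hypothetical 2-D embeddings, invented for illustration only.
compound_a = [0.1, 0.4]
binds_to = [0.5, -0.2]
target_x = [0.6, 0.2]
target_y = [-0.9, 0.9]

plausible = transe_score(compound_a, binds_to, target_x)
implausible = transe_score(compound_a, binds_to, target_y)
print(plausible, implausible)
```

In a trained model the embeddings are learned so that observed triples score higher than corrupted ones; here the vectors are simply set by hand to show the scoring rule.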
6.
J Chem Inf Model ; 64(19): 7303-7312, 2024 Oct 14.
Article in English | MEDLINE | ID: mdl-39321215

ABSTRACT

Analyzing machine learning models, especially nonlinear ones, poses significant challenges. In this context, centered kernel alignment (CKA) has emerged as a promising model analysis tool that assesses the similarity between two embeddings. CKA's efficacy depends on selecting a kernel that adequately captures the underlying properties of the compared models. The model analysis tool was designed for neural networks (NNs) with their invariance to data rotation in mind and has been successfully employed in various scientific domains. However, CKA has rarely been adopted in cheminformatics, partly because of the popularity of the random forest (RF) machine learning algorithm, which is not rotationally invariant. In this work, we present the adaptation of CKA that builds on the RF kernel to match the properties of RF. As part of the method validation, we show that the model analysis method is well-correlated with the prediction similarity of RF models. Furthermore, we demonstrate how CKA with the RF kernel can be utilized to analyze and explain the behavior of RF models derived from molecular and rooted fingerprints.


Subject(s)
Machine Learning , Neural Networks, Computer , Algorithms , Cheminformatics/methods , Models, Molecular
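
The CKA similarity described in the abstract above can be computed from two Gram (kernel) matrices over the same samples: both matrices are double-centered, and their Frobenius inner product is normalized by the product of their norms. The sketch below is a minimal pure-Python version that uses a linear kernel as a stand-in; the random forest kernel proposed in the paper is not reproduced here, and all function names are assumptions.

```python
import math


def center(gram):
    """Double-center a square, symmetric Gram matrix (H K H, H = I - 11^T/n)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def cka(gram_a, gram_b):
    """Centered kernel alignment between two Gram matrices."""
    ca, cb = center(gram_a), center(gram_b)
    return frobenius_inner(ca, cb) / math.sqrt(
        frobenius_inner(ca, ca) * frobenius_inner(cb, cb)
    )


def linear_gram(embedding):
    """Linear-kernel Gram matrix: pairwise dot products of sample vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in embedding] for u in embedding]


# Toy embeddings of three samples; `rotated` is `original` turned by 90 degrees.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
unrelated = [[1.0, 0.0], [2.0, 0.0], [0.0, 5.0]]

same = cka(linear_gram(original), linear_gram(rotated))
different = cka(linear_gram(original), linear_gram(unrelated))
print(same, different)
```

The rotated embedding yields the same Gram matrix as the original, which illustrates the rotation invariance of the linear kernel that the abstract contrasts with random forest models.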
7.
Sci Rep ; 14(1): 20812, 2024 09 06.
Article in English | MEDLINE | ID: mdl-39242880

ABSTRACT

With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. 
Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.


Subject(s)
Chemical and Drug Induced Liver Injury , Quantitative Structure-Activity Relationship , Chemical and Drug Induced Liver Injury/etiology , Algorithms , Machine Learning , Humans , Cheminformatics/methods
8.
Chem Pharm Bull (Tokyo) ; 72(9): 794-799, 2024.
Article in English | MEDLINE | ID: mdl-39218704

ABSTRACT

Recently, remarkable progress has been achieved in artificial intelligence (AI), including machine learning. Various AI models have been proposed for drug discovery, including the design of small molecules, activity prediction, and three-dimensional (3D) structure prediction of proteins. AI consists of diverse elements, including information retrieval and machine learning, and can be used in a wide range of drug discovery scenarios. In this review, we focused on AI for small-molecule drug discovery with respect to molecular design, activity prediction, and prediction of the binding poses of compounds to target molecules. We also discussed the applications of AI in academic drug discovery.


Subject(s)
Artificial Intelligence , Cheminformatics , Drug Discovery , Humans , Machine Learning , Small Molecule Libraries/chemistry
9.
J Vis Exp ; (211)2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39311615

ABSTRACT

Chemical space is a multidimensional descriptor space that encloses all possible molecules, and at least 1 × 10⁶⁰ organic substances with a molecular weight below 500 Da are thought to be potentially relevant for drug discovery. Natural products have been the primary source of the new pharmacological entities marketed during the past forty years and continue to be one of the most productive sources for the creation of innovative medications. Chemoinformatics-based computational tools accelerate the drug development process for natural products. Methods including estimating bioactivities, safety profiles, ADME, and natural product likeness measurement have been used. Here, we go over recent developments in chemoinformatic tools designed to visualize, characterize, and expand the chemical space of natural compound data sets using various molecular representations, create visual representations of such spaces, and investigate structure-property relationships within chemical spaces. With an emphasis on drug discovery applications, we evaluate the open-source databases BIOFACQUIM and PeruNPDB as proof of concept.


Subject(s)
Biological Products , Drug Discovery , Biological Products/chemistry , Drug Discovery/methods , Cheminformatics/methods , Databases, Chemical
10.
Comput Biol Med ; 180: 108954, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39094327

ABSTRACT

Indoleamine 2,3-dioxygenase (IDO) and tryptophan 2,3-dioxygenase (TDO) are attractive drug targets for cancer immunotherapy. After the disappointing results of epacadostat as a selective IDO inhibitor in phase III clinical trials, there is much interest in the development of TDO-selective inhibitors. In the current study, several data analysis methods and machine learning approaches, including logistic regression, Random Forest, XGBoost, and Support Vector Machines, were used to model a data set of compounds retrieved from ChEMBL. Models based on Morgan fingerprints revealed notable fragments for the selective inhibition of IDO, TDO, or both. Multiple fragment docking was performed to find the best set of bound fragments and their orientation in space for efficient linking. Linking of the fragments and optimization of the final molecules were accomplished by means of an artificial intelligence generative framework. Finally, the selectivity of the optimized molecules was assessed, and the top 4 lead molecules were filtered through PAINS, Brenk, and NIH filters. Results indicated that phenyloxalamide, fluoroquinoline, and 3-bromo-4-fluoroaniline confer selectivity towards IDO inhibition. Correspondingly, 1-benzyl-1H-naphtho[2,3-d][1,2,3]triazole-4,9-dione was found to be an integral fragment for the selective inhibition of TDO by forming a coordination bond with the Fe atom of heme. In addition, furo[2,3-c]pyridine-2,3-diamine was found to be a common fragment for inhibition of both targets and can be used in the design of dual-target inhibitors of IDO and TDO. The new fragments introduced here can be useful building blocks for incorporation into selective TDO or dual IDO/TDO inhibitors.


Subject(s)
Cheminformatics , Enzyme Inhibitors , Indoleamine-Pyrrole 2,3,-Dioxygenase , Machine Learning , Tryptophan Oxygenase , Indoleamine-Pyrrole 2,3,-Dioxygenase/antagonists & inhibitors , Indoleamine-Pyrrole 2,3,-Dioxygenase/chemistry , Indoleamine-Pyrrole 2,3,-Dioxygenase/metabolism , Tryptophan Oxygenase/antagonists & inhibitors , Tryptophan Oxygenase/metabolism , Tryptophan Oxygenase/chemistry , Humans , Cheminformatics/methods , Enzyme Inhibitors/chemistry , Molecular Docking Simulation
11.
Microb Pathog ; 195: 106892, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39216611

ABSTRACT

The highly pathogenic Marburg virus (MARV), a non-segmented negative-strand RNA virus, is a member of the Filoviridae family. This article presents a computer-aided drug design (CADD) approach for identifying drug-like compounds that prevent MARV disease by inhibiting the nucleoprotein, which is responsible for viral replication. This study used a wide range of in silico drug design techniques to identify potential drugs. Out of 368 natural compounds, 202 compounds passed ADMET, and molecular docking identified the top two molecules (CID: 1804018 and 5280520) with high binding affinities of -6.77 and -6.672 kcal/mol, respectively. Both compounds showed interactions with the common amino acid residues SER_216, ARG_215, TYR_135, CYS_195, and ILE_108, which indicates that the lead compounds and control ligands interact in the common active site/catalytic site of the protein. The negative binding free energies of CID: 1804018 and 5280520 were -66.01 and -31.29 kcal/mol, respectively. The two lead compounds were re-evaluated using MD modeling techniques, which confirmed CID: 1804018 as the most stable when complexed with the target protein. PC3 of (Z)-2-(2,5-dimethoxybenzylidene)-6-(2-(4-methoxyphenyl)-2-oxoethoxy) benzofuran-3(2H)-one (CID: 1804018) was 8.74 %, whereas PC3 of 2'-Hydroxydaidzein (CID: 5280520) was 11.25 %. In this study, (Z)-2-(2,5-dimethoxybenzylidene)-6-(2-(4-methoxyphenyl)-2-oxoethoxy) benzofuran-3(2H)-one (CID: 1804018) showed significant stability at the protein's binding site across the ADMET, molecular docking, MM-GBSA, and MD simulation analyses, together with a high negative binding free energy value, confirming it as the best drug candidate. This compound is found in Angelica archangelica and may potentially inhibit the MARV nucleoprotein and thereby viral replication.


Subject(s)
Antiviral Agents , Benzofurans , Marburgvirus , Molecular Docking Simulation , Virus Replication , Antiviral Agents/pharmacology , Antiviral Agents/chemistry , Antiviral Agents/metabolism , Marburgvirus/drug effects , Marburgvirus/metabolism , Benzofurans/pharmacology , Benzofurans/chemistry , Benzofurans/metabolism , Virus Replication/drug effects , Cheminformatics/methods , Drug Design , Protein Binding , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/chemistry , Binding Sites , Ligands
12.
Molecules ; 29(15)2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39125052

ABSTRACT

Marine natural products (MNPs) continue to be tested primarily in cellular toxicity assays, both mammalian and microbial, despite most being inactive at concentrations relevant to drug discovery. These MNPs become missed opportunities and represent a wasteful use of precious bioresources. The use of cheminformatics aligned with published bioactivity data can provide insights to direct the choice of bioassays for the evaluation of new MNPs. Cheminformatics analysis of MNPs found in MarinLit (n = 39,730) up to the end of 2023 highlighted indol-3-yl-glyoxylamides (IGAs, n = 24) as a group of MNPs with no reported bioactivities. However, a recent review of synthetic IGAs highlighted these scaffolds as privileged structures with several compounds under clinical evaluation. Herein, we report the synthesis of a library of 32 MNP-inspired brominated IGAs (25-56) using a simple one-pot, multistep method affording access to these diverse chemical scaffolds. Directed by a meta-analysis of the biological activities reported for marine indole alkaloids (MIAs) and synthetic IGAs, the brominated IGAs 25-56 were examined for their potential bioactivities against the Parkinson's Disease amyloid protein alpha synuclein (α-syn), antiplasmodial activities against chloroquine-resistant (3D7) and sensitive (Dd2) parasite strains of Plasmodium falciparum, and inhibition of mammalian (chymotrypsin and elastase) and viral (SARS-CoV-2 3CLpro) proteases. All of the synthetic IGAs tested exhibited binding affinity to the amyloid protein α-syn, while some showed inhibitory activities against P. falciparum, and the proteases, SARS-CoV-2 3CLpro, and chymotrypsin. The cellular safety of the IGAs was examined against cancerous and non-cancerous human cell lines, with all of the compounds tested inactive, thereby validating cheminformatics and meta-analyses results. 
The findings presented herein expand our knowledge of marine IGA bioactive chemical space and advocate expanding the scope of biological assays routinely used to investigate NP bioactivities, specifically those more suitable for non-toxic compounds. By integrating cheminformatics tools and functional assays into NP biological testing workflows, we can aim to enhance the potential of NPs and their scaffolds for future drug discovery and development.


Subject(s)
Biological Products , Cheminformatics , Drug Discovery , Biological Products/chemistry , Biological Products/pharmacology , Humans , Cheminformatics/methods , SARS-CoV-2/drug effects , Aquatic Organisms/chemistry , Indoles/chemistry , Indoles/pharmacology , Plasmodium falciparum/drug effects , Indole Alkaloids/pharmacology , Indole Alkaloids/chemistry , Animals
13.
Biomolecules ; 14(8)2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39199420

ABSTRACT

The development of new treatments for neglected tropical diseases (NTDs) remains a major challenge in the 21st century. In most cases, the available drugs are obsolete and have limitations in terms of efficacy and safety. The situation becomes even more complex when considering the low number of new chemical entities (NCEs) currently in use in advanced clinical trials for most of these diseases. Natural products (NPs) are valuable sources of hits and lead compounds with privileged scaffolds for the discovery of new bioactive molecules. Considering the relevance of biodiversity for drug discovery, a chemoinformatics analysis was conducted on a compound dataset of NPs with anti-trypanosomatid activity reported in 497 research articles from 2019 to 2024. Structures corresponding to different metabolic classes were identified, including terpenoids, benzoic acids, benzenoids, steroids, alkaloids, phenylpropanoids, peptides, flavonoids, polyketides, lignans, cytochalasins, and naphthoquinones. This unique collection of NPs occupies regions of the chemical space with drug-like properties that are relevant to anti-trypanosomatid drug discovery. The gathered information greatly enhanced our understanding of biologically relevant chemical classes, structural features, and physicochemical properties. These results can be useful in guiding future medicinal chemistry efforts for the development of NP-inspired NCEs to treat NTDs caused by trypanosomatid parasites.


Subject(s)
Biodiversity , Biological Products , Cheminformatics , Drug Discovery , Neglected Diseases , Animals , Humans , Biological Products/chemistry , Biological Products/pharmacology , Biological Products/therapeutic use , Cheminformatics/methods , Drug Discovery/methods , Neglected Diseases/drug therapy , Trypanocidal Agents/chemistry , Trypanocidal Agents/pharmacology , Trypanocidal Agents/therapeutic use , Trypanosoma/drug effects
15.
Mol Inform ; 43(8): e202400050, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38979846

ABSTRACT

The exploration of chemical space is a fundamental aspect of chemoinformatics, particularly when one explores a large compound data set to relate chemical structures with molecular properties. In this study, we extend our previous work on chemical space visualization at the pharmacophoric level. Instead of using conventional binary classification of affinity (active vs inactive), we introduce a refined approach that categorizes compounds into four distinct classes based on their activity levels: super active, very active, active, and inactive. This classification enriches the color scheme applied to pharmacophore space, where the color representation of a pharmacophore hypothesis is driven by the associated compounds. Using the BCR-ABL tyrosine kinase as a case study, we identified intriguing regions corresponding to pharmacophore activity discontinuities, providing valuable insights for structure-activity relationships analysis.


Subject(s)
Fusion Proteins, bcr-abl , Protein Kinase Inhibitors , Fusion Proteins, bcr-abl/antagonists & inhibitors , Fusion Proteins, bcr-abl/chemistry , Protein Kinase Inhibitors/chemistry , Protein Kinase Inhibitors/pharmacology , Structure-Activity Relationship , Humans , Cheminformatics/methods , Pharmacophore
16.
J Chem Inf Model ; 64(15): 5888-5899, 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39009039

ABSTRACT

Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability for new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property, underscoring variance in its biological nature, toxicological context, and units, among other differences. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with 10 randomly selected training documents.
Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.


Subject(s)
Deep Learning , Data Mining/methods , Cheminformatics/methods
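
The F1-scores quoted in the abstract above combine precision and recall into a single harmonic mean. The sketch below shows that arithmetic on hypothetical extraction counts; the counts are invented for illustration and are not taken from the ChemREL evaluation.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct extractions, 10 spurious, 10 missed.
example_f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)
print(f"{example_f1:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two terms, a pipeline cannot reach a high F1-score by maximizing precision or recall alone.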
17.
PLoS One ; 19(7): e0306202, 2024.
Article in English | MEDLINE | ID: mdl-38968199

ABSTRACT

Chemical information has become increasingly ubiquitous and has outstripped the pace of analysis and interpretation. We have developed an R package, uafR, that automates a grueling retrieval process for gas chromatography coupled mass spectrometry (GC-MS) data and allows anyone interested in chemical comparisons to quickly perform advanced structural similarity matches. Our streamlined cheminformatics workflows allow anyone with basic experience in R to pull out component areas for tentative compound identifications using the best published understanding of molecules across samples (pubchem.gov). Interpretations can now be done at a fraction of the time, cost, and effort it would typically take using a standard chemical ecology data analysis pipeline. The package was tested in two experimental contexts: (1) A dataset of purified internal standards, which showed our algorithms correctly identified the known compounds with R² values ranging from 0.827-0.999 along concentrations ranging from 1 × 10⁻⁵ to 1 × 10³ ng/µl, (2) A large, previously published dataset, where the number and types of compounds identified were comparable (or identical) to those identified with the traditional manual peak annotation process, and NMDS analysis of the compounds produced the same pattern of significance as in the original study. Both the speed and accuracy of GC-MS data processing are drastically improved with uafR because it allows users to fluidly interact with their experiment following tentative library identifications [i.e. after the m/z spectra have been matched against an installed chemical fragmentation database (e.g. NIST)]. Use of uafR will allow larger datasets to be collected and systematically interpreted quickly. Furthermore, the functions of uafR could allow backlogs of previously collected and annotated data to be processed by new personnel or students as they are being trained.
This is critical as we enter the era of exposomics, metabolomics, volatilomes, and landscape level, high-throughput chemotyping. This package was developed to advance collective understanding of chemical data and is applicable to any research that benefits from GC-MS analysis. It can be downloaded for free along with sample datasets from GitHub at github.org/castratton/uafR or installed directly from R or RStudio using the developer tools: 'devtools::install_github("castratton/uafR")'.


Subject(s)
Algorithms , Gas Chromatography-Mass Spectrometry , Software , Gas Chromatography-Mass Spectrometry/methods , Cheminformatics/methods
18.
Mol Inform ; 43(7): e202400052, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38994633

ABSTRACT

Compound databases of natural products play a crucial role in drug discovery and development projects and have implications in other areas, such as food chemical research, ecology and metabolomics. Recently, we put together the first version of the Latin American Natural Product database (LANaPDB) as a collective effort of researchers from six countries to assemble a public and representative library of natural products in a geographical region with a large biodiversity. The present work aims to conduct a comparative and extensive profiling of the natural product-likeness of an updated version of LANaPDB and the ten individual compound databases that form part of LANaPDB. The natural product-likeness profile of the Latin American compound databases is contrasted with the profile of other major natural product databases in the public domain and a set of small-molecule drugs approved for clinical use. As part of the extensive characterization, we employed several chemoinformatics metrics of natural product likeness. The results of this study will capture the attention of the global community engaged in natural product databases, not only in Latin America but across the world.


Subject(s)
Biological Products , Biological Products/chemistry , Biological Products/pharmacology , Latin America , Small Molecule Libraries/pharmacology , Small Molecule Libraries/chemistry , Drug Discovery , Cheminformatics , Databases, Chemical
19.
J Chem Inf Model ; 64(14): 5451-5469, 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-38949069

ABSTRACT

This study addresses the challenge of accurately identifying stereoisomers in cheminformatics, which originates from our objective to apply machine learning to predict the association constant between cyclodextrin and a guest. Identifying stereoisomers is indeed crucial for machine learning applications. Current tools offer various molecular descriptors, including their textual representation as Isomeric SMILES that can distinguish stereoisomers. However, such representation is text-based and does not have a fixed size, so a conversion is needed to make it usable to machine learning approaches. Word embedding techniques can be used to solve this problem. Mol2vec, a word embedding approach for molecules, offers such a conversion. Unfortunately, it cannot distinguish between stereoisomers due to its inability to capture the spatial configuration of molecular structures. This study proposes several approaches that use word embedding techniques to handle molecular discrimination using stereochemical information on molecules or considering Isomeric SMILES notation as a text in Natural Language Processing. Our aim is to generate a distinct vector for each unique molecule, correctly identifying stereoisomer information in cheminformatics. The proposed approaches are then compared to our original machine learning task: predicting the association constant between cyclodextrin and a guest molecule.


Subject(s)
Machine Learning , Stereoisomerism , Cheminformatics/methods , Cyclodextrins/chemistry , Natural Language Processing
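
Treating Isomeric SMILES as text, as the abstract above describes, first requires splitting each string into tokens that keep the stereochemical markers intact. The sketch below is one possible tokenizer, assuming a simplified regular expression that covers bracket atoms (which carry the `@` and `@@` chirality tags), the organic-subset atoms, ring closures, and bond symbols including `/` and `\`. The pattern and function name are illustrative assumptions, not the authors' method, and the pattern does not cover every SMILES feature.

```python
import re

# Simplified token pattern; an assumption for illustration, not a full grammar.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@@H], chirality kept inside
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|[BCNOSPFI]"           # one-letter organic-subset atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring closures
    r"|\d"                   # one-digit ring closures
    r"|[=#$:/\\\-+().~*]"    # bonds (incl. / and \ for double-bond stereo), branches
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


# The two enantiomers of alanine differ only in the chirality tag.
enantiomer_1 = tokenize("N[C@@H](C)C(=O)O")
enantiomer_2 = tokenize("N[C@H](C)C(=O)O")
print(enantiomer_1)
print(enantiomer_2)

# Double-bond stereoisomers differ only in a directional bond token.
isomer_a = tokenize("F/C=C/F")
isomer_b = tokenize("F/C=C\\F")
print(isomer_a, isomer_b)
```

Because the stereo markers survive as distinct tokens, an embedding trained on such token sequences can in principle assign different vectors to stereoisomers, which a representation built only on molecular connectivity cannot do.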
20.
J Chem Inf Model ; 64(14): 5521-5534, 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-38950894

ABSTRACT

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.


Subject(s)
Machine Learning , Data Mining/methods , Databases, Chemical , Algorithms , Cheminformatics/methods