Results 1 - 20 of 25
1.
Sci Data ; 11(1): 303, 2024 Mar 18.
Article in English | MEDLINE | ID: mdl-38499581

ABSTRACT

Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performance, but their reliability may be deceptive when they are used prospectively. This study investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the aqueous solubility dataverse, and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performance. We also propose a workflow to curate aqueous solubility data, aiming to produce models that are useful to bench chemists. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The models obtained here and the quality-assessed datasets are publicly available.

2.
SAR QSAR Environ Res ; 32(3): 207-219, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33601989

ABSTRACT

In this article, we consider cross-validation of quantitative structure-property relationship models for reactions and show that the conventional k-fold cross-validation (CV) procedure gives an 'optimistically' biased assessment of prediction performance. To address this issue, we suggest two strategies of model cross-validation: 'transformation-out' CV and 'solvent-out' CV. Unlike the conventional k-fold cross-validation approach, which does not consider the nature of the objects, the proposed procedures provide an unbiased estimate of the predictive performance of the models for novel types of structural transformations in chemical reactions and for reactions proceeding under new conditions. Both suggested strategies have been applied to predict the rate constants of bimolecular elimination and nucleophilic substitution reactions, and of Diels-Alder cycloaddition. All suggested cross-validation methodologies and a tutorial are implemented in the open-source software package CIMtools (https://github.com/cimm-kzn/CIMtools).
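The idea behind a "transformation-out" or "solvent-out" split can be sketched as a grouped k-fold: every reaction sharing a group label (a transformation type or a solvent) is placed in the same fold, so the test fold only contains groups unseen in training. The sketch below is a minimal standard-library illustration under that assumption; it is not the CIMtools implementation, and the function name `group_out_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_out_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group label is split
    between the training and the test indices."""
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    # Assign whole groups to folds: largest group first, always into the
    # fold that currently holds the fewest objects.
    folds = [[] for _ in range(n_folds)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])

    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(fold)


# Toy data: each reaction carries a (hypothetical) transformation label.
transformations = ["SN2", "SN2", "E2", "E2", "E2", "DA", "DA", "SN2"]
splits = list(group_out_folds(transformations, n_folds=3))
```

Passing solvent labels instead of transformation labels would give the analogous "solvent-out" split under the same assumption.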


Subject(s)
Models, Chemical , Quantitative Structure-Activity Relationship , Software , Validation Studies as Topic
3.
SAR QSAR Environ Res ; 32(2): 111-131, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33461329

ABSTRACT

This paper is devoted to the analysis of available experimental data and the preparation of predictive models for the binding affinity of molecules to two nuclear receptors involved in endocrine disruption (ED): the oestrogen (ER) and the androgen (AR) receptors. The ED-relevant data were retrieved from multiple sources, including the CERAPP, CoMPARA, and Tox21 projects as well as the ChEMBL and PubChem databases. Data analysis performed with the help of generative topographic mapping revealed the problem of low agreement between experimental values from different sources. Collected data were used to train both classification models for ER and AR binding activities and regression models for relative binding affinity (RBA) and median inhibition concentration (IC50). These models displayed relatively poor performance in classification (sensitivities ER = 0.34, AR = 0.49) and in regression (the determination coefficient r² for the RBA and IC50 models in external validation varied from 0.44 to 0.76). Our analysis demonstrates that the low model performance resulted from misinterpreted experimental endpoints or wrongly reported values, thus confirming the observations reported in the CERAPP and CoMPARA studies. The developed models and the collected data sets, comprising 6215 (ER) and 3789 (AR) unique compounds, are freely available.


Subject(s)
Endocrine Disruptors/chemistry , Quantitative Structure-Activity Relationship , Receptors, Androgen/chemistry , Receptors, Estrogen/chemistry , Humans , Models, Theoretical
4.
SAR QSAR Environ Res ; 31(9): 655-675, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32799684

ABSTRACT

We report new consensus models estimating acute toxicity for algae, Daphnia and fish endpoints. We assembled a large collection of 3680 unique public compounds, each annotated with at least one experimental value for the given endpoint. Support Vector Machine models were internally and externally validated following the OECD principles. Reasonable predictive performance was achieved (RMSEext = 0.56-0.78), in line with that of state-of-the-art models. The known structural alerts are compared with an analysis of the atomic contributions to these models obtained using the ISIDA/ColorAtom utility. A benchmarking against existing tools has been carried out on a set of compounds considered more representative of, and relevant to, the chemical space of the current chemical industry. Our model achieved among the best accuracy and data coverage. Nevertheless, performance on industrial data was noticeably lower than on public data, indicating that existing models fail to meet industrial needs. Thus, the final models were updated with the inclusion of new industrial compounds, extending their applicability domain and their relevance for application in an industrial context. Generated models and collected public data are made freely available.


Subject(s)
Daphnia/drug effects , Fishes , Microalgae/drug effects , Quantitative Structure-Activity Relationship , Toxicity Tests, Acute , Water Pollutants, Chemical/toxicity , Animals , Support Vector Machine
5.
SAR QSAR Environ Res ; 31(8): 597-613, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32646236

ABSTRACT

Here we report a new predictive model for autoignition temperature (AIT), an important physical parameter widely used to assess potential safety hazards of combustible materials. Available structure-AIT data extracted from different sources were critically analysed. Support vector regression (SVR) models on different data subsets were built in order to identify a reliable compound set on which a realistic model could be built. This led to a selection of the dataset containing 875 compounds annotated with AIT values. The thereupon-based SVR model performs reasonably well in cross-validation with the determination coefficient r² = 0.77 and mean absolute error MAE = 37.8°C. External validation on 20 industrial compounds missing in the training set confirmed its good predictive power (MAE = 28.7°C).


Subject(s)
Fires , Quantitative Structure-Activity Relationship , Temperature , Chemical Phenomena , Data Analysis , Models, Chemical
6.
SAR QSAR Environ Res ; 31(7): 493-510, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32588650

ABSTRACT

The evaluation of the persistence of chemicals in environmental media (water, soil, sediment) is included in European regulations, in the context of the Persistence, Bioaccumulation and Toxicity (PBT) assessment. In silico predictions are valuable alternatives for compound screening and prioritization. However, existing prediction tools have limitations: narrow applicability domains due to their relatively small training sets, and a lack of medium-specific models. A dataset of 1579 unique compounds has been collected by merging several persistence data sources, each compound being annotated with at least one experimental dissipation half-life value for the given environmental medium. This dataset was used to train binary classification models discriminating persistent/non-persistent (P/nP) compounds based on REACH half-life thresholds for the sediment, water and soil compartments. Models were built using ISIDA (In SIlico design and Data Analysis) fragment descriptors and support vector regression, random forest and naïve Bayesian machine-learning methods. All models achieved satisfactory performance, the sediment model performing best (BAext = 0.91), followed by water (BAext = 0.77) and soil (BAext = 0.76). The latter suffer from low detection of persistent ('P') compounds (Snext = 0.50), reflecting discrepancies in reported half-life measurements among the different data sources. Generated models and collected data are made publicly available.
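To make the relation between the reported metrics concrete: balanced accuracy is commonly defined as the mean of sensitivity and specificity, so a model can reach a respectable BA while still missing half of the persistent compounds. The sketch below uses invented labels, not data from the study, and the function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive="P"):
    """Return (sensitivity, specificity) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(y_true, y_pred, positive="P"):
    """Mean of sensitivity and specificity."""
    sn, sp = sensitivity_specificity(y_true, y_pred, positive)
    return (sn + sp) / 2


# Invented example: 4 persistent (P) and 6 non-persistent (nP) compounds;
# the classifier recovers only half of the persistent ones.
y_true = ["P", "P", "P", "P", "nP", "nP", "nP", "nP", "nP", "nP"]
y_pred = ["P", "P", "nP", "nP", "nP", "nP", "nP", "nP", "nP", "nP"]

sn, sp = sensitivity_specificity(y_true, y_pred)
ba = balanced_accuracy(y_true, y_pred)
```

In this toy case the sensitivity is 0.5 and the specificity is 1.0, giving a balanced accuracy of 0.75.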


Subject(s)
Environmental Pollutants/pharmacology , Quantitative Structure-Activity Relationship , Bayes Theorem , Computer Simulation , Environmental Pollutants/chemistry , Half-Life , Models, Chemical , Support Vector Machine
7.
SAR QSAR Environ Res ; 31(3): 171-186, 2020 Mar.
Article in English | MEDLINE | ID: mdl-31858821

ABSTRACT

The European Registration, Evaluation, Authorization and Restriction of Chemical Substances Regulation requires marketed chemicals to be evaluated for Ready Biodegradability (RB), and accepts in silico prediction as a valid alternative to experimental testing. However, currently available models may not be relevant for predicting compounds of industrial interest, owing to limited accuracy and restricted applicability domains. In this work, we present a new and extended RB dataset (2830 compounds), obtained by merging several public data sources. It was used to train classification models, which were externally validated and benchmarked against existing tools on a set of 316 compounds from an industrial context. The new models showed good performance in terms of predictive power (Balanced Accuracy (BA) = 0.74-0.79) and data coverage (83-91%). The Generative Topographic Mapping approach identified several chemotypes and structural motifs unique to the industrial dataset, highlighting the chemical classes for which currently available models may give less reliable predictions. Finally, public and industrial data were merged into a global dataset containing 3146 compounds. This is the largest dataset reported in the literature so far, covering some chemotypes absent from the public data. Thus, the predictive model developed on the global dataset has a larger applicability domain than the existing ones.


Subject(s)
Databases, Chemical , Environmental Pollutants/chemistry , Models, Chemical , Algorithms , Benchmarking , Biodegradation, Environmental , Computer Simulation , Databases, Chemical/standards , Quantitative Structure-Activity Relationship , Reproducibility of Results
8.
SAR QSAR Environ Res ; 30(12): 879-897, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31607169

ABSTRACT

We report predictive models of acute oral systemic toxicity, a follow-up of our previous work in the framework of the NICEATM project. It includes an update of the original models through the addition of new data and an external validation of the models using a dataset relevant to the chemical industry. A regression model for LD50 and a multi-class classification model for toxicity classes according to the Global Harmonized System categories were prepared. ISIDA descriptors were used to encode molecular structures. Machine learning algorithms included support vector machine (SVM), random forest (RF) and naïve Bayesian. Selected individual models were combined in consensus. The different datasets were compared using the generative topographic mapping approach. It appeared that the NICEATM datasets lacked some chemotypes relevant to the chemical industry. The new models trained on enlarged data sets have applicability domains (AD) sufficiently large to accommodate industrial compounds. The fraction of compounds inside the models' AD increased from 58% (NICEATM model) to 94% (new model). Enlarging the training sets improved the models' predictive performance: RMSE values decreased from 0.56 to 0.47 and balanced accuracies increased from 0.69 to 0.71 for the NICEATM and new models, respectively.


Subject(s)
Animal Testing Alternatives/methods , Models, Theoretical , Toxicity Tests, Acute/methods , Administration, Oral , Animal Testing Alternatives/standards , Animals , Computer Simulation , Consensus , Databases, Chemical , Machine Learning , Quantitative Structure-Activity Relationship , Rats , Reproducibility of Results , Toxicity Tests, Acute/standards
9.
SAR QSAR Environ Res ; 30(7): 507-524, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31244346

ABSTRACT

The bioconcentration factor (BCF), a key parameter required by the REACH regulation, estimates the tendency of a xenobiotic to concentrate inside living organisms. In silico methods can be valid alternatives to costly data measurements. However, in the industrial context, these theoretical approaches may fail to predict BCF with reasonable accuracy. We analyzed whether models built on public data only perform adequately when challenged to predict industrial compounds. A new set of 1129 compounds has been collected by merging publicly available datasets. Generative Topographic Mapping was employed to compare this chemical space with a set of new compounds from industry. Some new chemotypes absent from the training set (such as siloxanes) have been detected. A new BCF model has been built using ISIDA (In SIlico design and Data Analysis) fragment descriptors, support vector regression and random forest machine-learning methods. It has been externally validated on: (i) collected data from the literature and (ii) industrial data. The latter also served as a benchmark for the freely available tools VEGA, EPISuite, TEST and OPERA. The new model performs comparably to existing ones (RMSE of 0.58 log BCF units) but benefits from an extended applicability domain, covering the chemical space of the industrial set (78% data coverage).


Subject(s)
Computer Simulation , Quantitative Structure-Activity Relationship , Water Pollutants, Chemical/chemistry , Xenobiotics/chemistry , Animals , Food Chain , Machine Learning , Support Vector Machine , Water Pollutants, Chemical/metabolism , Xenobiotics/metabolism
10.
Mol Inform ; 38(10): e1900014, 2019 10.
Article in English | MEDLINE | ID: mdl-31166649

ABSTRACT

We report the building, validation and release of QSPR (Quantitative Structure Property Relationship) models aiming to guide the design of new solvents for the next generation of Li-ion batteries. The dataset compiled from the literature included oxidation potentials (Eox), specific ionic conductivities (κ), melting points (Tm) and boiling points (Tb) for 103 electrolytes. Each of the resulting consensus models assembled 9-19 individual Support Vector Machine models built on different sets of ISIDA fragment descriptors. They were implemented in the ISIDA/Predictor software. Developed models were used to screen a virtual library of 9965 esters and sulfones. The most promising compounds prioritized according to theoretically estimated properties were synthesized and experimentally tested.


Subject(s)
Computer Simulation , Drug Evaluation, Preclinical , Electrolytes/chemistry , Electrolytes/chemical synthesis , Solvents/chemistry , Solvents/chemical synthesis , Electric Conductivity , Electric Power Supplies , Electrochemical Techniques , Electrolytes/analysis , Esters/chemical synthesis , Esters/chemistry , Lithium/chemistry , Models, Molecular , Molecular Structure , Quantitative Structure-Activity Relationship , Software , Solvents/analysis , Sulfones/chemical synthesis , Sulfones/chemistry , Support Vector Machine
11.
J Comput Aided Mol Des ; 32(3): 401-414, 2018 03.
Article in English | MEDLINE | ID: mdl-29380104

ABSTRACT

We report the first direct QSPR modeling of equilibrium constants of tautomeric transformations (log K_T) in different solvents and at different temperatures, which does not require intermediate assessment of acidity (basicity) constants for all tautomeric forms. The key step of the modeling consisted in merging two tautomers into a single molecular graph ("condensed reaction graph"), which makes it possible to compute molecular descriptors characterizing the entire equilibrium. The support vector regression method was used to build the models. The training set consisted of 785 transformations belonging to 11 types of tautomeric reactions with equilibrium constants measured in different solvents and at different temperatures. The models obtained perform well both in cross-validation (Q² = 0.81, RMSE = 0.7 log K_T units) and on two external test sets. Benchmarking studies demonstrate that our models outperform results obtained with DFT B3LYP/6-311++G(d,p) and with ChemAxon Tautomerizer, which is applicable only in water at room temperature.


Subject(s)
Computer Simulation , Solvents/chemistry , Temperature , Isomerism , Molecular Structure , Quantitative Structure-Activity Relationship , Thermodynamics , Water/chemistry
12.
Mol Inform ; 36(10)2017 10.
Article in English | MEDLINE | ID: mdl-28902973

ABSTRACT

Here, we describe an algorithm to visualize chemical structures on a grid-based layout in such a way that similar structures are neighbors. It is based on structure reordering with the help of the Hilbert-Schmidt Independence Criterion, an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator. The method can be applied to any layout of two- or three-dimensional shape. The approach is demonstrated on a set of dopamine D5 ligands visualized on square, disk and spherical layouts.


Subject(s)
Receptors, Dopamine D5/chemistry , Algorithms , Computer Graphics , Computer Simulation , Signal Transduction , User-Computer Interface
13.
J Chem Inf Model ; 55(2): 239-50, 2015 Feb 23.
Article in English | MEDLINE | ID: mdl-25588070

ABSTRACT

A generic chemical transformation may often be achieved under various synthetic conditions. However, for any specific reagents, only one or a few among the reported synthetic protocols may be successful. For example, Michael β-addition reactions may proceed under different choices of solvent (e.g., hydrophobic, aprotic polar, protic) and catalyst (e.g., Brønsted acid, Lewis acid, Lewis base, etc.). Chemoinformatics methods could be efficiently used to establish a relationship between the reagent structures and the required reaction conditions, which would allow synthetic chemists to waste less time and resources in trying out various protocols in search of the appropriate one. In order to address this problem, a number of two-class classification models have been built on a set of 198 Michael reactions retrieved from the literature. Trained models discriminate between processes that are feasible and processes that are not feasible under a specific reaction condition option (feasible or not with a Lewis acid catalyst, feasible or not in a hydrophobic solvent, etc.). Eight distinct models were built to decide the compatibility of a Michael addition process with each considered reaction condition option, while a ninth model aimed to predict whether the assumed Michael addition is feasible at all. Different machine-learning methods (Support Vector Machine, Naive Bayes, and Random Forest) in combination with different types of descriptors (ISIDA fragments issued from Condensed Graphs of Reactions, MOLMAP, Electronic Effect Descriptors, and Chemistry Development Kit computed descriptors) have been used. Models have good predictive performance in 3-fold cross-validation repeated three times: balanced accuracy varies from 0.7 to 1. Developed models are available to users at http://infochim.u-strasbg.fr/webserv/VSEngine.html . Finally, these models were challenged to predict feasibility conditions for ∼50 novel Michael reactions from the eNovalys database (originally from patent literature).


Subject(s)
Chemistry, Organic/methods , Expert Systems , Algorithms , Bayes Theorem , Catalysis , Databases, Factual , Indicators and Reagents , Informatics , Machine Learning , Models, Chemical , Predictive Value of Tests , Quantitative Structure-Activity Relationship , Reproducibility of Results , Support Vector Machine
14.
Mol Inform ; 34(6-7): 348-56, 2015 06.
Article in English | MEDLINE | ID: mdl-27490381

ABSTRACT

In this paper we demonstrate that Generative Topographic Mapping (GTM), a machine learning method traditionally used for data visualisation, can be efficiently applied to QSAR modelling using probability distribution functions (PDF) computed in the latent 2-dimensional space. Several different scenarios of activity assessment were considered: (i) the "activity landscape" approach based on direct use of PDF, (ii) QSAR models involving GTM-generated descriptors derived from PDF, and (iii) the k-Nearest Neighbours approach in 2D latent space. Benchmarking calculations were performed on five different datasets: stability constants of complexes of the metal cations Ca²⁺, Gd³⁺ and Lu³⁺ with organic ligands in water, aqueous solubility, and activity of thrombin inhibitors. It has been shown that the performance of GTM-based regression models is similar to that obtained with some popular machine-learning methods (random forest, k-NN, M5P regression tree and PLS) and ISIDA fragment descriptors. By comparing GTM activity landscapes built on both predicted and experimental activities, we may visually assess the model's performance and identify the areas in the chemical space corresponding to reliable predictions. The applicability domain used in this work is based on data likelihood. Its application has significantly improved the model performance for 4 out of 5 datasets.
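Scenario (iii) can be illustrated with a plain k-nearest-neighbours regressor operating on 2D coordinates. The sketch below assumes the latent coordinates have already been produced by some GTM projection (not implemented here) and uses invented points and activities; `knn_predict` is a hypothetical helper, not part of any GTM package.

```python
import math


def knn_predict(latent_train, y_train, query, k=3):
    """Predict a property as the unweighted mean over the k training
    points closest to `query` in the 2D latent space."""
    by_distance = sorted(
        (math.dist(point, query), y) for point, y in zip(latent_train, y_train)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Invented 2D latent coordinates and property values.
latent_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
y_train = [1.0, 2.0, 3.0, 10.0]

prediction = knn_predict(latent_train, y_train, query=(0.1, 0.1), k=3)
```

Here the three nearest neighbours carry the values 1.0, 2.0 and 3.0, so the prediction is their mean, 2.0; the distant point at (5.0, 5.0) does not contribute.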


Subject(s)
Calcium/chemistry , Gadolinium/chemistry , Lutetium/chemistry , Machine Learning , Models, Chemical , Thrombin/chemistry , Databases, Chemical , Humans
15.
J Comput Aided Mol Des ; 27(8): 675-9, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23963658

ABSTRACT

The goal of this paper is to estimate the number of realistic drug-like molecules which could ever be synthesized. Unlike previous studies based on exhaustive enumeration of molecular graphs or on combinatorial enumeration of preselected fragments, we used the results of constrained graph enumeration by Reymond to establish a correlation between the number of generated structures (M) and the number of heavy atoms (N): logM = 0.584 × N × logN + 0.356. The number of atoms limiting the drug-like chemical space of molecules which follow Lipinski's rules (N = 36) has been obtained from the analysis of the PubChem database. This results in M ≈ 10³³, which lies between the numbers estimated by Ertl (10²³) and by Bohacek (10⁶⁰).
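The quoted estimate can be checked by direct substitution into the correlation. The sketch below assumes that "log" denotes the base-10 logarithm, which is consistent with the reported M ≈ 10³³; the function name is hypothetical.

```python
import math


def log10_structures(n_heavy_atoms):
    """log10 of the estimated number of structures M, using
    logM = 0.584 * N * logN + 0.356 with base-10 logarithms (assumed)."""
    return 0.584 * n_heavy_atoms * math.log10(n_heavy_atoms) + 0.356


# N = 36 heavy atoms, the limit quoted for drug-like molecules.
log_m = log10_structures(36)
```

For N = 36 this gives logM ≈ 33.1, i.e. M on the order of 10³³.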


Subject(s)
Databases, Pharmaceutical , Pharmaceutical Preparations/chemistry , Algorithms , Molecular Structure
16.
Mol Inform ; 31(3-4): 301-12, 2012 Apr.
Article in English | MEDLINE | ID: mdl-27477099

ABSTRACT

Here, the utility of Generative Topographic Maps (GTM) for data visualization, structure-activity modeling and database comparison is evaluated using subsets of the Database of Useful Decoys (DUD). Unlike other popular dimensionality reduction approaches such as Principal Component Analysis, Sammon Mapping or Self-Organizing Maps, GTMs have the great advantage of providing data probability distribution functions (PDF), both in the high-dimensional space defined by molecular descriptors and in the 2D latent space. PDFs for the molecules of different activity classes were successfully used to build classification models in the framework of the Bayesian approach. Because PDFs are represented by a mixture of Gaussian functions, the Bhattacharyya kernel has been proposed as a measure of the overlap of datasets, which leads to an elegant method for the global comparison of chemical libraries.
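As a simplified illustration of an overlap measure of this kind, the Bhattacharyya coefficient between two discrete distributions p and q is the sum over k of sqrt(p_k · q_k); it equals 1 for identical distributions and 0 for disjoint ones. The sketch below assumes, purely for illustration, that each library is summarized by its probability mass over a shared set of map nodes. This is a discrete stand-in, not the Gaussian-mixture kernel used in the paper, and the node weights are invented.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap of two non-negative weight vectors defined over the same
    nodes; both are normalized to sum to 1 before comparison."""
    if len(p) != len(q):
        raise ValueError("distributions must be defined over the same nodes")
    total_p, total_q = sum(p), sum(q)
    return sum(
        math.sqrt((a / total_p) * (b / total_q)) for a, b in zip(p, q)
    )


# Invented node weights for three small libraries on a 4-node map.
library_a = [0.5, 0.5, 0.0, 0.0]
library_b = [0.0, 0.0, 0.5, 0.5]
library_c = [0.25, 0.25, 0.25, 0.25]

same = bhattacharyya_coefficient(library_a, library_a)
disjoint = bhattacharyya_coefficient(library_a, library_b)
partial = bhattacharyya_coefficient(library_a, library_c)
```

Libraries occupying the same nodes score close to 1, libraries with no shared nodes score 0, and partially overlapping libraries fall in between.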

17.
Mol Inform ; 31(6-7): 491-502, 2012 Jul.
Article in English | MEDLINE | ID: mdl-27477467

ABSTRACT

This paper is devoted to the development of a methodology for QSPR modeling of mixtures and its application to vapor/liquid equilibrium diagrams for bubble point temperatures of binary liquid mixtures. Two types of special mixture descriptors based on the SiRMS and ISIDA approaches were developed. SiRMS-based fragment descriptors involve atoms belonging to both components of the mixture, whereas the ISIDA fragments belong only to one of these components. The models were built on a data set containing the phase diagrams for 167 mixtures represented by different combinations of 67 pure liquids. Consensus models were developed using nonlinear Support Vector Machine (SVM), Associative Neural Networks (ASNN), and Random Forest (RF) approaches. For SVM and ASNN calculations, the ISIDA fragment descriptors were used, whereas Simplex descriptors were employed in RF models. The models have been validated using three different protocols: "Points out", "Mixtures out" and "Compounds out", based on specific rules for forming training/test sets in each fold of cross-validation. A final validation of the models has been performed on an additional set of 94 mixtures represented by combinations of 34 novel compounds with modeling set chemicals and with each other. The root mean squared error of predictions for new mixtures of already known liquids does not exceed 5.7 K, which outperforms COSMO-RS models. The developed QSAR methodology can be applied to the modeling of any nonadditive property of binary mixtures (antiviral activities, drug formulation, etc.).
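The abstract does not spell out the rules of the three protocols. One plausible reading of "Compounds out", assumed here for illustration only, is that every mixture containing a held-out compound goes to the test set, so the training set never sees that compound in any mixture. The function name and the toy mixtures are hypothetical.

```python
def compounds_out_split(mixtures, held_out):
    """Split binary mixtures so that no training mixture contains a
    held-out compound (an assumed reading of 'Compounds out')."""
    held_out = set(held_out)
    train = [m for m in mixtures if not set(m) & held_out]
    test = [m for m in mixtures if set(m) & held_out]
    return train, test


# Toy binary mixtures, each given as a pair of component names.
mixtures = [
    ("water", "ethanol"),
    ("water", "acetone"),
    ("ethanol", "acetone"),
    ("benzene", "toluene"),
]

train, test = compounds_out_split(mixtures, held_out={"acetone"})
```

Under the same reading, a "Mixtures out" split would hold out all data points of a mixture together while its components may still appear in other training mixtures, and "Points out" would split individual data points without regard to the mixture they belong to.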

19.
J Phys Chem B ; 115(1): 93-8, 2011 Jan 13.
Article in English | MEDLINE | ID: mdl-21142192

ABSTRACT

This work is devoted to establishing a quantitative structure-property relationship (QSPR) between the chemical structure of ionic liquids (ILs) and their viscosity followed by computer-aided design of new ILs possessing desirable viscosity. The modeling was performed using back-propagation artificial neural networks on a set of 99 ILs at 25 °C, covering a large viscosity range from 3 to 800 cP. The ISIDA fragment descriptors were used to encode molecular structures of ILs. These models were first validated on 23 new ILs from Solvionic company and then used to predict the viscosity of three new ILs which then have been synthesized and tested. The models display high predictive performance in external 5-fold cross-validation: determination coefficients R² > 0.73 and absolute mean root mean square error < 70 cP. For three ILs synthesized and tested in this work, predicted viscosities are in good qualitative agreement with the experimentally measured ones.

20.
J Comput Aided Mol Des ; 19(9-10): 693-703, 2005.
Article in English | MEDLINE | ID: mdl-16292611

ABSTRACT

Substructural fragments are proposed as a simple and safe way to encode molecular structures in a matrix containing the occurrence of fragments of a given type. The knowledge retrieved from QSPR modelling can also be stored in that matrix in addition to the information about fragments. Complex supramolecular systems (using special bond types) and chemical reactions (represented as Condensed Graphs of Reactions, CGR) can be treated similarly. The efficiency of fragments as descriptors has been demonstrated in QSPR studies of aqueous solubility for a diverse set of organic compounds as well as in the analysis of thermodynamic parameters for hydrogen-bonding in some supramolecular complexes. It has also been shown that CGR may be an interesting opportunity to perform similarity searches for chemical reactions. The relationship between the density of information in descriptors/knowledge matrices and the robustness of QSPR models is discussed.
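A fragment occurrence matrix of the kind described can be sketched with a deliberately crude fragment type: bonded atom pairs counted on hydrogen-suppressed molecular graphs. This is a toy stand-in for the much richer ISIDA fragment types; bond orders are ignored, and the graph encoding and function name are assumptions made for illustration.

```python
from collections import Counter


def bonded_pair_fragments(atoms, bonds):
    """Count bonded atom pairs such as 'C-O'; element symbols are sorted
    so that 'O-C' and 'C-O' map to the same fragment."""
    counts = Counter()
    for i, j in bonds:
        first, second = sorted((atoms[i], atoms[j]))
        counts[f"{first}-{second}"] += 1
    return counts


# Toy hydrogen-suppressed graphs: (atom symbols, list of bonded index pairs).
molecules = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "dimethyl ether": (["C", "O", "C"], [(0, 1), (1, 2)]),
}

counts = {
    name: bonded_pair_fragments(atoms, bonds)
    for name, (atoms, bonds) in molecules.items()
}
# Columns of the descriptor matrix: every fragment seen in any molecule.
columns = sorted({frag for c in counts.values() for frag in c})
# Rows: occurrence counts per molecule, 0 where a fragment is absent.
matrix = {name: [c[frag] for frag in columns] for name, c in counts.items()}
```

With columns C-C and C-O, ethanol is encoded as [1, 1] and dimethyl ether as [0, 2], so the two isomers are distinguished even by this minimal fragment type.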


Subject(s)
Computer Simulation , Models, Chemical , Hydrogen Bonding , Macromolecular Substances , Molecular Structure , Quantitative Structure-Activity Relationship , Solubility , Thermodynamics , Water