ABSTRACT
The aim of the CADASTER project (CAse Studies on the Development and Application of in Silico Techniques for Environmental Hazard and Risk Assessment) was to exemplify REACH-related hazard assessments for four classes of chemical compounds: polybrominated diphenylethers, per- and polyfluorinated compounds, (benzo)triazoles, and musks and fragrances. The QSPR-THESAURUS website (http://qspr-thesaurus.eu) was established as the project's online platform for uploading, storing, applying, and creating models within the project. We give an overview of the main features of the website, such as model upload, experimental design and hazard assessment to support risk assessment, and integration with other web tools, all of which are essential parts of the QSPR-THESAURUS.
Subject(s)
Hazardous Substances/toxicity, Internet, Quantitative Structure-Activity Relationship, Risk Assessment, Linear Models, Research Design, Controlled Vocabulary
ABSTRACT
The dimethyl sulfoxide (DMSO) solubility data from Enamine and two UCB pharma compound collections were analyzed using 8 different machine learning methods and 12 descriptor sets. The analyzed data sets were highly imbalanced with 1.7-5.8% nonsoluble compounds. The libraries' enrichment by soluble molecules from the set of 10% of the most reliable predictions was used to compare prediction performances of the methods. The highest accuracies were calculated using a C4.5 decision classification tree, random forest, and associative neural networks. The performances of the methods developed were estimated on individual data sets and their combinations. The developed models provided on average a 2-fold decrease of the number of nonsoluble compounds amid all compounds predicted as soluble in DMSO. However, a 4-9-fold enrichment was observed if only 10% of the most reliable predictions were considered. The structural features influencing compounds to be soluble or nonsoluble in DMSO were also determined. The best models developed with the publicly available Enamine data set are freely available online at http://ochem.eu/article/33409 .
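The enrichment figure quoted above can be illustrated with a small calculation. The following is a minimal sketch, not the authors' procedure: it assumes each prediction carries a confidence score, ranks compounds by that score, and compares the nonsoluble rate in the top 10% with the rate in the whole set. The function name `fold_decrease_in_top`, the confidence values, and the toy data set (1000 compounds, 5% nonsoluble) are all hypothetical.

```python
def fold_decrease_in_top(records, top_fraction=0.10):
    """Fold decrease of the nonsoluble rate in the most confident predictions.

    records: list of (confidence, is_nonsoluble) pairs (hypothetical layout).
    Returns (nonsoluble rate in the full set) / (rate in the top fraction).
    """
    n = len(records)
    total_nonsoluble = sum(1 for _, ns in records if ns)
    # Most confident predictions first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    k = max(1, round(n * top_fraction))
    top_nonsoluble = sum(1 for _, ns in ranked[:k] if ns)
    if top_nonsoluble == 0:
        return float("inf")  # no nonsoluble compounds left in the top set
    # Ratio of the two rates, written with integer counts to avoid rounding noise.
    return (total_nonsoluble * k) / (top_nonsoluble * n)


# Toy data: 1000 compounds, 50 nonsoluble (5%), only one of them in the top 100.
nonsoluble_ids = {50} | {100 + 18 * j for j in range(49)}
records = [(1.0 - i / 1000, i in nonsoluble_ids) for i in range(1000)]

fold = fold_decrease_in_top(records)
print(f"Fold decrease of nonsoluble compounds in the top 10%: {fold:.1f}")
```

In this toy set the nonsoluble rate drops from 5% overall to 1% in the top decile, which corresponds to a 5-fold enrichment by soluble molecules.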
Subject(s)
Artificial Intelligence, Pharmaceutical Databases, Dimethyl Sulfoxide/chemistry, Informatics/methods, Linear Models, Neural Networks (Computer), Reproducibility of Results, Solubility, Support Vector Machine
ABSTRACT
The importance of reliable methods for representative sub-sampling in terms of experimental design and risk assessment within the European Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) system is crucial. We developed experimental design approaches, by utilising predicted properties and the 'distance to model' parameter, to estimate the benefits of certain compounds to the quality of a resulting model. A statistical evaluation of four regression data sets and one classification data set showed that the adaptive concept of iteratively refining the representation of the chemical space contributes to a more efficient and more reliable selection in comparison to traditional approaches. The evaluation of compounds with regard to the uncertainty and the correlation of prediction is beneficial, and in particular, for regression data sets of sufficient size, whereas the use of predicted properties to define the chemical space is beneficial for classification models.
Subject(s)
Hazardous Substances/toxicity, Regression Analysis, Research Design, Risk Assessment/methods
ABSTRACT
The article presents a Web-based platform for collecting and storing toxicological structural alerts from literature and for virtual screening of chemical libraries to flag potentially toxic chemicals and compounds that can cause adverse side effects. An alert is uniquely identified by a SMARTS template, a toxicological endpoint, and a publication where the alert was described. Additionally, the system allows storing complementary information such as name, comments, and mechanism of action, as well as other data. Most importantly, the platform can be easily used for fast virtual screening of large chemical datasets, focused libraries, or newly designed compounds against the toxicological alerts, providing a detailed profile of the chemicals grouped by structural alerts and endpoints. Such a facility can be used for decision making regarding whether a compound should be tested experimentally, validated with available QSAR models, or eliminated from consideration altogether. The alert-based screening can also be helpful for an easier interpretation of more complex QSAR models. The system is publicly accessible and tightly integrated with the Online Chemical Modeling Environment (OCHEM, http://ochem.eu). The system is open and expandable: any registered OCHEM user can introduce new alerts, browse, edit alerts introduced by other users, and virtually screen his/her data sets against all or selected alerts. The user sets being passed through the structural alerts can be used at OCHEM for other typical tasks: exporting in a wide variety of formats, development of QSAR models, additional filtering by other criteria, etc. The database already contains almost 600 structural alerts for such endpoints as mutagenicity, carcinogenicity, skin sensitization, compounds that undergo metabolic activation, and compounds that form reactive metabolites and, thus, can cause adverse reactions. 
The ToxAlerts platform is accessible on the Web at http://ochem.eu/alerts, and it is constantly growing.
Subject(s)
Chemical Databases, Drug-Related Side Effects and Adverse Reactions, Internet, Pharmaceutical Preparations/chemistry, Preclinical Drug Evaluation, Humans, Quantitative Structure-Activity Relationship
ABSTRACT
Predicting the CYP450 inhibition activity of small molecules is an important task because of the high risk of drug-drug interactions. CYP1A2 is an important member of the CYP450 superfamily and accounts for 15% of the total CYP450 present in the human liver. This article compares 80 in silico QSAR models that were created by following the same procedure with different combinations of descriptors and machine learning methods. The training and test sets consist of 3745 and 3741 inhibitors and noninhibitors from the PubChem BioAssay database. A heterogeneous external test set of 160 inhibitors was collected from the literature. The studied descriptor sets include E-state, Dragon and ISIDA SMF descriptors. The machine learning methods include Associative Neural Networks (ASNN), K Nearest Neighbors (kNN), Random Tree (RT), C4.5 Tree (J48), and Support Vector Machines (SVM). The influence of descriptor selection on model accuracy was studied, and the benefits of the "bagging" modeling approach were shown. The applicability domain approach was successfully applied in this study: ways of increasing model accuracy through the use of applicability domain measures were demonstrated, and a fragment-based model interpretation was performed. The most accurate models in this study correctly classified 83% and 68% of instances on the internal and external test sets, respectively. The applicability domain approach allowed the prediction accuracy to be increased to 90% for 78% of the internal and 17% of the external test sets, respectively. The most accurate models are available online at http://ochem.eu/models/Q5747.
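The trade-off between accuracy and coverage reported above can be sketched as follows. This is only an illustration under stated assumptions, not the study's implementation: each prediction is assumed to have an applicability domain measure (lower means more reliable) and a flag for whether it was correct, and the function finds the largest share of compounds for which the cumulative accuracy still meets a target. The name `coverage_at_accuracy` and the toy values are hypothetical.

```python
def coverage_at_accuracy(items, target):
    """Largest coverage at which cumulative accuracy is still >= target.

    items: list of (ad_measure, is_correct) pairs (hypothetical layout);
           a lower ad_measure means the compound is closer to the model.
    Returns the covered fraction of the data set (0.0 if never reached).
    """
    ranked = sorted(items, key=lambda x: x[0])  # most reliable first
    best_n = 0
    hits = 0
    for n, (_, is_correct) in enumerate(ranked, start=1):
        hits += 1 if is_correct else 0
        if hits >= target * n:  # accuracy over the first n compounds
            best_n = n
    return best_n / len(ranked)


# Toy data: ten predictions, errors concentrated at high AD values.
toy = [
    (0.1, True), (0.2, True), (0.3, True), (0.4, True), (0.5, True),
    (0.6, True), (0.7, True), (0.8, False), (0.9, True), (1.0, False),
]

overall_accuracy = sum(1 for _, ok in toy if ok) / len(toy)
coverage = coverage_at_accuracy(toy, target=0.9)
print(f"Overall accuracy: {overall_accuracy:.0%}")
print(f"Coverage at 90% accuracy: {coverage:.0%}")
```

Here the overall accuracy is 80%, but restricting predictions to the seven most reliable compounds keeps the accuracy above the 90% target.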
Subject(s)
Artificial Intelligence, Cytochrome P-450 CYP1A2 Inhibitors, Enzyme Inhibitors/pharmacology, Quantitative Structure-Activity Relationship, Enzyme Inhibitors/chemistry, Humans, Molecular Conformation
ABSTRACT
The Online Chemical Modeling Environment is a web-based platform that aims to automate and simplify the typical steps required for QSAR modeling. The platform consists of two major subsystems: the database of experimental measurements and the modeling framework. A user-contributed database contains a set of tools for easy input, search and modification of thousands of records. The OCHEM database is based on the wiki principle and focuses primarily on the quality and verifiability of the data. The database is tightly integrated with the modeling framework, which supports all the steps required to create a predictive model: data search, calculation and selection of a vast variety of molecular descriptors, application of machine learning methods, validation, analysis of the model and assessment of the applicability domain. As compared to other similar systems, OCHEM is not intended to re-implement the existing tools or models but rather to invite the original authors to contribute their results, make them publicly available, share them with other users and to become members of the growing research community. Our intention is to make OCHEM a widely used platform to perform the QSPR/QSAR studies online and share it with other users on the Web. The ultimate goal of OCHEM is collecting all possible chemoinformatics tools within one simple, reliable and user-friendly resource. The OCHEM is free for web users and it is available online at http://www.ochem.eu.
Subject(s)
Factual Databases, Internet, Chemical Models, Information Dissemination, Information Management, Quantitative Structure-Activity Relationship, User-Computer Interface
ABSTRACT
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1 .
Subject(s)
Benchmarking/methods, Classification/methods, Mutagenicity Tests/methods, Quantitative Structure-Activity Relationship, Mutagenicity Tests/standards, Principal Component Analysis
ABSTRACT
The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that quantifies the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces and not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at the http://www.qspr.org site.
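The ensemble-based distance to model described above can be sketched in a few lines. This is a minimal illustration, assuming the distance is simply the sample standard deviation of the predictions made by the individual models of an ensemble and the consensus prediction is their mean. The compound names, predicted values, and observed values are invented for the example and are not data from the study.

```python
from statistics import mean, stdev

# Hypothetical predictions of four ensemble members for four compounds.
ensemble_predictions = {
    "cmpd_A": [1.0, 1.1, 0.9, 1.0],
    "cmpd_B": [2.0, 2.6, 1.4, 2.0],
    "cmpd_C": [0.5, 0.5, 0.6, 0.4],
    "cmpd_D": [3.0, 1.8, 4.2, 3.0],
}
# Hypothetical experimental values used to check the error estimate.
observed = {"cmpd_A": 1.05, "cmpd_B": 2.7, "cmpd_C": 0.45, "cmpd_D": 1.9}

# Consensus prediction and distance to model (ensemble standard deviation).
consensus = {c: mean(p) for c, p in ensemble_predictions.items()}
dm_std = {c: stdev(p) for c, p in ensemble_predictions.items()}
abs_error = {c: abs(consensus[c] - observed[c]) for c in consensus}

# Split compounds into a low-DM half and a high-DM half.
ranked = sorted(dm_std, key=dm_std.get)
half = len(ranked) // 2
low_dm, high_dm = ranked[:half], ranked[half:]

mae_low = mean(abs_error[c] for c in low_dm)
mae_high = mean(abs_error[c] for c in high_dm)

for c in ranked:
    print(f"{c}: DM = {dm_std[c]:.3f}, |error| = {abs_error[c]:.2f}")
print(f"Mean absolute error, low-DM half:  {mae_low:.2f}")
print(f"Mean absolute error, high-DM half: {mae_high:.2f}")
```

In this toy set the compounds on which the ensemble members disagree most are also the ones with the largest consensus errors, which is the behavior the distance is meant to capture.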