Results 1 - 20 of 56
1.
J Chem Inf Model ; 62(14): 3275-3280, 2022 07 25.
Article in English | MEDLINE | ID: mdl-35796226

ABSTRACT

As with many other institutions, our company maintains many quantitative structure-activity relationship (QSAR) models of absorption, distribution, metabolism, excretion, and toxicity (ADMET) end points and updates the models regularly. We recently examined version-to-version predictivity for these models over a period of 10 years. In this approach we monitor the goodness of prediction of new molecules relative to the training set of model version V before they are incorporated in the updated model V+1. Using a cell-based permeability assay (Papp) as an example, we illustrate how the QSAR models made from this data are generally predictive and can be utilized to enrich chemical designs and synthesis. Despite the obvious utility of these models, we turned up unexpected behavior in Papp and other ADMET activities for which the explanation is not obvious. One such behavior is that the apparent predictivity of the models as measured by root-mean-square-error can vary greatly from version to version and is sometimes very poor. One intuitively appealing explanation is that the observed activities of the new molecules fall outside the bulk of activities in the training set. Alternatively, one may think that the new molecules are exploring different regions of chemical space than the training set. However, the real explanation has to do with activity cliffs. If the observed activities of the new molecules are different than expected based on similar molecules in the training set, the predictions will be less accurate. This is true for all our ADMET end points.


Subject(s)
Quantitative Structure-Activity Relationship
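
The version-to-version check described in this abstract can be sketched roughly as follows: model version V is scored by root-mean-square error on the molecules that will only enter the training set of version V+1. This is a minimal illustration, not the authors' code; the function names, the toy one-descriptor "models", and all numbers are made up for the example.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted activities."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def version_to_version_rmse(models, new_batches):
    """Score each model version on molecules measured after it was built.

    models[v] is any callable mapping a descriptor tuple to a predicted activity.
    new_batches[v] is a list of (descriptors, observed_activity) pairs that
    would only be added to the training set of version v + 1.
    """
    scores = []
    for model, batch in zip(models, new_batches):
        observed = [activity for _, activity in batch]
        predicted = [model(descriptors) for descriptors, _ in batch]
        scores.append(rmse(observed, predicted))
    return scores

# Hypothetical stand-ins for two successive model versions (not real QSAR models).
models = [lambda x: 2.0 * x[0], lambda x: 2.0 * x[0] + 0.5]
new_batches = [
    [((1.0,), 2.0), ((2.0,), 5.0)],   # molecules new relative to version 0
    [((1.0,), 2.5), ((3.0,), 6.5)],   # molecules new relative to version 1
]
scores = version_to_version_rmse(models, new_batches)
```

Plotting `scores` against the version index is one simple way to see the large swings in apparent predictivity that the abstract reports.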
2.
J Chem Inf Model ; 62(15): 3477-3485, 2022 08 08.
Article in English | MEDLINE | ID: mdl-35849796

ABSTRACT

As with other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.


Subject(s)
Quantitative Structure-Activity Relationship
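
One way to picture the "prediction shift" discussed in this abstract is to track each test molecule's prediction across model versions and flag those whose prediction jumps between consecutive versions. The sketch below is only an illustration: the 0.5 threshold, the molecule identifiers, and the prediction histories are assumptions, and the abstract does not say which metrics the authors actually used.

```python
def largest_shift(predictions_by_version):
    """Return (size, version_index) of the largest change between consecutive versions."""
    shifts = [abs(later - earlier)
              for earlier, later in zip(predictions_by_version, predictions_by_version[1:])]
    size = max(shifts)
    return size, shifts.index(size) + 1  # index of the version where the jump lands

def shifted_molecules(history, observed, threshold=0.5):
    """Flag molecules whose prediction moves by at least `threshold` in one update.

    history maps a molecule id to its predictions, ordered by model version.
    observed maps a molecule id to its measured activity.
    """
    report = {}
    for mol_id, predictions in history.items():
        size, version = largest_shift(predictions)
        if size >= threshold:
            error_first = abs(predictions[0] - observed[mol_id])
            error_last = abs(predictions[-1] - observed[mol_id])
            report[mol_id] = (version, size, error_last < error_first)
    return report

# Made-up prediction histories over four model versions.
history = {
    "mol-A": [5.0, 5.1, 5.0, 5.1],   # stable prediction
    "mol-B": [5.0, 5.1, 6.4, 6.5],   # shifts when version 2 is built
}
observed = {"mol-A": 5.2, "mol-B": 6.6}
report = shifted_molecules(history, observed)
```

Here `report` records, for each flagged molecule, the version at which the shift occurred, its size, and whether the latest prediction is closer to the measurement than the earliest one.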
3.
J Chem Inf Model ; 60(10): 4653-4663, 2020 10 26.
Article in English | MEDLINE | ID: mdl-33022174

ABSTRACT

While Gaussian process models are typically restricted to smaller data sets, we propose a variation which extends its applicability to the larger data sets common in the industrial drug discovery space, making it relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sub-linear with the sample size. The model can be efficiently built, permitting rapid updates to prevent degradation as new data is collected. Given its small number of hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the usual Gaussian process model, it natively produces principled and well-calibrated uncertainty estimates on its predictions. We compare this new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these promising advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.


Subject(s)
Drug Discovery , Quantitative Structure-Activity Relationship , Cluster Analysis , Normal Distribution
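
The core idea of a nearest neighbor Gaussian process can be sketched in a few lines: for each query, fit an ordinary Gaussian process to only its k nearest training points and read off the posterior mean and variance. The sketch below is not the code from the repository named in the abstract. It swaps locality-sensitive hashing for a brute-force neighbor search, and the RBF kernel, constant prior mean, noise level, and toy data are all assumptions made for illustration.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two descriptor tuples."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def nn_gp_predict(X, y, query, k=3, noise=1e-3, length_scale=1.0):
    """Return (mean, variance) from a GP fitted only to the k nearest neighbors."""
    # Brute-force neighbor search: a stand-in for the LSH lookup in the abstract.
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)))
    Xn = [X[i] for i in order[:k]]
    yn = [y[i] for i in order[:k]]
    mu = sum(yn) / len(yn)  # constant prior mean taken from the neighbors (assumption)
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(Xn)] for i, a in enumerate(Xn)]
    ks = [rbf(a, query, length_scale) for a in Xn]
    alpha = solve(K, [v - mu for v in yn])
    mean = mu + sum(k_i * a_i for k_i, a_i in zip(ks, alpha))
    v = solve(K, ks)
    var = rbf(query, query, length_scale) + noise - sum(k_i * v_i for k_i, v_i in zip(ks, v))
    return mean, max(var, 0.0)

# Toy one-descriptor data set (made up).
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
y = [0.0, 1.0, 2.0, 10.0]
near_mean, near_var = nn_gp_predict(X, y, (1.0,))
far_mean, far_var = nn_gp_predict(X, y, (5.0,))
```

The predictive variance is small for a query that sits on a training point and grows for a query far from its neighbors, which is the kind of uncertainty estimate the abstract refers to.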
4.
Chem Soc Rev ; 49(11): 3525-3564, 2020 06 07.
Article in English | MEDLINE | ID: mdl-32356548

ABSTRACT

Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.


Subject(s)
Chemistry, Pharmaceutical/methods , Drug-Related Side Effects and Adverse Reactions/metabolism , Pharmaceutical Preparations/chemistry , Algorithms , Animals , Artificial Intelligence , Databases, Factual , Drug Design , History, 20th Century , History, 21st Century , Humans , Models, Molecular , Quantitative Structure-Activity Relationship , Quantum Theory , Reproducibility of Results
5.
6.
J Chem Inf Model ; 60(6): 2773-2790, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32250622

ABSTRACT

Protein redesign and engineering has become an important task in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been an increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel, single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.


Subject(s)
Machine Learning , Neural Networks, Computer , Algorithms , Amino Acid Sequence , Protein Engineering
7.
J Chem Inf Model ; 60(4): 1969-1982, 2020 04 27.
Article in English | MEDLINE | ID: mdl-32207612

ABSTRACT

Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.


Subject(s)
Quantitative Structure-Activity Relationship , Scientific Experimental Error , Structure-Activity Relationship , Uncertainty
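
As an illustration of what an activity cliff metric can look like, the sketch below computes the Structure-Activity Landscape Index (SALI), which divides the activity difference of a pair of molecules by one minus their similarity. The abstract does not state which cliff metrics were used, so SALI is an assumption here, as are the set-based fingerprints, the Tanimoto similarity, and the toy activities.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def sali_pairs(fingerprints, activities, eps=1e-6):
    """Return (SALI, i, j) for every pair of molecules, largest cliffs first.

    SALI = |A_i - A_j| / (1 - similarity); eps avoids division by zero
    when two molecules have identical fingerprints.
    """
    scored = []
    for i, j in combinations(range(len(activities)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        sali = abs(activities[i] - activities[j]) / max(1.0 - sim, eps)
        scored.append((sali, i, j))
    return sorted(scored, reverse=True)

# Made-up fingerprints and activities: molecules 0 and 1 are similar but differ
# by three log units, which is the signature of an activity cliff.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
activities = [5.0, 8.0, 5.2]
ranked = sali_pairs(fingerprints, activities)
top_sali, top_i, top_j = ranked[0]
```

A data set in which many pairs score high on a measure like this would be expected, following the abstract's argument, to be harder to predict whatever QSAR method is used.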
9.
J Chem Inf Model ; 59(6): 2642-2655, 2019 06 24.
Article in English | MEDLINE | ID: mdl-30998343

ABSTRACT

Quantitative structure-activity relationship (QSAR) is a very commonly used technique for predicting the biological activity of a molecule using information contained in the molecular descriptors. The large number of compounds and descriptors and the sparseness of descriptors pose important challenges to traditional statistical methods and machine learning (ML) algorithms (such as random forest (RF)) used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Instead of only focusing on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, allowing one to also quantify uncertainties and hence to provide both point and interval estimation for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, generate interval estimates for molecular activities as well as descriptor importance, and conduct model diagnosis, which could not be easily handled through other approaches.


Subject(s)
Drug Discovery/methods , Quantitative Structure-Activity Relationship , Algorithms , Bayes Theorem , Machine Learning , Models, Chemical , Pharmaceutical Preparations/chemistry , Regression Analysis , Small Molecule Libraries/chemistry
10.
J Chem Inf Model ; 59(4): 1324-1337, 2019 04 22.
Article in English | MEDLINE | ID: mdl-30779563

ABSTRACT

Most chemists would agree that the ability to interpret a quantitative structure-activity relationship (QSAR) model is as important as the ability of the model to make accurate predictions. One type of interpretation is coloration of atoms in molecules according to the contribution of each atom to the predicted activity, as in "heat maps". The ability to determine which parts of a molecule increase the activity in question and which decrease it should be useful to chemists who want to modify the molecule. For that type of application, we would hope the coloration to not be particularly sensitive to the details of model building. In this Article, we examine a number of aspects of coloration against 20 combinations of descriptors and QSAR methods. We demonstrate that atom-level coloration is much less robust to descriptor/method combinations than cross-validated predictions. Even in ideal cases where the contribution of individual atoms is known, we cannot always recover the important atoms for some descriptor/method combinations. Thus, model interpretation by atom coloration may not be as simple as it first appeared.


Subject(s)
Computer Simulation , Quantitative Structure-Activity Relationship , Humans , Machine Learning , Workflow
11.
Science ; 362(6416), 2018 11 16.
Article in English | MEDLINE | ID: mdl-30442777

ABSTRACT

We demonstrate that the chemical-feature model described in our original paper is distinguishable from the nongeneralizable models introduced by Chuang and Keiser. Furthermore, the chemical-feature model significantly outperforms these models in out-of-sample predictions, justifying the use of chemical featurization from which machine learning models can extract meaningful patterns in the dataset, as originally described.


Subject(s)
Machine Learning , Models, Chemical
12.
PLoS One ; 13(9): e0203819, 2018.
Article in English | MEDLINE | ID: mdl-30192891

ABSTRACT

The melting temperature (Tm) of a protein is the temperature at which half of the protein population is in a folded state. Therefore, Tm is a measure of the thermostability of a protein. Increasing the Tm of a protein is a critical goal in biotechnology and biomedicine. However, predicting the change in melting temperature (dTm) due to mutations at a single residue is difficult because it depends on an intricate balance of forces. Existing methods for predicting dTm have had similar levels of success using generally complex models. We find that training a machine learning model with a simple set of easy to calculate physicochemical descriptors describing the local environment of the mutation performed as well as more complicated machine learning models and is 2-6 orders of magnitude faster. Importantly, unlike in most previous publications, we perform a blind prospective test on our simple model by designing 96 variants of a protein not in the training set. Results from retrospective and prospective predictions reveal the limited applicability domain of each model. This study highlights the current deficiencies in the available dTm dataset and is a call to the community to systematically design a larger and more diverse experimental dataset of mutants to prospectively predict dTm with greater certainty.


Subject(s)
Forecasting/methods , Proteins/chemistry , Transition Temperature , Machine Learning , Models, Chemical , Mutation , Protein Stability , Temperature
13.
Science ; 361(6402), 2018 08 10.
Article in English | MEDLINE | ID: mdl-29794218

ABSTRACT

Understanding the practical limitations of chemical reactions is critically important for efficiently planning the synthesis of compounds in pharmaceutical, agrochemical, and specialty chemical research and development. However, literature reports of the scope of new reactions are often cursory and biased toward successful results, severely limiting the ability to predict reaction outcomes for untested substrates. We herein illustrate strategies for carrying out large-scale surveys of chemical reactivity by using a material-sparing nanomole-scale automated synthesis platform with greatly expanded synthetic scope combined with ultrahigh-throughput matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry (MALDI-TOF MS).

14.
Drug Discov Today ; 23(1): 151-160, 2018 01.
Article in English | MEDLINE | ID: mdl-28917822

ABSTRACT

Increasing amounts of biological data are accumulating in the pharmaceutical industry and academic institutions. However, data does not equal actionable information, and guidelines for appropriate data capture, harmonization, integration, mining, and visualization need to be established to fully harness its potential. Here, we describe ongoing efforts at Merck & Co. to structure data in the area of chemogenomics. We are integrating complementary data from both internal and external data sources into one chemogenomics database (Chemical Genetic Interaction Enterprise; CHEMGENIE). Here, we demonstrate how this well-curated database facilitates compound set design, tool compound selection, target deconvolution in phenotypic screening, and predictive model building.


Subject(s)
Databases, Factual , Drug Discovery , Genomics , Models, Theoretical , Phenotype
15.
J Chem Inf Model ; 57(10): 2490-2504, 2017 10 23.
Article in English | MEDLINE | ID: mdl-28872869

ABSTRACT

Deep neural networks (DNNs) are complex computational models that have found great success in many artificial intelligence applications, such as computer vision [1,2] and natural language processing [3,4]. In the past four years, DNNs have also generated promising results for quantitative structure-activity relationship (QSAR) tasks [5,6]. Previous work showed that DNNs can routinely make better predictions than traditional methods, such as random forests, on a diverse collection of QSAR data sets. It was also found that multitask DNN models (those trained on and predicting multiple QSAR properties simultaneously) outperform DNNs trained separately on the individual data sets in many, but not all, tasks. To date there has been no satisfactory explanation of why the QSAR of one task embedded in a multitask DNN can borrow information from other unrelated QSAR tasks. Thus, using multitask DNNs in a way that consistently provides a predictive advantage becomes a challenge. In this work, we explored why multitask DNNs make a difference in predictive performance. Our results show that during prediction a multitask DNN does borrow "signal" from molecules with similar structures in the training sets of the other tasks. However, whether this borrowing leads to better or worse predictive performance depends on whether the activities are correlated. On the basis of this, we have developed a strategy to use multitask DNNs that incorporate prior domain knowledge to select training sets with correlated activities, and we demonstrate its effectiveness on several examples.


Subject(s)
Models, Chemical , Neural Networks, Computer , Proteins/chemistry , Quantitative Structure-Activity Relationship , Artificial Intelligence , Computer Simulation , Drug Delivery Systems
16.
J Chem Inf Model ; 57(8): 2068-2076, 2017 08 28.
Article in English | MEDLINE | ID: mdl-28692267

ABSTRACT

Multitask deep learning has emerged as a powerful tool for computational drug discovery. However, despite a number of preliminary studies, multitask deep networks have yet to be widely deployed in the pharmaceutical and biotech industries. This lack of acceptance stems from both software difficulties and lack of understanding of the robustness of multitask deep networks. Our work aims to resolve both of these barriers to adoption. We introduce a high-quality open-source implementation of multitask deep networks as part of the DeepChem open-source platform. Our implementation enables simple python scripts to construct, fit, and evaluate sophisticated deep models. We use our implementation to analyze the performance of multitask deep networks and related deep models on four collections of pharmaceutical data (three of which have not previously been analyzed in the literature). We split these data sets into train/valid/test using time and neighbor splits to test multitask deep learning performance under challenging conditions. Our results demonstrate that multitask deep networks are surprisingly robust and can offer strong improvement over random forests. Our analysis and open-source implementation in DeepChem provide an argument that multitask deep networks are ready for widespread use in commercial drug discovery.


Subject(s)
Drug Discovery/methods , Machine Learning , Absorption, Radiation , Inhibitory Concentration 50 , Protein Kinase Inhibitors/chemistry , Protein Kinase Inhibitors/pharmacology , Serine Proteinase Inhibitors/chemistry , Serine Proteinase Inhibitors/pharmacology , Software , Ultraviolet Rays
17.
J Med Chem ; 60(16): 6771-6780, 2017 08 24.
Article in English | MEDLINE | ID: mdl-28418656

ABSTRACT

High-throughput screening (HTS) has enabled millions of compounds to be assessed for biological activity, but challenges remain in the prioritization of hit series. While biological, absorption, distribution, metabolism, excretion, and toxicity (ADMET), purity, and structural data are routinely used to select chemical matter for further follow-up, the scarcity of historical ADMET data for screening hits limits our understanding of early hit compounds. Herein, we describe a process that utilizes a battery of in-house quantitative structure-activity relationship (QSAR) models to generate in silico ADMET profiles for hit series to enable more complete characterizations of HTS chemical matter. These profiles allow teams to quickly assess hit series for desirable ADMET properties or suspected liabilities that may require significant optimization. Accordingly, these in silico data can direct ADMET experimentation and profoundly impact the progression of hit series. Several prospective examples are presented to substantiate the value of this approach.


Subject(s)
Drug Discovery/methods , High-Throughput Screening Assays/methods , Pharmaceutical Preparations/chemistry , ATP Binding Cassette Transporter, Subfamily B, Member 1/metabolism , Animals , Computer Simulation , Drug-Related Side Effects and Adverse Reactions , Humans , Pharmaceutical Preparations/metabolism , Pharmacokinetics , Pharmacology , Quantitative Structure-Activity Relationship
18.
J Chem Inf Model ; 56(12): 2353-2360, 2016 12 27.
Article in English | MEDLINE | ID: mdl-27958738

ABSTRACT

In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.


Subject(s)
Quantitative Structure-Activity Relationship , Algorithms , Databases, Pharmaceutical , Drug Discovery , Humans , Models, Biological , Software
19.
Molecules ; 21(10), 2016 Sep 29.
Article in English | MEDLINE | ID: mdl-27689987

ABSTRACT

We apply matched molecular pair (MMP) analysis to data from ChirBase, which contains literature reports of chromatographic enantioseparations. For the 19 chiral stationary phases we examined, we were able to identify 289 sets of pairs where there is a statistically significant and consistent difference in enantioseparation due to a small chemical change. In many cases these changes highlight enantioselectivity differences between pairs or small families of closely related molecules that have for many years been used to probe the mechanisms of chromatographic chiral recognition; for example, the comparison of N-H vs. N-Me analytes to determine the criticality of an N-H hydrogen bond in chiral molecular recognition. In other cases, statistically significant MMPs surfaced by the analysis are less familiar or somewhat puzzling, sparking a need to generate and test hypotheses to more fully understand. Consequently, mining of appropriate datasets using MMP analysis provides an important new approach for studying and understanding the process of chromatographic enantioseparation.
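
The statistical step of a matched molecular pair analysis can be sketched as grouping pairs by their chemical transformation and asking whether the mean change in the measured property differs consistently from zero. The sketch below is not the authors' workflow. It assumes the pairs have already been identified, uses a one-sample t statistic as the consistency measure, and runs on made-up differences for the N-H to N-Me change mentioned in the abstract; the `min_pairs` cutoff is likewise hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def summarize_transformations(pairs, min_pairs=3):
    """Summarize property changes per transformation.

    pairs is an iterable of (transformation_label, delta), where delta is the
    property of the second molecule minus that of the first.
    Returns {label: (n_pairs, mean_delta, t_statistic)} for transformations
    seen at least min_pairs times; t_statistic is None if all deltas are equal.
    """
    grouped = defaultdict(list)
    for label, delta in pairs:
        grouped[label].append(delta)

    summary = {}
    for label, deltas in grouped.items():
        if len(deltas) < min_pairs:
            continue  # too few pairs to say anything about consistency
        m, s = mean(deltas), stdev(deltas)
        t = m / (s / math.sqrt(len(deltas))) if s > 0 else None
        summary[label] = (len(deltas), m, t)
    return summary

# Made-up differences in an enantioseparation measure for two transformations.
pairs = [
    ("N-H>>N-Me", -0.5),
    ("N-H>>N-Me", -0.7),
    ("N-H>>N-Me", -0.6),
    ("H>>F", 0.1),          # seen only once, so it is not summarized
]
summary = summarize_transformations(pairs)
n_pairs, mean_delta, t_stat = summary["N-H>>N-Me"]
```

A large absolute t statistic for a transformation suggests a consistent effect; converting it to a p-value and correcting for the number of transformations tested would be the natural next steps in a real analysis.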

20.
J Chem Inf Model ; 56(11): 2253-2262, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27766848

ABSTRACT

Several papers have appeared in which a ligand efficiency index instead of pIC50 is used as the activity in QSAR. The claim is that better fits and predictions are obtained with ligand efficiency. We show on both public-domain and in-house data sets that the apparent superiority is a statistical artifact that occurs when ligand efficiency indices are correlated with the physical property included in their definition (number of non-hydrogens, ALOGP, TPSA, etc.) and when the property is easier to predict than the original pIC50.


Subject(s)
Quantitative Structure-Activity Relationship , Humans , Inhibitory Concentration 50 , Ligands
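
The artifact described in this abstract can be reproduced with a small simulation. Below, pIC50 is drawn independently of molecular size, so size carries no information about activity, yet a "model" that knows only the size still correlates well with ligand efficiency because size appears in the efficiency's own definition. The definition used here (pIC50 divided by the number of non-hydrogen atoms), the value ranges, and the random seed are assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_heavy = [rng.randint(15, 45) for _ in range(500)]    # non-hydrogen atom counts
pic50 = [rng.uniform(5.0, 9.0) for _ in range(500)]    # independent of size by construction

# Ligand efficiency index, assumed here to be pIC50 per non-hydrogen atom.
ligand_eff = [p / n for p, n in zip(pic50, n_heavy)]

# A size-only "prediction" of ligand efficiency: mean pIC50 divided by size.
mean_pic50 = sum(pic50) / len(pic50)
size_only_le = [mean_pic50 / n for n in n_heavy]

r_activity = pearson(n_heavy, pic50)          # how well size tracks pIC50
r_le = pearson(size_only_le, ligand_eff)      # how well size alone tracks ligand efficiency
```

Comparing `r_activity` with `r_le` shows the effect: the second correlation is high even though nothing about the activity itself has been learned.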
...