1.
J Chem Inf Model ; 62(15): 3477-3485, 2022 08 08.
Article in English | MEDLINE | ID: mdl-35849796

ABSTRACT

As with other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.


Subject(s)
Quantitative Structure-Activity Relationship
2.
J Chem Inf Model ; 62(14): 3275-3280, 2022 07 25.
Article in English | MEDLINE | ID: mdl-35796226

ABSTRACT

As with many other institutions, our company maintains many quantitative structure-activity relationship (QSAR) models of absorption, distribution, metabolism, excretion, and toxicity (ADMET) end points and updates the models regularly. We recently examined version-to-version predictivity for these models over a period of 10 years. In this approach we monitor the goodness of prediction of new molecules relative to the training set of model version V before they are incorporated in the updated model V+1. Using a cell-based permeability assay (Papp) as an example, we illustrate how the QSAR models made from this data are generally predictive and can be utilized to enrich chemical designs and synthesis. Despite the obvious utility of these models, we turned up unexpected behavior in Papp and other ADMET activities for which the explanation is not obvious. One such behavior is that the apparent predictivity of the models as measured by root-mean-square-error can vary greatly from version to version and is sometimes very poor. One intuitively appealing explanation is that the observed activities of the new molecules fall outside the bulk of activities in the training set. Alternatively, one may think that the new molecules are exploring different regions of chemical space than the training set. However, the real explanation has to do with activity cliffs. If the observed activities of the new molecules are different than expected based on similar molecules in the training set, the predictions will be less accurate. This is true for all our ADMET end points.
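The version-to-version check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary layout, and the toy numbers are all assumptions. It only shows the idea of scoring version V on the molecules that arrive before version V+1 is built.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def version_to_version_rmse(observed_by_version, predicted_by_version):
    """Score each model version on the molecules measured after it was built.

    observed_by_version[v]  -> observed activities of the new molecules
    predicted_by_version[v] -> version v's predictions for those molecules
    (hypothetical layout chosen for this sketch)
    """
    return {
        v: rmse(observed_by_version[v], predicted_by_version[v])
        for v in observed_by_version
    }

# Hypothetical toy data: two model versions, three new molecules each.
observed = {"V1": [1.0, 2.0, 3.0], "V2": [1.0, 2.0, 3.0]}
predicted = {"V1": [1.0, 2.0, 3.0], "V2": [2.0, 3.0, 4.0]}
scores = version_to_version_rmse(observed, predicted)
```

Tracking `scores` across versions is what would reveal the large swings in apparent predictivity that the abstract reports.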


Subject(s)
Quantitative Structure-Activity Relationship
3.
4.
Chem Soc Rev ; 49(11): 3525-3564, 2020 06 07.
Article in English | MEDLINE | ID: mdl-32356548

ABSTRACT

Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.


Subject(s)
Chemistry, Pharmaceutical/methods; Drug-Related Side Effects and Adverse Reactions/metabolism; Pharmaceutical Preparations/chemistry; Algorithms; Animals; Artificial Intelligence; Databases, Factual; Drug Design; History, 20th Century; History, 21st Century; Humans; Models, Molecular; Quantitative Structure-Activity Relationship; Quantum Theory; Reproducibility of Results
5.
J Chem Inf Model ; 60(10): 4653-4663, 2020 10 26.
Article in English | MEDLINE | ID: mdl-33022174

ABSTRACT

While Gaussian process models are typically restricted to smaller data sets, we propose a variation which extends its applicability to the larger data sets common in the industrial drug discovery space, making it relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sub-linear with the sample size. The model can be efficiently built, permitting rapid updates to prevent degradation as new data is collected. Given its small number of hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the usual Gaussian process model, it natively produces principled and well-calibrated uncertainty estimates on its predictions. We compare this new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these promising advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.
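The general idea of a nearest-neighbor Gaussian process can be sketched as below. This is not the code from the repository named above, and several choices are assumptions made for illustration: a squared-exponential kernel, a zero-mean prior, fixed hyperparameters, and a brute-force neighbor search standing in for locality-sensitive hashing (so this sketch is linear, not sub-linear, in the sample size).

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length_scale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward, substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def nngp_predict(X, y, query, k=3, noise=1e-6, length_scale=1.0):
    """Fit a small GP on the k training points nearest to `query` only."""
    # Brute-force search; the paper uses locality-sensitive hashing instead.
    order = sorted(
        range(len(X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)),
    )[:k]
    Xn = [X[i] for i in order]
    yn = [y[i] for i in order]
    K = [
        [rbf(p, q, length_scale) + (noise if i == j else 0.0)
         for j, q in enumerate(Xn)]
        for i, p in enumerate(Xn)
    ]
    ks = [rbf(p, query, length_scale) for p in Xn]
    L = cholesky(K)
    mean = sum(a * b for a, b in zip(ks, chol_solve(L, yn)))
    var = rbf(query, query, length_scale) - sum(
        a * b for a, b in zip(ks, chol_solve(L, ks))
    )
    return mean, max(var, 0.0)  # clamp tiny negative round-off

# Hypothetical one-dimensional toy data.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0.0, 1.0, 4.0, 9.0]
mean_near, var_near = nngp_predict(X_demo, y_demo, [1.0])
mean_far, var_far = nngp_predict(X_demo, y_demo, [100.0])
```

At a training point the predictive variance collapses toward the noise level. Far from all data it returns to the prior variance, which is the kind of uncertainty estimate the abstract refers to.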


Subject(s)
Drug Discovery; Quantitative Structure-Activity Relationship; Cluster Analysis; Normal Distribution
6.
J Chem Inf Model ; 60(4): 1969-1982, 2020 04 27.
Article in English | MEDLINE | ID: mdl-32207612

ABSTRACT

Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.


Subject(s)
Quantitative Structure-Activity Relationship; Scientific Experimental Error; Structure-Activity Relationship; Uncertainty
7.
J Chem Inf Model ; 60(6): 2773-2790, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32250622

ABSTRACT

Protein redesign and engineering has become an important task in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been an increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel, single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.


Subject(s)
Machine Learning; Neural Networks, Computer; Algorithms; Amino Acid Sequence; Protein Engineering
8.
J Chem Inf Model ; 59(4): 1324-1337, 2019 04 22.
Article in English | MEDLINE | ID: mdl-30779563

ABSTRACT

Most chemists would agree that the ability to interpret a quantitative structure-activity relationship (QSAR) model is as important as the ability of the model to make accurate predictions. One type of interpretation is coloration of atoms in molecules according to the contribution of each atom to the predicted activity, as in "heat maps". The ability to determine which parts of a molecule increase the activity in question and which decrease it should be useful to chemists who want to modify the molecule. For that type of application, we would hope the coloration to not be particularly sensitive to the details of model building. In this Article, we examine a number of aspects of coloration against 20 combinations of descriptors and QSAR methods. We demonstrate that atom-level coloration is much less robust to descriptor/method combinations than cross-validated predictions. Even in ideal cases where the contribution of individual atoms is known, we cannot always recover the important atoms for some descriptor/method combinations. Thus, model interpretation by atom coloration may not be as simple as it first appeared.


Subject(s)
Computer Simulation; Quantitative Structure-Activity Relationship; Humans; Machine Learning; Workflow
9.
J Chem Inf Model ; 59(6): 2642-2655, 2019 06 24.
Article in English | MEDLINE | ID: mdl-30998343

ABSTRACT

Quantitative structure-activity relationship (QSAR) is a very commonly used technique for predicting the biological activity of a molecule using information contained in the molecular descriptors. The large number of compounds and descriptors and the sparseness of descriptors pose important challenges to traditional statistical methods and machine learning (ML) algorithms (such as random forest (RF)) used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Instead of only focusing on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, allowing one to also quantify uncertainties and hence to provide both point and interval estimation for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, generate interval estimates for molecular activities as well as descriptor importance, and conduct model diagnosis, which could not be easily handled through other approaches.


Subject(s)
Drug Discovery/methods; Quantitative Structure-Activity Relationship; Algorithms; Bayes Theorem; Machine Learning; Models, Chemical; Pharmaceutical Preparations/chemistry; Regression Analysis; Small Molecule Libraries/chemistry
10.
J Chem Inf Model ; 57(10): 2490-2504, 2017 10 23.
Article in English | MEDLINE | ID: mdl-28872869

ABSTRACT

Deep neural networks (DNNs) are complex computational models that have found great success in many artificial intelligence applications, such as computer vision and natural language processing. In the past four years, DNNs have also generated promising results for quantitative structure-activity relationship (QSAR) tasks. Previous work showed that DNNs can routinely make better predictions than traditional methods, such as random forests, on a diverse collection of QSAR data sets. It was also found that multitask DNN models (those trained on and predicting multiple QSAR properties simultaneously) outperform DNNs trained separately on the individual data sets in many, but not all, tasks. To date there has been no satisfactory explanation of why the QSAR of one task embedded in a multitask DNN can borrow information from other unrelated QSAR tasks. Thus, using multitask DNNs in a way that consistently provides a predictive advantage becomes a challenge. In this work, we explored why multitask DNNs make a difference in predictive performance. Our results show that during prediction a multitask DNN does borrow "signal" from molecules with similar structures in the training sets of the other tasks. However, whether this borrowing leads to better or worse predictive performance depends on whether the activities are correlated. On the basis of this, we have developed a strategy to use multitask DNNs that incorporate prior domain knowledge to select training sets with correlated activities, and we demonstrate its effectiveness on several examples.


Subject(s)
Models, Chemical; Neural Networks, Computer; Proteins/chemistry; Quantitative Structure-Activity Relationship; Artificial Intelligence; Computer Simulation; Drug Delivery Systems
11.
J Chem Inf Model ; 57(8): 2068-2076, 2017 08 28.
Article in English | MEDLINE | ID: mdl-28692267

ABSTRACT

Multitask deep learning has emerged as a powerful tool for computational drug discovery. However, despite a number of preliminary studies, multitask deep networks have yet to be widely deployed in the pharmaceutical and biotech industries. This lack of acceptance stems from both software difficulties and lack of understanding of the robustness of multitask deep networks. Our work aims to resolve both of these barriers to adoption. We introduce a high-quality open-source implementation of multitask deep networks as part of the DeepChem open-source platform. Our implementation enables simple python scripts to construct, fit, and evaluate sophisticated deep models. We use our implementation to analyze the performance of multitask deep networks and related deep models on four collections of pharmaceutical data (three of which have not previously been analyzed in the literature). We split these data sets into train/valid/test using time and neighbor splits to test multitask deep learning performance under challenging conditions. Our results demonstrate that multitask deep networks are surprisingly robust and can offer strong improvement over random forests. Our analysis and open-source implementation in DeepChem provide an argument that multitask deep networks are ready for widespread use in commercial drug discovery.


Subject(s)
Drug Discovery/methods; Machine Learning; Absorption, Radiation; Inhibitory Concentration 50; Protein Kinase Inhibitors/chemistry; Protein Kinase Inhibitors/pharmacology; Serine Proteinase Inhibitors/chemistry; Serine Proteinase Inhibitors/pharmacology; Software; Ultraviolet Rays
12.
J Chem Inf Model ; 56(11): 2253-2262, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27766848

ABSTRACT

Several papers have appeared in which a ligand efficiency index instead of pIC50 is used as the activity in QSAR. The claim is that better fits and predictions are obtained with ligand efficiency. We show on both public-domain and in-house data sets that the apparent superiority is a statistical artifact that occurs when ligand efficiency indices are correlated with the physical property included in their definition (number of non-hydrogens, ALOGP, TPSA, etc.) and when the property is easier to predict than the original pIC50.
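One ingredient of the artifact described above can be illustrated with a small numerical sketch. It is not the paper's analysis: the efficiency index is simplified to pIC50 divided by the number of non-hydrogen atoms, and the five data points are hypothetical, chosen so that pIC50 has no linear trend with molecular size.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: activity is unrelated to molecular size.
heavy_atoms = [10, 20, 30, 40, 50]
pic50 = [6.0, 7.0, 5.0, 7.0, 6.0]

# Simplified efficiency index (assumption): activity per non-hydrogen atom.
efficiency = [p / n for p, n in zip(pic50, heavy_atoms)]
inverse_size = [1.0 / n for n in heavy_atoms]

r_activity = pearson(pic50, inverse_size)         # close to zero
r_efficiency = pearson(efficiency, inverse_size)  # close to one
```

Because the index is dominated by the size term, a model that predicts size well will appear to predict the index well, even though it has learned nothing about the underlying pIC50.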


Subject(s)
Quantitative Structure-Activity Relationship; Humans; Inhibitory Concentration 50; Ligands
13.
J Chem Inf Model ; 56(12): 2353-2360, 2016 12 27.
Article in English | MEDLINE | ID: mdl-27958738

ABSTRACT

In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.


Subject(s)
Quantitative Structure-Activity Relationship; Algorithms; Databases, Pharmaceutical; Drug Discovery; Humans; Models, Biological; Software
14.
Molecules ; 21(10)2016 Sep 29.
Article in English | MEDLINE | ID: mdl-27689987

ABSTRACT

We apply matched molecular pair (MMP) analysis to data from ChirBase, which contains literature reports of chromatographic enantioseparations. For the 19 chiral stationary phases we examined, we were able to identify 289 sets of pairs where there is a statistically significant and consistent difference in enantioseparation due to a small chemical change. In many cases these changes highlight enantioselectivity differences between pairs or small families of closely related molecules that have for many years been used to probe the mechanisms of chromatographic chiral recognition; for example, the comparison of N-H vs. N-Me analytes to determine the criticality of an N-H hydrogen bond in chiral molecular recognition. In other cases, statistically significant MMPs surfaced by the analysis are less familiar or somewhat puzzling, sparking a need to generate and test hypotheses to more fully understand. Consequently, mining of appropriate datasets using MMP analysis provides an important new approach for studying and understanding the process of chromatographic enantioseparation.

15.
J Chem Inf Model ; 55(6): 1098-107, 2015 Jun 22.
Article in English | MEDLINE | ID: mdl-25998559

ABSTRACT

In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities (an "activity model"). The aim of the field of domain applicability (DA) is to estimate the uncertainty of prediction of a specific molecule on a specific activity model. A number of DA metrics have been proposed in the literature for this purpose. A quantitative model of the prediction uncertainty (an "error model") can be built using one or more of these metrics. A previous publication from our laboratory (Sheridan, R. P. J. Chem. Inf. Model. 2013, 53, 2837-2850) suggested that QSAR methods such as random forest could be used to build error models by fitting unsigned prediction errors against DA metrics. The QSAR paradigm contains two useful techniques: descriptor importance can determine which DA metrics are most useful, and cross-validation can be used to tell which subset of DA metrics is sufficient to estimate the unsigned errors. Previously we studied 10 large, diverse data sets and seven DA metrics. For those data sets for which it is possible to build a significant error model from those seven metrics, only two metrics were sufficient to account for almost all of the information in the error model. These were TREE_SD (the variation of prediction among random forest trees) and PREDICTED (the predicted activity itself). In this paper we show that when data sets are less diverse, as for example in QSAR models of molecules in a single chemical series, these two DA metrics become less important in explaining prediction error, and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule being predicted to the closest training set compound) becomes more important.
Our recommendation is that when the mean pairwise similarity (measured with the Carhart AP descriptor and the Dice similarity index) within a QSAR training set is less than 0.5, one can use only TREE_SD, PREDICTED to form the error model, but otherwise one should use TREE_SD, PREDICTED, SIMILARITYNEAREST1.
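The recommendation above amounts to a simple decision rule, sketched here under stated assumptions. Descriptors are represented as hypothetical `{feature: count}` dictionaries rather than real Carhart atom pairs, and the count-based form of the Dice index used below is one common generalization, not necessarily the exact formula the authors used.

```python
from itertools import combinations

def dice_similarity(a, b):
    """Dice similarity for count descriptors stored as {feature: count} dicts."""
    shared = sum(min(count, b.get(feature, 0)) for feature, count in a.items())
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 0.0

def mean_pairwise_similarity(descriptors):
    """Average Dice similarity over all distinct pairs in a training set."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(dice_similarity(a, b) for a, b in pairs) / len(pairs)

def choose_da_metrics(descriptors, threshold=0.5):
    """Pick the DA metrics for the error model from training-set diversity."""
    metrics = ["TREE_SD", "PREDICTED"]
    if mean_pairwise_similarity(descriptors) >= threshold:
        # Less diverse set (e.g. a single chemical series).
        metrics.append("SIMILARITYNEAREST1")
    return metrics

# Hypothetical toy descriptors; feature names are placeholders.
diverse_set = [{"a": 2}, {"b": 2}, {"c": 1}]
single_series = [{"a": 2, "b": 1}, {"a": 2, "b": 1, "c": 1}, {"a": 2, "b": 2}]

metrics_diverse = choose_da_metrics(diverse_set)
metrics_series = choose_da_metrics(single_series)
```

Here `metrics_diverse` holds the two-metric set and `metrics_series` adds the nearest-neighbor similarity, matching the rule stated in the abstract.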


Subject(s)
Informatics/methods; Quantitative Structure-Activity Relationship; Databases, Pharmaceutical; Models, Statistical; Uncertainty
16.
J Chem Inf Model ; 55(2): 263-74, 2015 Feb 23.
Article in English | MEDLINE | ID: mdl-25635324

ABSTRACT

Neural networks were widely used for quantitative structure-activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting, etc.), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years has witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advancements in computer hardware. In particular, deep neural nets (DNNs), i.e. neural nets with more than one hidden layer, have found great successes in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large diverse QSAR data sets that are taken from Merck's drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets, and a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of the parameters is demonstrated on additional data sets not used in the calibration. Although training DNNs is still computationally intensive, using graphical processing units (GPUs) can make this issue manageable.


Subject(s)
Neural Networks, Computer; Quantitative Structure-Activity Relationship; Algorithms; Drug Discovery; Machine Learning; Prospective Studies; Support Vector Machine; Workflow
17.
J Chem Inf Model ; 55(2): 231-8, 2015 Feb 23.
Article in English | MEDLINE | ID: mdl-25551659

ABSTRACT

During drug development, compounds are tested against counterscreens, a panel of off-target activities that would be undesirable for a drug to have. Testing every compound against every counterscreen is generally too costly in terms of time and money, and we need to find a rational way of prioritizing counterscreen testing. Here we present the eCounterscreening paradigm, wherein predictions from QSAR models for counterscreen activity are used to generate a recommendation as to whether a specific compound in a specific project should be tested against a specific counterscreen. The rules behind the recommendations, which can be summarized in a risk-benefit plot specific for a counterscreen/project combination, are based on a previously assembled database of prospective QSAR predictions. The recommendations require two user-defined cutoffs: the level of activity in a specific counterscreen that is considered undesirable and the level of risk the chemist is willing to accept that an undesired counterscreen activity will go undetected. We demonstrate in a simulated prospective experiment that eCounterscreening can be used to postpone a large fraction of counterscreen testing and still have an acceptably low risk of undetected counterscreen activity.


Subject(s)
High-Throughput Screening Assays/methods; Quantitative Structure-Activity Relationship; Algorithms; Data Mining; Databases, Factual; Drug Discovery; Drug-Related Side Effects and Adverse Reactions; Models, Chemical; Predictive Value of Tests; Risk Assessment
19.
J Chem Inf Model ; 54(4): 1083-92, 2014 Apr 28.
Article in English | MEDLINE | ID: mdl-24628044

ABSTRACT

In the pharmaceutical industry, it is common for large numbers of compounds to be tested for off-target activities. Given a compound synthesized for an on-target project P, what is the best way to predict its off-target activity X? Is it better to use a global quantitative structure-activity relationship (QSAR) model calibrated against all compounds tested for X, or is it better to use a local model for X calibrated against only the set of compounds in project P? The literature is not consistent on this topic, and strong claims have been made for either. One particular idea is that local models will be superior to global models in prospective prediction if one generates many local models and chooses the type of local model that best predicts recent data. We tested this idea via simulated prospective prediction using in-house data involving compounds in 11 projects tested for 9 off-target activities. In our hands, the local model that best predicts the recent past is seldom the local model that is best at predicting the immediate future. Also, the local model that best predicts the recent past is not systematically better than the global model. This means the complexity of having project- or series-specific models for X can be avoided; a single global model for X is sufficient. We suggest that the relative predictivity of global vs local models may depend on the type of chemical descriptor used. Finally, we speculate why, contrary to observation, intuition suggests local models should be superior to global models.


Subject(s)
Models, Chemical; Quantitative Structure-Activity Relationship; Calibration
20.
J Chem Inf Model ; 54(6): 1604-16, 2014 Jun 23.
Article in English | MEDLINE | ID: mdl-24802889

ABSTRACT

This paper brings together the concepts of molecular complexity and crowdsourcing. An exercise was done at Merck where 386 chemists voted on the molecular complexity (on a scale of 1-5) of 2681 molecules taken from various sources: public, licensed, and in-house. The meanComplexity of a molecule is the average over all votes for that molecule. As long as enough votes are cast per molecule, we find meanComplexity is quite easy to model with QSAR methods using only a handful of physical descriptors (e.g., number of chiral centers, number of unique topological torsions, a Wiener index, etc.). The high level of self-consistency of the model (cross-validated R² ∼0.88) is remarkable given that our chemists do not agree with each other strongly about the complexity of any given molecule. Thus, the power of crowdsourcing is clearly demonstrated in this case. The meanComplexity appears to be correlated with at least one metric of synthetic complexity from the literature derived in a different way and is correlated with values of process mass intensity (PMI) from the literature and from in-house studies. Complexity can be used to differentiate between in-house programs and to follow a program over time.


Subject(s)
Crowdsourcing; Molecular Structure; Databases, Chemical; Humans; Models, Chemical; Quantitative Structure-Activity Relationship; Stereoisomerism