Results 1 - 20 of 57
1.
J Chem Inf Model ; 62(15): 3477-3485, 2022 08 08.
Article in English | MEDLINE | ID: mdl-35849796

ABSTRACT

As with other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.


Subjects
Quantitative Structure-Activity Relationship
2.
J Chem Inf Model ; 62(14): 3275-3280, 2022 07 25.
Article in English | MEDLINE | ID: mdl-35796226

ABSTRACT

As with many other institutions, our company maintains many quantitative structure-activity relationship (QSAR) models of absorption, distribution, metabolism, excretion, and toxicity (ADMET) end points and updates the models regularly. We recently examined version-to-version predictivity for these models over a period of 10 years. In this approach we monitor the goodness of prediction of new molecules relative to the training set of model version V before they are incorporated in the updated model V+1. Using a cell-based permeability assay (Papp) as an example, we illustrate how the QSAR models made from this data are generally predictive and can be utilized to enrich chemical designs and synthesis. Despite the obvious utility of these models, we turned up unexpected behavior in Papp and other ADMET activities for which the explanation is not obvious. One such behavior is that the apparent predictivity of the models as measured by root-mean-square-error can vary greatly from version to version and is sometimes very poor. One intuitively appealing explanation is that the observed activities of the new molecules fall outside the bulk of activities in the training set. Alternatively, one may think that the new molecules are exploring different regions of chemical space than the training set. However, the real explanation has to do with activity cliffs. If the observed activities of the new molecules are different than expected based on similar molecules in the training set, the predictions will be less accurate. This is true for all our ADMET end points.
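
The version-to-version check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the version labels, and the activity values are all hypothetical. For each model version V, it scores the molecules measured after V was built, using the predictions V made for them before they were folded into V+1.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted activities."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def version_to_version_rmse(batches):
    """Map each version V to the RMSE of model V on molecules that are new after V.

    `batches` maps a version label to (observed, predicted), where `predicted`
    was produced by model V before those molecules entered the training set.
    """
    return {version: rmse(obs, pred) for version, (obs, pred) in batches.items()}

# Hypothetical activities, for illustration only.
batches = {
    "V1": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    "V2": ([1.0, 2.0], [2.0, 4.0]),
}
scores = version_to_version_rmse(batches)
print(scores)
```

Tracking these scores over successive versions is what exposes the large swings in apparent predictivity that the abstract reports.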


Subjects
Quantitative Structure-Activity Relationship
3.
Chem Soc Rev ; 49(11): 3525-3564, 2020 06 07.
Article in English | MEDLINE | ID: mdl-32356548

ABSTRACT

Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.


Subjects
Chemistry, Pharmaceutical/methods; Drug-Related Side Effects and Adverse Reactions/metabolism; Pharmaceutical Preparations/chemistry; Algorithms; Animals; Artificial Intelligence; Databases, Factual; Drug Design; History, 20th Century; History, 21st Century; Humans; Models, Molecular; Quantitative Structure-Activity Relationship; Quantum Theory; Reproducibility of Results
5.
J Chem Inf Model ; 60(10): 4653-4663, 2020 10 26.
Article in English | MEDLINE | ID: mdl-33022174

ABSTRACT

While Gaussian process models are typically restricted to smaller data sets, we propose a variation which extends its applicability to the larger data sets common in the industrial drug discovery space, making it relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sub-linear with the sample size. The model can be efficiently built, permitting rapid updates to prevent degradation as new data is collected. Given its small number of hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the usual Gaussian process model, it natively produces principled and well-calibrated uncertainty estimates on its predictions. We compare this new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these promising advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.


Subjects
Drug Discovery; Quantitative Structure-Activity Relationship; Cluster Analysis; Normal Distribution
6.
J Chem Inf Model ; 60(4): 1969-1982, 2020 04 27.
Article in English | MEDLINE | ID: mdl-32207612

ABSTRACT

Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.


Subjects
Quantitative Structure-Activity Relationship; Scientific Experimental Error; Structure-Activity Relationship; Uncertainty
7.
J Chem Inf Model ; 60(6): 2773-2790, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32250622

ABSTRACT

Protein redesign and engineering has become an important task in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been an increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel, single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.


Subjects
Machine Learning; Neural Networks, Computer; Algorithms; Amino Acid Sequence; Protein Engineering
8.
J Chem Inf Model ; 59(4): 1324-1337, 2019 04 22.
Article in English | MEDLINE | ID: mdl-30779563

ABSTRACT

Most chemists would agree that the ability to interpret a quantitative structure-activity relationship (QSAR) model is as important as the ability of the model to make accurate predictions. One type of interpretation is coloration of atoms in molecules according to the contribution of each atom to the predicted activity, as in "heat maps". The ability to determine which parts of a molecule increase the activity in question and which decrease it should be useful to chemists who want to modify the molecule. For that type of application, we would hope the coloration to not be particularly sensitive to the details of model building. In this Article, we examine a number of aspects of coloration against 20 combinations of descriptors and QSAR methods. We demonstrate that atom-level coloration is much less robust to descriptor/method combinations than cross-validated predictions. Even in ideal cases where the contribution of individual atoms is known, we cannot always recover the important atoms for some descriptor/method combinations. Thus, model interpretation by atom coloration may not be as simple as it first appeared.


Subjects
Computer Simulation; Quantitative Structure-Activity Relationship; Humans; Machine Learning; Workflow
9.
J Chem Inf Model ; 59(6): 2642-2655, 2019 06 24.
Article in English | MEDLINE | ID: mdl-30998343

ABSTRACT

Quantitative structure-activity relationship (QSAR) is a very commonly used technique for predicting the biological activity of a molecule using information contained in the molecular descriptors. The large number of compounds and descriptors and the sparseness of descriptors pose important challenges to traditional statistical methods and machine learning (ML) algorithms (such as random forest (RF)) used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Instead of only focusing on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, allowing one to also quantify uncertainties and hence to provide both point and interval estimation for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, generate interval estimates for molecular activities as well as descriptor importance, and conduct model diagnosis, which could not be easily handled through other approaches.


Subjects
Drug Discovery/methods; Quantitative Structure-Activity Relationship; Algorithms; Bayes Theorem; Machine Learning; Models, Chemical; Pharmaceutical Preparations/chemistry; Regression Analysis; Small Molecule Libraries/chemistry
10.
J Chem Inf Model ; 57(10): 2490-2504, 2017 10 23.
Article in English | MEDLINE | ID: mdl-28872869

ABSTRACT

Deep neural networks (DNNs) are complex computational models that have found great success in many artificial intelligence applications, such as computer vision and natural language processing. In the past four years, DNNs have also generated promising results for quantitative structure-activity relationship (QSAR) tasks. Previous work showed that DNNs can routinely make better predictions than traditional methods, such as random forests, on a diverse collection of QSAR data sets. It was also found that multitask DNN models (those trained on and predicting multiple QSAR properties simultaneously) outperform DNNs trained separately on the individual data sets in many, but not all, tasks. To date there has been no satisfactory explanation of why the QSAR of one task embedded in a multitask DNN can borrow information from other unrelated QSAR tasks. Thus, using multitask DNNs in a way that consistently provides a predictive advantage becomes a challenge. In this work, we explored why multitask DNNs make a difference in predictive performance. Our results show that during prediction a multitask DNN does borrow "signal" from molecules with similar structures in the training sets of the other tasks. However, whether this borrowing leads to better or worse predictive performance depends on whether the activities are correlated. On the basis of this, we have developed a strategy to use multitask DNNs that incorporate prior domain knowledge to select training sets with correlated activities, and we demonstrate its effectiveness on several examples.


Subjects
Models, Chemical; Neural Networks, Computer; Proteins/chemistry; Quantitative Structure-Activity Relationship; Artificial Intelligence; Computer Simulation; Drug Delivery Systems
11.
J Chem Inf Model ; 57(8): 2068-2076, 2017 08 28.
Article in English | MEDLINE | ID: mdl-28692267

ABSTRACT

Multitask deep learning has emerged as a powerful tool for computational drug discovery. However, despite a number of preliminary studies, multitask deep networks have yet to be widely deployed in the pharmaceutical and biotech industries. This lack of acceptance stems from both software difficulties and lack of understanding of the robustness of multitask deep networks. Our work aims to resolve both of these barriers to adoption. We introduce a high-quality open-source implementation of multitask deep networks as part of the DeepChem open-source platform. Our implementation enables simple python scripts to construct, fit, and evaluate sophisticated deep models. We use our implementation to analyze the performance of multitask deep networks and related deep models on four collections of pharmaceutical data (three of which have not previously been analyzed in the literature). We split these data sets into train/valid/test using time and neighbor splits to test multitask deep learning performance under challenging conditions. Our results demonstrate that multitask deep networks are surprisingly robust and can offer strong improvement over random forests. Our analysis and open-source implementation in DeepChem provide an argument that multitask deep networks are ready for widespread use in commercial drug discovery.


Subjects
Drug Discovery/methods; Machine Learning; Absorption, Radiation; Inhibitory Concentration 50; Protein Kinase Inhibitors/chemistry; Protein Kinase Inhibitors/pharmacology; Serine Proteinase Inhibitors/chemistry; Serine Proteinase Inhibitors/pharmacology; Software; Ultraviolet Rays
12.
J Chem Inf Model ; 56(11): 2253-2262, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27766848

ABSTRACT

Several papers have appeared in which a ligand efficiency index instead of pIC50 is used as the activity in QSAR. The claim is that better fits and predictions are obtained with ligand efficiency. We show on both public-domain and in-house data sets that the apparent superiority is a statistical artifact that occurs when ligand efficiency indices are correlated with the physical property included in their definition (number of non-hydrogens, ALOGP, TPSA, etc.) and when the property is easier to predict than the original pIC50.
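
The artifact described above can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' analysis: the data are simulated, ligand efficiency is simplified to pIC50 divided by the heavy-atom count, and the "model" is deliberately ignorant of activity. The pIC50 values are pure noise, so nothing can predict them, yet the efficiency index still looks predictable because the easy-to-compute atom count sits in its denominator.

```python
import random

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated data (assumption): pIC50 is noise around 6.0, unrelated to structure,
# while the heavy-atom count is known exactly for every molecule.
heavy_atoms = [rng.randint(15, 40) for _ in range(500)]
pic50 = [rng.gauss(6.0, 0.5) for _ in range(500)]

# Simplified ligand efficiency: activity per non-hydrogen atom.
ligand_eff = [p / n for p, n in zip(pic50, heavy_atoms)]

# A "model" with no knowledge of activity: it always predicts pIC50 = 6.0,
# which implies a ligand-efficiency prediction of 6.0 / heavy_atoms.
pred_pic50 = [6.0] * len(pic50)
pred_le = [6.0 / n for n in heavy_atoms]

r2_pic50 = r_squared(pic50, pred_pic50)
r2_le = r_squared(ligand_eff, pred_le)
print(f"R^2 on pIC50: {r2_pic50:.2f}; R^2 on ligand efficiency: {r2_le:.2f}")
```

The constant model cannot score above zero on pIC50, but it scores well above zero on the efficiency index, purely because the atom count varies across molecules.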


Subjects
Quantitative Structure-Activity Relationship; Humans; Inhibitory Concentration 50; Ligands
13.
J Chem Inf Model ; 56(12): 2353-2360, 2016 12 27.
Article in English | MEDLINE | ID: mdl-27958738

ABSTRACT

In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.


Subjects
Quantitative Structure-Activity Relationship; Algorithms; Databases, Pharmaceutical; Drug Discovery; Humans; Models, Biological; Software
14.
Molecules ; 21(10), 2016 Sep 29.
Article in English | MEDLINE | ID: mdl-27689987

ABSTRACT

We apply matched molecular pair (MMP) analysis to data from ChirBase, which contains literature reports of chromatographic enantioseparations. For the 19 chiral stationary phases we examined, we were able to identify 289 sets of pairs where there is a statistically significant and consistent difference in enantioseparation due to a small chemical change. In many cases these changes highlight enantioselectivity differences between pairs or small families of closely related molecules that have for many years been used to probe the mechanisms of chromatographic chiral recognition; for example, the comparison of N-H vs. N-Me analytes to determine the criticality of an N-H hydrogen bond in chiral molecular recognition. In other cases, statistically significant MMPs surfaced by the analysis are less familiar or somewhat puzzling, sparking a need to generate and test hypotheses to more fully understand. Consequently, mining of appropriate datasets using MMP analysis provides an important new approach for studying and understanding the process of chromatographic enantioseparation.

15.
J Chem Inf Model ; 55(6): 1098-107, 2015 Jun 22.
Article in English | MEDLINE | ID: mdl-25998559

ABSTRACT

In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities (an "activity model"). The aim of the field of domain applicability (DA) is to estimate the uncertainty of prediction of a specific molecule on a specific activity model. A number of DA metrics have been proposed in the literature for this purpose. A quantitative model of the prediction uncertainty (an "error model") can be built using one or more of these metrics. A previous publication from our laboratory (Sheridan, R. P. J. Chem. Inf. Model. 2013, 53, 2837-2850) suggested that QSAR methods such as random forest could be used to build error models by fitting unsigned prediction errors against DA metrics. The QSAR paradigm contains two useful techniques: descriptor importance can determine which DA metrics are most useful, and cross-validation can be used to tell which subset of DA metrics is sufficient to estimate the unsigned errors. Previously we studied 10 large, diverse data sets and seven DA metrics. For those data sets for which it is possible to build a significant error model from those seven metrics, only two metrics were sufficient to account for almost all of the information in the error model. These were TREE_SD (the variation of prediction among random forest trees) and PREDICTED (the predicted activity itself). In this paper we show that when data sets are less diverse, as for example in QSAR models of molecules in a single chemical series, these two DA metrics become less important in explaining prediction error, and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule being predicted to the closest training set compound) becomes more important. Our recommendation is that when the mean pairwise similarity (measured with the Carhart AP descriptor and the Dice similarity index) within a QSAR training set is less than 0.5, one can use only TREE_SD and PREDICTED to form the error model, but otherwise one should use TREE_SD, PREDICTED, and SIMILARITYNEAREST1.
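
The closing recommendation amounts to a simple decision rule, sketched below. This is an illustration only: descriptors are assumed to be dictionaries of feature counts, the feature names are made up, and the count-based Dice formula shown is one common form that may differ in detail from the one used in the paper.

```python
from itertools import combinations

def dice_similarity(counts_a, counts_b):
    """Dice index for two count descriptors given as {feature: count} dicts."""
    shared = sum(min(count, counts_b.get(feature, 0))
                 for feature, count in counts_a.items())
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 2.0 * shared / total if total else 0.0

def mean_pairwise_similarity(descriptors):
    """Average Dice similarity over all distinct pairs in the training set."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(dice_similarity(a, b) for a, b in pairs) / len(pairs)

def choose_da_metrics(descriptors, threshold=0.5):
    """Apply the abstract's rule: add SIMILARITYNEAREST1 unless the set is diverse."""
    metrics = ["TREE_SD", "PREDICTED"]
    if mean_pairwise_similarity(descriptors) >= threshold:
        metrics.append("SIMILARITYNEAREST1")
    return metrics

# Hypothetical atom-pair style counts (feature names are invented).
diverse_set = [{"C-N-3": 2, "C-O-2": 1}, {"C-N-3": 2, "C-O-2": 1}, {"S-S-1": 3}]
single_series = [{"C-N-3": 2, "C-O-2": 1}, {"C-N-3": 2, "C-O-2": 2}]

diverse_choice = choose_da_metrics(diverse_set)
series_choice = choose_da_metrics(single_series)
print(diverse_choice, series_choice)
```

Computing all pairs is quadratic in the training set size, so for large sets a random sample of pairs would be a reasonable substitute.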


Subjects
Informatics/methods; Quantitative Structure-Activity Relationship; Databases, Pharmaceutical; Models, Statistical; Uncertainty
16.
J Chem Inf Model ; 55(2): 231-8, 2015 Feb 23.
Article in English | MEDLINE | ID: mdl-25551659

ABSTRACT

During drug development, compounds are tested against counterscreens, a panel of off-target activities that would be undesirable for a drug to have. Testing every compound against every counterscreen is generally too costly in terms of time and money, and we need to find a rational way of prioritizing counterscreen testing. Here we present the eCounterscreening paradigm, wherein predictions from QSAR models for counterscreen activity are used to generate a recommendation as to whether a specific compound in a specific project should be tested against a specific counterscreen. The rules behind the recommendations, which can be summarized in a risk-benefit plot specific for a counterscreen/project combination, are based on a previously assembled database of prospective QSAR predictions. The recommendations require two user-defined cutoffs: the level of activity in a specific counterscreen that is considered undesirable and the level of risk the chemist is willing to accept that an undesired counterscreen activity will go undetected. We demonstrate in a simulated prospective experiment that eCounterscreening can be used to postpone a large fraction of counterscreen testing and still have an acceptably low risk of undetected counterscreen activity.
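
One way to picture how the two user-defined cutoffs could combine is sketched below. The abstract does not spell out the actual rules, so everything here is an assumption made for illustration: the windowed risk estimate, the function names, the `window` parameter, and the example numbers. Risk is estimated from past prospective predictions as the fraction of compounds with a similar predicted value whose measured activity reached the undesirable level.

```python
def estimated_risk(history, prediction, activity_cutoff, window=0.5):
    """Fraction of past compounds with a similar prediction that proved active.

    `history` is a list of (predicted, observed) pairs from earlier prospective
    predictions; "similar" means within `window` of the new prediction.
    Returns None when there is no past evidence near this prediction.
    """
    nearby = [obs for pred, obs in history if abs(pred - prediction) <= window]
    if not nearby:
        return None
    return sum(obs >= activity_cutoff for obs in nearby) / len(nearby)

def recommend_test(history, prediction, activity_cutoff, risk_cutoff, window=0.5):
    """Recommend testing when the risk is above tolerance or cannot be estimated."""
    risk = estimated_risk(history, prediction, activity_cutoff, window)
    return risk is None or risk > risk_cutoff

# Hypothetical (predicted, observed) activities on a pIC50-like scale.
history = [(4.5, 4.4), (4.8, 5.2), (5.0, 4.9), (6.5, 6.8), (6.7, 6.4), (7.0, 7.1)]

low_pred = recommend_test(history, 4.7, activity_cutoff=6.0, risk_cutoff=0.1)
high_pred = recommend_test(history, 6.8, activity_cutoff=6.0, risk_cutoff=0.1)
print(low_pred, high_pred)
```

Defaulting to "test" when no nearby history exists is a conservative choice made for this sketch; it keeps the rule from postponing assays on the basis of no evidence.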


Subjects
High-Throughput Screening Assays/methods; Quantitative Structure-Activity Relationship; Algorithms; Data Mining; Databases, Factual; Drug Discovery; Drug-Related Side Effects and Adverse Reactions; Models, Chemical; Predictive Value of Tests; Risk Assessment
17.
J Chem Inf Model ; 55(2): 263-74, 2015 Feb 23.
Article in English | MEDLINE | ID: mdl-25635324

ABSTRACT

Neural networks were widely used for quantitative structure-activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years have witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advancements in computer hardware. In particular, deep neural nets (DNNs), i.e., neural nets with more than one hidden layer, have found great success in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large diverse QSAR data sets that are taken from Merck's drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets, and a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of the parameters is demonstrated on additional data sets not used in the calibration. Although training DNNs is still computationally intensive, using graphical processing units (GPUs) can make this issue manageable.


Subjects
Neural Networks, Computer; Quantitative Structure-Activity Relationship; Algorithms; Drug Discovery; Machine Learning; Prospective Studies; Support Vector Machine; Workflow
19.
J Chem Inf Model ; 54(4): 1083-92, 2014 Apr 28.
Article in English | MEDLINE | ID: mdl-24628044

ABSTRACT

In the pharmaceutical industry, it is common for large numbers of compounds to be tested for off-target activities. Given a compound synthesized for an on-target project P, what is the best way to predict its off-target activity X? Is it better to use a global quantitative structure-activity relationship (QSAR) model calibrated against all compounds tested for X, or is it better to use a local model for X calibrated against only the set of compounds in project P? The literature is not consistent on this topic, and strong claims have been made for either. One particular idea is that local models will be superior to global models in prospective prediction if one generates many local models and chooses the type of local model that best predicts recent data. We tested this idea via simulated prospective prediction using in-house data involving compounds in 11 projects tested for 9 off-target activities. In our hands, the local model that best predicts the recent past is seldom the local model that is best at predicting the immediate future. Also, the local model that best predicts the recent past is not systematically better than the global model. This means the complexity of having project- or series-specific models for X can be avoided; a single global model for X is sufficient. We suggest that the relative predictivity of global vs local models may depend on the type of chemical descriptor used. Finally, we speculate why, contrary to observation, intuition suggests local models should be superior to global models.


Subjects
Models, Chemical; Quantitative Structure-Activity Relationship; Calibration
20.
J Chem Inf Model ; 54(6): 1604-16, 2014 Jun 23.
Article in English | MEDLINE | ID: mdl-24802889

ABSTRACT

This paper brings together the concepts of molecular complexity and crowdsourcing. An exercise was done at Merck where 386 chemists voted on the molecular complexity (on a scale of 1-5) of 2681 molecules taken from various sources: public, licensed, and in-house. The meanComplexity of a molecule is the average over all votes for that molecule. As long as enough votes are cast per molecule, we find meanComplexity is quite easy to model with QSAR methods using only a handful of physical descriptors (e.g., number of chiral centers, number of unique topological torsions, a Wiener index, etc.). The high level of self-consistency of the model (cross-validated R² ~0.88) is remarkable given that our chemists do not agree with each other strongly about the complexity of any given molecule. Thus, the power of crowdsourcing is clearly demonstrated in this case. The meanComplexity appears to be correlated with at least one metric of synthetic complexity from the literature derived in a different way and is correlated with values of process mass intensity (PMI) from the literature and from in-house studies. Complexity can be used to differentiate between in-house programs and to follow a program over time.
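
The meanComplexity definition above is a per-molecule average of votes, which can be sketched in a few lines. The molecule identifiers, the votes, and the `min_votes` threshold are hypothetical; the abstract says only that enough votes must be cast per molecule, without giving a number.

```python
from statistics import mean

def mean_complexity(votes, min_votes=3):
    """Average the 1-5 complexity votes per molecule.

    `votes` maps a molecule identifier to its list of votes. Molecules with
    fewer than `min_votes` votes are dropped (threshold is an assumption).
    """
    return {mol: mean(scores)
            for mol, scores in votes.items()
            if len(scores) >= min_votes}

# Hypothetical votes, for illustration only.
votes = {
    "mol_A": [1, 2, 3],
    "mol_B": [5, 4],        # too few votes under the assumed threshold
    "mol_C": [4, 4, 5, 3],
}
complexity = mean_complexity(votes)
print(complexity)
```

The resulting averages would then serve as the activity column for an ordinary QSAR fit against the physical descriptors the abstract lists.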


Subjects
Crowdsourcing; Molecular Structure; Databases, Chemical; Humans; Models, Chemical; Quantitative Structure-Activity Relationship; Stereoisomerism