ABSTRACT
Machine-learning (ML) and deep-learning (DL) models that predict the properties of small molecules are increasingly deployed within the design-make-test-analyze (DMTA) drug design cycle. Despite this uptake, there are only a few automated packages to aid their development and deployment that also support uncertainty estimation, model explainability, and other key aspects of model usage. This represents a key unmet need within the field, and the large number of molecular representations and algorithms (and associated parameters) means it is nontrivial to robustly optimize, evaluate, reproduce, and deploy models. Here, we present QSARtuna, a molecule property prediction modeling pipeline, written in Python and utilizing the Optuna, Scikit-learn, RDKit, and ChemProp packages, which enables the efficient and automated comparison between molecular representations and machine learning models. The platform was designed from the outset around the increasingly important aspects of model uncertainty quantification and explainability. We describe the framework and provide illustrative examples to demonstrate the capability of the software when applied to simple molecular property prediction, reaction/reactivity prediction, and DNA encoded library enrichment classification. We hope that the release of QSARtuna will further spur innovation in automatic ML modeling and provide a platform for education of best practices in molecular property modeling. The code for the QSARtuna framework is made freely available via GitHub.
Subjects
Drug Design , Quantitative Structure-Activity Relationship , Software , Machine Learning , Models, Molecular , Automation
ABSTRACT
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
Subjects
Benchmarking , Quantitative Structure-Activity Relationship , Biological Assay , Machine Learning
ABSTRACT
We report a significant decrease in transcription of the G protein-coupled receptor GPR39 in striatal neurons of Parkinson's disease patients compared to healthy controls, suggesting that a positive modulator of GPR39 may beneficially impact neuroprotection. To test this notion, we developed various structurally diverse tool molecules. While we elaborated on previously reported starting points, we also performed an in silico screen which led to completely novel pharmacophores. In vitro studies indicated that GPR39 agonism does not have a profound effect on neuroprotection.
Subjects
Pyrimidines/pharmacology , Receptors, G-Protein-Coupled/agonists , Allosteric Regulation/drug effects , Dose-Response Relationship, Drug , Humans , Molecular Structure , Pyrimidines/chemical synthesis , Pyrimidines/chemistry , Receptors, G-Protein-Coupled/metabolism , Structure-Activity Relationship
ABSTRACT
The understanding of the mechanism-of-action (MoA) of compounds and the prediction of potential drug targets play an important role in small-molecule drug discovery. The aim of this work was to compare chemical and cell morphology information for bioactivity prediction. The comparison was performed using bioactivity data from the ExCAPE database, image data (in the form of CellProfiler features) from the Cell Painting data set (the largest publicly available data set of cell images with ~30,000 compound perturbations), and extended connectivity fingerprints (ECFPs) using the multitask Bayesian matrix factorization (BMF) approach Macau. We found that the BMF Macau and random forest (RF) performance were overall similar when ECFPs were used as compound descriptors. However, BMF Macau outperformed RF in 159 out of 224 targets (71%) when image data were used as compound information. Using BMF Macau, 100 (corresponding to about 45%) and 90 (about 40%) of the 224 targets were predicted with high predictive performance (AUC > 0.8) with ECFP data and image data as side information, respectively. There were targets better predicted by image data as side information, such as β-catenin, and others better predicted by fingerprint-based side information, such as proteins belonging to the G-protein-coupled receptor 1 family, which could be rationalized from the underlying data distributions in each descriptor domain. In conclusion, both cell morphology changes and chemical structure information contain information about compound bioactivity, which is also partially complementary, and can hence contribute to in silico MoA analysis.
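To make the AUC > 0.8 criterion used above concrete, the following is a minimal sketch of how a ROC AUC can be computed from predicted scores and binary activity labels, using the pairwise-ranking definition. The function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the study.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration;
    it is not part of Macau, ExCAPE, or any package named in the abstract.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


if __name__ == "__main__":
    # Toy scores for four compounds against one target (made-up values).
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Under this definition, a target would count as "predicted with high predictive performance" when the value returned for its held-out compounds exceeds 0.8.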
Subjects
Drug Discovery , Proteins , Bayes Theorem , Computer Simulation , Databases, Factual
ABSTRACT
In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a probability of binding to a protein target is not yet satisfactorily addressed. In this study, we compared the performance of three such methods, namely, Platt scaling (PS), isotonic regression (IR), and Venn-ABERS predictors (VA), in calibrating prediction scores obtained from ligand-target prediction comprising the Naïve Bayes, support vector machines, and random forest (RF) algorithms. Calibration quality was assessed on bioactivity data available at AstraZeneca for 40 million data points (compound-target pairs) across 2112 targets and performance was assessed using stratified shuffle split (SSS) and leave 20% of scaffolds out (L20SO) validation. VA achieved the best calibration performances across all machine learning algorithms and cross validation methods tested and also the lowest (best) Brier score loss (mean squared difference between the outputted probability estimates assigned to a compound and the actual outcome). In comparison, the PS and IR methods can actually degrade the assigned probability estimates, particularly for the RF for SSS and during L20SO. Sphere exclusion, a method to sample additional (putative) inactive compounds, was shown to inflate the overall Brier score loss performance, through the artificial requirement for inactive molecules to be dissimilar to active compounds, but was shown to result in overconfident estimators. VA was able to successfully calibrate the probability estimates for even small calibration sets. The multiprobability values (lower and upper probability boundary intervals) were shown to produce large discordance for test set molecules that are neither very similar nor very dissimilar to the active training set, which were hence difficult to predict, suggesting that multiprobability discordance can be used as an estimate for target prediction uncertainty. 
Overall, we were able to show in this work that VA scaling of target prediction models is able to improve probability estimates in all testing instances and is currently being applied for in-house approaches.
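As an illustration of two ingredients named above, the sketch below computes the Brier score loss (mean squared difference between predicted probability and outcome) and fits a basic isotonic-regression calibrator with the pool-adjacent-violators procedure. It is a toy under stated assumptions: the function names and data are hypothetical, the calibrator is a plain step function, and it does not implement Venn-ABERS predictors or reproduce the study's pipeline.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def isotonic_fit(scores, outcomes):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns a list of (upper_score, probability) steps (illustrative format).
    """
    blocks = []  # each block: [sum_of_outcomes, count, highest_score_in_block]
    for score, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, score])
        # Merge backwards while the previous block's mean is not below the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def isotonic_predict(model, score):
    """Return the calibrated probability of the first step covering the score."""
    for top, prob in model:
        if score <= top:
            return prob
    return model[-1][1]


if __name__ == "__main__":
    raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # made-up scores
    truth = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]                     # made-up labels
    model = isotonic_fit(raw, truth)
    calibrated = [isotonic_predict(model, s) for s in raw]
    print(brier_score(raw, truth), brier_score(calibrated, truth))
```

In this example the calibrator is fitted and evaluated on the same ten points, so the improvement in Brier score is guaranteed; a fair assessment would use a held-out calibration set, as in the validation schemes described above.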
Subjects
Machine Learning , Support Vector Machine , Bayes Theorem , Ligands , Probability
ABSTRACT
Motivation: In silico approaches often fail to utilize bioactivity data available for orthologous targets due to insufficient evidence highlighting the benefit of such an approach. Deeper investigation into orthologue chemical space and its influence toward expanding compound and target coverage is necessary to improve the confidence in this practice. Results: Here we present analysis of the orthologue chemical space in ChEMBL and PubChem and its impact on target prediction. We show that the number of conflicting bioactivities between human and orthologues is low and that annotations are overall compatible. Chemical space analysis shows orthologues are chemically dissimilar to human with high intra-group similarity, suggesting they could effectively extend the chemical space modelled. Based on these observations, we show the benefit of orthologue inclusion in terms of novel target coverage. We also benchmarked predictive models using a time-series split and using bioactivities from Chemistry Connect and HTS data available at AstraZeneca, showing that orthologue bioactivity inclusion statistically improved performance. Availability and implementation: Orthologue-based bioactivity prediction and the compound training set are available at www.github.com/lhm30/PIDGINv2. Contact: ab454@cam.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Subjects
Computational Biology/methods , Computer Simulation , Drug Discovery/methods , Proteins/metabolism , Sequence Homology, Amino Acid , Animals , Humans , Ligands , Models, Biological , Proteins/drug effects
ABSTRACT
One important but poorly understood concept of Traditional Chinese Medicine (TCM) is that of the hot, cold, and neutral nature of its bioactive principles. To advance the field, in this study, we analyzed compound-nature pairs from TCM on a large scale (>23,000 structures) via chemical space visualizations to understand its physicochemical domain and in silico target prediction to understand differences related to their modes-of-action (MoA) against proteins. We found that overall TCM natures spread into different subclusters with specific molecular patterns, as opposed to forming coherent global groups. Compounds associated with cold nature had a lower clogP and contained more aliphatic rings than the other groups and were found to control detoxification, heat-clearing, and heart development processes and to have sedative function, associated with "Mental and behavioural disorders" diseases. Compounds associated with hot nature, in contrast, were on average of lower molecular weight, had more aromatic ring systems than other groups, frequently seemed to control body temperature, had cardio-protection function, improved fertility and sexual function, and represented excitatory or activating effects, associated with "endocrine, nutritional and metabolic diseases" and "diseases of the circulatory system". Compounds associated with neutral nature had a higher polar surface area and contained more cyclohexene moieties than other groups and seemed to be related to memory function, suggesting that their nature may be a useful guide for their utility in neural degenerative diseases. We were hence able to elucidate the difference between different nature classes in TCM on the molecular level, and on a large data set, for the first time, thereby helping a better understanding of TCM nature theory and bridging the gap between traditional medicine and our current understanding of the human body.
Subjects
Computer Simulation , Medicine, Chinese Traditional , Molecular Targeted Therapy
ABSTRACT
REINVENT 4 is a modern open-source generative AI framework for the design of small molecules. The software utilizes recurrent neural networks and transformer architectures to drive molecule generation. These generators are embedded within general machine learning optimization algorithms: transfer learning, reinforcement learning and curriculum learning. REINVENT 4 enables and facilitates de novo design, R-group replacement, library design, linker design, scaffold hopping and molecule optimization. This contribution gives an overview of the software and describes its design. Algorithms and their applications are discussed in detail. REINVENT 4 is a command line tool which reads a user configuration in either TOML or JSON format. The aim of this release is to provide reference implementations for some of the most common algorithms in AI-based molecule generation. An additional goal of the release is to create a framework for education and future innovation in AI-based molecular design. The software is available from https://github.com/MolecularAI/REINVENT4 and released under the permissive Apache 2.0 license. Scientific contribution. The software provides an open-source reference implementation for generative molecular design, and it is also being used in production to support in-house drug discovery projects. The publication of the most common machine learning algorithms in one code base, together with full documentation, will increase transparency of AI and foster innovation, collaboration and education.
ABSTRACT
The multi-step retrosynthesis problem can be solved by a search algorithm, such as Monte Carlo tree search (MCTS). The performance of multistep retrosynthesis, as measured by a trade-off in search time and route solvability, therefore depends on the hyperparameters of the search algorithm. In this paper, we demonstrated the effect of three MCTS hyperparameters (number of iterations, tree depth, and tree width) on metrics such as the linear integrated speed-accuracy score (LISAS) and the inverse efficiency score, which consider both route solvability and search time. This exploration was conducted by employing three data-driven approaches, namely a systematic grid search, Bayesian optimization over an ensemble of molecules to obtain static MCTS hyperparameters, and a machine learning approach to dynamically predict optimal MCTS hyperparameters given an input target molecule. With the obtained results on the internal dataset, we demonstrated that it is possible to identify a hyperparameter set which outperforms the current AiZynthFinder default setting. It appeared optimal across a variety of target input molecules, both on proprietary and public datasets. The settings identified with the in-house dataset reached a solvability of 93% and median search time of 151 s for the in-house dataset, and a 74% solvability and 114 s for the ChEMBL dataset. These numbers can be compared to the current default settings which solved 85% and 73% during a median time of 110 s and 84 s, for in-house and ChEMBL, respectively.
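The two combined metrics named above can be sketched as follows, assuming their usual definitions from the speed-accuracy literature: the inverse efficiency score divides mean time by the proportion of successes, and LISAS adds the error proportion to the mean time after rescaling it by the ratio of the standard deviations of time and error. Mapping "time" to search time and "error" to the unsolved fraction is an assumption made here for illustration; the function names and the three hyperparameter settings are made up and do not come from the study.

```python
from statistics import pstdev


def inverse_efficiency(mean_time, solved_fraction):
    """Inverse efficiency score: mean time divided by proportion solved (lower is better)."""
    return mean_time / solved_fraction


def lisas(mean_time, error_rate, sd_time, sd_error):
    """Linear integrated speed-accuracy score (lower is better).

    The error rate is rescaled to time units with the ratio of standard deviations.
    """
    return mean_time + (sd_time / sd_error) * error_rate


if __name__ == "__main__":
    # Hypothetical settings: (mean search time in s, fraction of targets unsolved).
    settings = {"A": (100.0, 0.30), "B": (120.0, 0.10), "C": (200.0, 0.05)}
    sd_time = pstdev(t for t, _ in settings.values())
    sd_error = pstdev(e for _, e in settings.values())
    for name, (t, e) in settings.items():
        print(name, round(inverse_efficiency(t, 1.0 - e), 2),
              round(lisas(t, e, sd_time, sd_error), 2))
```

In this toy comparison the intermediate setting B scores best on both metrics, because it neither leaves many targets unsolved nor pays a large time penalty.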
Subjects
Algorithms , Benchmarking , Bayes Theorem , Machine Learning , Monte Carlo Method
ABSTRACT
In this mini review, we capture the latest progress of applying artificial intelligence (AI) techniques based on deep learning architectures to molecular de novo design with a focus on integration with experimental validation. We will cover the progress and experimental validation of novel generative algorithms, the validation of QSAR models and how AI-based molecular de novo design is starting to become connected with chemistry automation. While progress has been made in the last few years, it is still early days. The experimental validations conducted thus far should be considered proof-of-principle, providing confidence that the field is moving in the right direction.
Subjects
Algorithms , Artificial Intelligence , Automation , Drug Design
ABSTRACT
Uncontrolled angiogenesis is a common denominator underlying many deadly and debilitating diseases such as myocardial infarction, chronic wounds, cancer, and age-related macular degeneration. As the current range of FDA-approved angiogenesis-based medicines are far from meeting clinical demands, the vast reserve of natural products from traditional Chinese medicine (TCM) offers an alternative source for developing pro-angiogenic or anti-angiogenic modulators. Here, we investigated 100 traditional Chinese medicine-derived individual metabolites which had reported gene expression in MCF7 cell lines in the Gene Expression Omnibus (GSE85871). We extracted literature angiogenic activities for 51 individual metabolites, and subsequently analysed their predicted targets and differentially expressed genes to understand their mechanisms of action. The angiogenesis phenotype was used to generate decision trees for rationalising the poly-pharmacology of known angiogenesis modulators such as ferulic acid and curculigoside and validated by an in vitro endothelial tube formation assay and a zebrafish model of angiogenesis. Moreover, using an in silico model we prospectively examined the angiogenesis-modulating activities of the remaining 49 individual metabolites. In vitro, tetrahydropalmatine and 1β-hydroxyalantolactone stimulated, while cinobufotalin and isoalantolactone inhibited endothelial tube formation. In vivo, ginsenosides Rb3 and Rc, 1β-hydroxyalantolactone and surprisingly cinobufotalin, restored angiogenesis against PTK787-induced impairment in zebrafish. In the absence of PTK787, deoxycholic acid and ursodeoxycholic acid did not affect angiogenesis.
Despite some limitations, these results suggest further refinements of in silico prediction combined with biological assessment will be a valuable platform for accelerating the research and development of natural products from traditional Chinese medicine and understanding their mechanisms of action, and also for other traditional medicines for the prevention and treatment of angiogenic diseases.
ABSTRACT
In image-based profiling, software extracts thousands of morphological features of cells from multi-channel fluorescence microscopy images, yielding single-cell profiles that can be used for basic research and drug discovery. Powerful applications have been proven, including clustering chemical and genetic perturbations on the basis of their similar morphological impact, identifying disease phenotypes by observing differences in profiles between healthy and diseased cells and predicting assay outcomes by using machine learning, among many others. Here, we provide an updated protocol for the most popular assay for image-based profiling, Cell Painting. Introduced in 2013, it uses six stains imaged in five channels and labels eight diverse components of the cell: DNA, cytoplasmic RNA, nucleoli, actin, Golgi apparatus, plasma membrane, endoplasmic reticulum and mitochondria. The original protocol was updated in 2016 on the basis of several years' experience running it at two sites, after optimizing it by visual stain quality. Here, we describe the work of the Joint Undertaking for Morphological Profiling Cell Painting Consortium, to improve upon the assay via quantitative optimization by measuring the assay's ability to detect morphological phenotypes and group similar perturbations together. The assay gives very robust outputs despite various changes to the protocol, and two vendors' dyes work equivalently well. We present Cell Painting version 3, in which some steps are simplified and several stain concentrations can be reduced, saving costs. Cell culture and image acquisition take 1-2 weeks for typically sized batches of ≤20 plates; feature extraction and data analysis take an additional 1-2 weeks. This protocol is an update to Nat. Protoc. 11, 1757-1774 (2016): https://doi.org/10.1038/nprot.2016.105.
Subjects
Cell Culture Techniques , Image Processing, Computer-Assisted , Image Processing, Computer-Assisted/methods , Microscopy, Fluorescence , Mitochondria , Software
ABSTRACT
PROteolysis TArgeting Chimeras (PROTACs) use the ubiquitin-proteasome system to degrade a protein of interest for therapeutic benefit. Advances made in targeted protein degradation technology have been remarkable, with several molecules having moved into clinical studies. However, robust routes to assess and better understand the safety risks of PROTACs need to be identified, which is an essential step toward delivering efficacious and safe compounds to patients. In this work, we used Cell Painting, an unbiased high-content imaging method, to identify phenotypic signatures of PROTACs. Chemical clustering and model prediction allowed the identification of a mitotoxicity signature that could not be expected by screening the individual PROTAC components. The data highlighted the benefit of unbiased phenotypic methods for identifying toxic signatures and the potential to impact drug design.
Subjects
High-Throughput Screening Assays , Proteolysis , Ubiquitin-Protein Ligases , Humans , Proteasome Endopeptidase Complex/metabolism , Ubiquitin-Protein Ligases/metabolism
ABSTRACT
Functional magnetic resonance imaging (fMRI) is an extensively used method for the investigation of normal and pathological brain function. In particular, fMRI has been used to characterize the spatiotemporal hemodynamic response to pharmacological challenges as a non-invasive readout of neuronal activity. However, the mechanisms underlying regional signal changes remain unclear. In this study, we use a meta-analytic approach to converge data from microdialysis experiments with relative cerebral blood volume (rCBV) changes following acute administration of neuropsychiatric drugs in adult male rats. At whole-brain level, the functional response patterns show very weak correlation with neurochemical alterations, while for numerous brain areas a strong positive correlation with noradrenaline release exists. At a local scale of individual brain regions, the rCBV response to neurotransmitters is anatomically heterogeneous and, importantly, based on a complex interplay of different neurotransmitters that often exert opposing effects, thus providing a mechanism for regulating and fine tuning hemodynamic responses in specific regions.
Subjects
Brain Chemistry/drug effects , Cerebrovascular Circulation/drug effects , Hemodynamics/drug effects , Psychotropic Drugs/pharmacology , Animals , Humans , Magnetic Resonance Imaging , Microdialysis
ABSTRACT
Machine learning and artificial intelligence are increasingly being applied to the drug-design process as a result of the development of novel algorithms, growing access, the falling cost of computation and the development of novel technologies for generating chemically and biologically relevant data. There has been recent progress in fields such as molecular de novo generation, synthetic route prediction and, to some extent, property predictions. Despite this, most research in these fields has focused on improving the accuracy of the technologies, rather than on quantifying the uncertainty in the predictions. Uncertainty quantification will become a key component in autonomous decision making and will be crucial for integrating machine learning and chemistry automation to create an autonomous design-make-test-analyse cycle. This review covers the empirical, frequentist and Bayesian approaches to uncertainty quantification, and outlines how they can be used for drug design. We also outline the impact of uncertainty quantification on decision making.
Subjects
Drug Design , Uncertainty , Algorithms , Artificial Intelligence , Automation , Bayes Theorem , Humans , Machine Learning
ABSTRACT
Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have such unavoidable errors influencing its performance, which should ideally be factored into modelling and output predictions, such as the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between the aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state-of-the-art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit in incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, when such information was not considered in any way in the original RF algorithm. For example, in cases when σ ranged between 0.4 and 0.6 log units and when ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence to belong to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident.
Finally, the PRF models trained with putative inactives performed worse than PRF models without putative inactives. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
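To illustrate why measurements near the classification threshold are the interesting case, the sketch below turns a measured activity value and its experimental standard deviation σ into a probability of being active, assuming normally distributed measurement error. The function name `prob_active`, the pXC50 threshold of 6.0, and the example values are assumptions for illustration; this is not the PRF implementation and may differ from how the study derived its ideal probability estimates.

```python
import math


def prob_active(pxc50, threshold, sigma):
    """Probability that the true activity exceeds the threshold.

    Assumes the measurement error is normal with standard deviation sigma
    (in log units), so the result is the normal CDF of the scaled margin.
    """
    z = (pxc50 - threshold) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    threshold = 6.0  # hypothetical activity cut-off in pXC50 units
    for value in (5.0, 5.9, 6.0, 6.1, 7.0):
        print(value, round(prob_active(value, threshold, sigma=0.5), 3))
```

A conventional classifier would label 5.9 as inactive and 6.1 as active with full certainty, whereas under this assumption both receive soft labels close to 0.5; values one log unit away from the threshold are nearly unaffected.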
ABSTRACT
Dichapetalum madagascariense Poir (Dichapetalaceae) is traditionally used to treat bacterial infections, jaundice, urethritis and viral hepatitis in Africa. Its root contains a broad spectrum of biologically active dichapetalins. To evaluate the plant's effect on human MCF-7 cells and its antibacterial and antiparasitic potential, we isolated and identified the known dichapetalins A and M from the roots. Both dichapetalins were tested on six bacterial strains (Shigella flexneri, Bacillus cereus, Salmonella paratyphi B, Listeria monocytogenes, Escherichia coli, Staphylococcus aureus) and two parasite strains (Trypanosoma brucei brucei and Leishmania donovani) using the Alamar Blue assay system. Dichapetalins A and M were more potent against B. cereus, with IC50 values of 11.15 and 3.15 µg/ml, respectively, than the positive control ampicillin (IC50 = 19.50 µg/ml). Dichapetalins A (IC50 = 74.22 µg/ml) and M (IC50 = 72.34 µg/ml) were less active against T. b. brucei than the standard suramin (IC50 = 4.96 µg/ml). Dichapetalin M showed moderate activity against L. donovani, with an IC50 of 16.80 µg/ml (amphotericin B: IC50 = 0.21 µg/ml). In human MCF-7 cells expressing the NR1I2 receptor, the activity of dichapetalin M (IC50 = 4.71 µM and 3.95 µM after 48 and 72 h of treatment, respectively) was higher than that of curcumin (IC50 = 17.49 µM and 12.53 µM after 48 and 72 h of treatment, respectively). Results from in vitro expression studies with qPCR confirmed an antagonistic effect of dichapetalin M on PXR (NR1I2) signaling, supporting the PXR signaling pathway as a possible mode of action of dichapetalin M as predicted by in silico results. These findings confirm previous studies indicating that D. madagascariense can be a source of potential lead compounds for development of novel antibiotic, antiparasitic and anticancer medicines, and provide further insights into the mechanism of action of the dichapetalins.
Subjects
Anti-Bacterial Agents , Plant Extracts/pharmacology , Africa , Anti-Bacterial Agents/pharmacology , Computer Simulation , Humans , Microbial Sensitivity Tests , Staphylococcus aureus
ABSTRACT
Despite the increasing knowledge in both the chemical and biological domains, the assimilation and exploration of heterogeneous datasets, encoding information about the chemical, bioactivity and phenotypic properties of compounds, remains a challenge due to the requirement for overlap between chemicals assayed across the spaces. Here, we have constructed a novel dataset, larger than we have used in prior work, comprising 579 acute oral toxic compounds and 1427 non-toxic compounds derived from regulatory GHS information, along with their corresponding molecular and protein target descriptors and qHTS in vitro assay readouts from the Tox21 project. We found no clear association between the results of a FAFDrugs4 toxicophore screen and the acute oral toxicity classifications for our compound set, and a screen using a subset of the ToxAlerts toxicophores was also of limited utility, with only slight enrichment toward the toxic set (odds ratio of 1.48). We then investigated to what degree toxic and non-toxic compounds could be separated in each of the spaces, to compare their potential contribution to further analyses. Using an LDA projection, we found the largest degree of separation using chemical descriptors (Cohen's d of 1.95) and the lowest degree of separation between toxicity classes using qHTS descriptors (Cohen's d of 0.67). To compare the predictivity of the feature spaces for the toxicity endpoint, we next trained Random Forest (RF) acute oral toxicity classifiers on either molecular, protein target or qHTS descriptors. RFs trained on molecular and protein target descriptors were most predictive, with ROC AUC values of 0.80-0.92 and 0.70-0.85, respectively, across three test sets. RFs trained on both chemical and protein target descriptors combined exhibited similar predictive performance to the single-domain models (ROC AUC of 0.80-0.91).
Model interpretability was improved by the inclusion of protein target descriptors, which allow the identification of specific targets (e.g. Retinal dehydrogenase) with literature links to toxic modes of action (e.g. oxidative stress). The dataset compiled in this study has been made available for future application.
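The two effect-size statistics quoted above can be sketched as follows: an odds ratio from a 2x2 table of toxicophore hits versus toxicity class, and Cohen's d with a pooled standard deviation for the separation of two groups along a one-dimensional projection. The function names and all numbers in the example are made up for illustration and are not the study's data.

```python
import math
from statistics import mean, variance


def odds_ratio(hit_toxic, miss_toxic, hit_nontoxic, miss_nontoxic):
    """Odds of an alert hit among toxic compounds divided by the odds among non-toxic ones."""
    return (hit_toxic * miss_nontoxic) / (miss_toxic * hit_nontoxic)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical counts: 20 of 100 toxic and 10 of 100 non-toxic compounds hit an alert.
    print(odds_ratio(20, 80, 10, 90))
    # Hypothetical one-dimensional projection values for two classes.
    print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```

An odds ratio near 1 indicates that the alert barely discriminates between the classes, which is why the reported value of 1.48 is described as only slight enrichment; a larger Cohen's d indicates better-separated class distributions.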
ABSTRACT
In silico protein target deconvolution is frequently used for mechanism-of-action investigations; however, existing protocols usually do not predict compound functional effects, such as activation or inhibition, upon binding to their protein counterparts. This study is hence concerned with including functional effects in target prediction. To this end, we assimilated a bioactivity training set for 332 targets, comprising 817,239 active data points with unknown functional effect (binding data) and 20,761,260 inactive compounds, along with 226,045 activating and 1,032,439 inhibiting data points from functional screens. Chemical space analysis of the data first showed some separation between compound sets (binding and inhibiting compounds were more similar to each other than both binding and activating or activating and inhibiting compounds), providing a rationale for implementing functional prediction models. We employed three different architectures to predict functional response, ranging from simplistic random forest models ('Arch1') to cascaded models which use separate binding and functional effect classification steps ('Arch2' and 'Arch3'), differing in the way training sets were generated. Fivefold stratified cross-validation showed that cascading predictions provide superior precision and recall based on an internal test set. We next prospectively validated the architectures using a temporal set of 153,467 in-house data points (after a 4-month interim from initial data extraction). The results showed that Arch3 achieved the highest target class averaged precision and recall scores of 71% and 53%, which we attribute to the use of inactive background sets. Distance-based applicability domain (AD) analysis showed that Arch3 provides superior extrapolation into novel areas of chemical space, and thus, based on the results presented here, we propose it as the most suitable architecture for the functional effect prediction of small molecules.
We finally conclude that including functional effects could provide vital insight in future studies, to annotate cases of unanticipated functional changeover, as outlined by our CHRM1 case study.
ABSTRACT
Neuropsychiatric disorders are the third leading cause of global disease burden. Current pharmacological treatment for these disorders is inadequate, with often insufficient efficacy and undesirable side effects. One reason for this is that the links between molecular drug action and neurobehavioral drug effects are elusive. We use a big data approach from the neurotransmitter response patterns of 258 different neuropsychiatric drugs in rats to address this question. Data from experiments comprising 110,674 rats are presented in the Syphad database (www.syphad.org). Chemoinformatics analyses of the neurotransmitter responses suggest a mismatch between the current classification of neuropsychiatric drugs and spatiotemporal neurotransmitter response patterns at the systems level. In contrast, predicted drug-target interactions more appropriately reflect brain-region-related neurotransmitter response. In conclusion, the neurobiological mechanisms of neuropsychiatric drugs are not well reflected by their current classification or their chemical similarity, but can be better captured by molecular drug-target interactions.