ABSTRACT
G protein-coupled receptors (GPCRs) constitute a significant group of membrane-bound receptors and represent more than 30% of therapeutic targets. Fluorine is commonly used in designing highly active biological compounds, as evidenced by the steadily increasing number of fluorinated drugs approved by the Food and Drug Administration (FDA). Herein, we identified and analyzed 898 target-based F-containing isomeric analog sets for SAR analysis (FiSAR sets) in the ChEMBL database, active against 33 different aminergic GPCRs and comprising a total of 2163 fluorinated (1201 unique) compounds. We found that 30 FiSAR sets contain activity cliffs (ACs), defined as pairs of structurally similar compounds showing significant differences in affinity (≥50-fold change), where a change of fluorine position may lead to up to a 1300-fold change in potency. The analysis of matched molecular pair (MMP) networks indicated that the fluorination of aromatic rings showed no clear trend toward a positive or negative effect on affinity. Additionally, we propose an in silico workflow (including induced-fit docking, molecular dynamics, quantum polarized ligand docking, and binding free energy calculations based on the Generalized-Born Surface-Area (GBSA) model) to score the fluorine positions in the molecule.
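The activity-cliff criterion described above (a ≥50-fold affinity difference between structurally similar compounds) can be illustrated with a minimal Python sketch. The compound labels, the affinity values, and the function name find_activity_cliffs are invented for illustration; this is not the authors' pipeline, only one plausible way to apply the stated threshold within a single isomer set.

```python
from itertools import combinations


def find_activity_cliffs(affinities_nm, fold_threshold=50.0):
    """Return (label_a, label_b, fold) for pairs differing by >= fold_threshold.

    affinities_nm maps a compound label to an affinity value (e.g. Ki in nM).
    All compounds are assumed to belong to one set of positional isomers.
    """
    cliffs = []
    # Sort so that the pair order is deterministic.
    for (a, ka), (b, kb) in combinations(sorted(affinities_nm.items()), 2):
        fold = max(ka, kb) / min(ka, kb)
        if fold >= fold_threshold:
            cliffs.append((a, b, fold))
    return cliffs


# Hypothetical fluorine positional isomers of one scaffold (made-up values).
fisar_set = {"ortho-F": 2.0, "meta-F": 150.0, "para-F": 40.0}
cliffs = find_activity_cliffs(fisar_set)
print(cliffs)
```

Only the ortho/meta pair in this toy set crosses the 50-fold threshold; the other two pairs differ by less.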
Subjects
Fluorine , Molecular Dynamics Simulation , Fluorine/chemistry , Protein Binding , G-Protein-Coupled Receptors/chemistry , Isomerism , Ligands , Molecular Docking Simulation
ABSTRACT
Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains >155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and Accuracy (Acc) = 73%. The same model also presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k nearest neighbors (KNN) showed the best results with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and Area Under Receiver Operating Characteristic (AUROC) = 0.998 in training sets. In the validation series, the Random Forest had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models regarding the ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.
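For readers unfamiliar with the statistics quoted above, the sketch below shows how sensitivity (Sn), specificity (Sp), and accuracy (Acc) follow from confusion-matrix counts. The counts are hypothetical and chosen only so that the ratios match the reported LDA percentages for a balanced set; they are not the study's actual numbers.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity and accuracy from confusion counts."""
    sn = tp / (tp + fn)                    # fraction of actives recovered
    sp = tn / (tn + fp)                    # fraction of inactives recovered
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall fraction correct
    return sn, sp, acc


# Hypothetical balanced example: 100 active and 100 inactive cases.
sn, sp, acc = classification_metrics(tp=70, fn=30, tn=76, fp=24)
print(f"Sn={sn:.2f} Sp={sp:.2f} Acc={acc:.2f}")
```

With equal class sizes, accuracy is simply the mean of Sn and Sp, which is why 70% and 76% give 73%.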
Subjects
Anti-Bacterial Agents , Machine Learning , Algorithms , Anti-Bacterial Agents/pharmacology , Factual Databases , Metabolic Networks and Pathways
ABSTRACT
In drug discovery, partition and distribution coefficients, logP and logD for octanol/water, are widely used as metrics of the lipophilicity of molecules, which in turn has a strong influence on the bioactivity and bioavailability of potential drugs. There are a variety of established methods, mostly fragment- or atom-based, to calculate logP, while logD prediction generally relies on calculated logP and pKa for the estimation of neutral and ionized populations at a given pH. Algorithms such as ClogP have limitations that generally lead to systematic errors for chemically related molecules, while pKa estimation is generally more difficult due to the interplay of electronic, inductive and conjugation effects for ionizable moieties. We propose an integrated machine learning QSAR modeling approach to predict logD by training the model with experimental data while using ClogP and pKa predicted by commercial software as model descriptors. By optimizing the loss function for the ClogD calculated by the software, we build a correction model that incorporates both descriptors from the software and available experimental logD data. Additionally, we calculate logP from the logD model using the software-predicted pKa values. Here, we have trained models using publicly or commercially available logD data to show that this approach can improve on commercial software predictions of lipophilicity. When applied to other logD data sets, this approach extends the domain of applicability of logD and logP predictions over commercial software. The performance of these models compares favorably with that of models built with a larger set of proprietary logD data.
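The relationship between logD, logP and pKa mentioned above can be sketched with the standard textbook expression for a monoprotic compound, assuming that only the neutral species partitions into octanol. This is an illustration of the conventional relationship, not the machine learning correction model proposed in the abstract; the function names and example values are hypothetical.

```python
import math


def log_d(log_p, pka, ph=7.4, acid=True):
    """logD of a monoprotic acid or base, assuming only the neutral form partitions."""
    # For an acid the ionized fraction grows with pH - pKa; for a base with pKa - pH.
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)


def log_p_from_log_d(log_d_value, pka, ph=7.4, acid=True):
    """Invert the same relationship to recover logP from logD and pKa."""
    exponent = (ph - pka) if acid else (pka - ph)
    return log_d_value + math.log10(1.0 + 10.0 ** exponent)


# Hypothetical carboxylic acid: logP 3.0, pKa 4.4, evaluated at pH 7.4.
d = log_d(3.0, 4.4)
print(round(d, 3), round(log_p_from_log_d(d, 4.4), 3))
```

Because the inverse uses the same pKa, any error in the predicted pKa propagates directly into the back-calculated logP.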
Subjects
Software , Water , Algorithms , Machine Learning , Octanols/chemistry , Water/chemistry
ABSTRACT
Natural products, such as humic substances (HS) and shilajit, are known to possess antiviral activity. Humic-like components are often described as the carriers of the biological activity of shilajit. The goal of this study was to evaluate the anti-HIV activity of well-characterized HS isolated from coal, peat, and peloids, and to compare it to that of water-soluble organic matter (OM) isolated from different samples of shilajit. The set of humic materials included 16 samples of different fractional composition: humic acid (HA), hymatomelanic acid (HMA), and fulvic acid (FA). The set of shilajit OM included 19 samples of different geographic origin and level of alteration. The HIV-1 p24 antigen assay and a cell viability test were used to assess antiviral activity. The HIV-1 Bru strain was used to infect CEM-SS cells. The obtained EC50 values varied from 0.37 to 1.4 mg L⁻¹ for the humic materials, and from 14 to 142 mg L⁻¹ for the shilajit OM. Hence, all humic materials used in this study substantially outperformed the shilajit materials with respect to anti-HIV activity. For the humic materials, the structure-activity relationships revealed a strong correlation between the EC50 values and the content of aromatic carbon, indicating the most important role of aromatic structures. For shilajit OM, the reverse relationship was obtained, indicating a different mechanism of shilajit activity. The FTICR MS molecular assignments were used for ChEMBL data mining in search of the active humic molecules. Aromatic structures with alkyl substituents, terpenoids, N-containing analogs of typical flavonoids, and aza-podophyllotoxins were identified as potential carriers of antiviral activity. It was concluded that typical humic materials and shilajit differ greatly in molecular composition, and that the humic materials have substantial advantages over shilajit as a natural source of antiviral agents.
Subjects
HIV-1 , Humic Substances , Antiviral Agents/pharmacology , Benzopyrans/pharmacology , Humic Substances/analysis , Minerals , Plant Resins , Soil
ABSTRACT
Parasite species of the genus Plasmodium cause malaria, which remains a major global health problem due to parasite resistance to available antimalarial drugs and increasing treatment costs. Consequently, computational prediction of new antimalarial compounds with novel targets in the proteome of Plasmodium sp. is a very important goal for the pharmaceutical industry. We can expect that the success of a preclinical assay depends on the conditions of the assay per se, the chemical structure of the drug, and the structure of the target protein, as well as on factors governing the expression of this protein in the proteome, such as gene (deoxyribonucleic acid, DNA) sequence and/or chromosome structure. However, there are no reports of computational models that consider all these factors simultaneously. Some of the difficulties for this kind of analysis are the dispersion of data across different datasets, the high heterogeneity of data, etc. In this work, we analyzed three databases, ChEMBL (Chemical database of the European Molecular Biology Laboratory), UniProt (Universal Protein Resource), and NCBI-GDV (National Center for Biotechnology Information-Genome Data Viewer), to achieve this goal. The ChEMBL dataset contains outcomes for 17,758 unique assays of potential antimalarial compounds, including numeric descriptors (variables) for the structure of compounds as well as a large amount of information about the conditions of assays. The NCBI-GDV and UniProt datasets include the sequences of genes and proteins and their functions. In addition, we also created two partitions (cassayj = caj and cdataj = cdj) of categorical variables from the ChEMBL dataset. These partitions contain variables that encode information about experimental conditions of preclinical assays (caj) or about the nature and quality of data (cdj).
These categorical variables include information about 22 parameters of biological activity (ca0), 28 target proteins (ca1), and 9 organisms of assay (ca2), etc. We also created another partition (cprotj = cpj) including categorical variables with biological information about the target proteins, genes, and chromosomes. These variables cover 32 genes (cp0), 10 chromosomes (cp1), gene orientation (cp2), and 31 protein functions (cp3). We used a Perturbation-Theory Machine Learning Information Fusion (IFPTML) algorithm to fuse all this information (from the three databases) and train a predictive model. Shannon's entropy measure Shk (numerical variables) was used to quantify the information about the structure of drugs, protein sequences, gene sequences, and chromosomes on the same information scale. Perturbation Theory Operators (PTOs) in the form of Moving Average (MA) operators were used to quantify perturbations (deviations) in the structural variables with respect to their expected values for different subsets (partitions) of categorical variables. We obtained three IFPTML models using General Discriminant Analysis (GDA), Classification Tree with Univariate Splits (CTUS), and Classification Tree with Linear Combinations (CTLC). The IFPTML-CTLC model presented the best performance, with Sensitivity Sn(%) = 83.6/85.1 and Specificity Sp(%) = 89.8/89.7 for training/validation sets, respectively. This model could become a useful tool for the optimization of preclinical assays of new antimalarial compounds vs. different proteins in the proteome of Plasmodium.
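The two ingredients named above, a Shannon entropy descriptor and a moving-average perturbation operator, can be sketched in a few lines of Python. The abstract does not give the exact operator definitions, so this sketch assumes the simplest form: the deviation of each descriptor value from the mean of all cases sharing the same categorical label. Function names and data are hypothetical.

```python
import math
from collections import defaultdict


def shannon_entropy(probabilities):
    """Shannon entropy (natural log) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def moving_average_deviations(records):
    """For (category, value) records, return value minus the category mean.

    This is an assumed, simplified form of a moving-average PT operator:
    the deviation of a descriptor from its expected value within one
    subset (partition) of a categorical variable.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in records:
        sums[category] += value
        counts[category] += 1
    return [value - sums[category] / counts[category] for category, value in records]


# Hypothetical descriptor values grouped by activity parameter (ca0).
records = [("IC50", 1.0), ("IC50", 3.0), ("Ki", 2.0)]
print(moving_average_deviations(records))
print(shannon_entropy([0.5, 0.5]))
```

The deviations, rather than the raw descriptors, would then serve as model inputs, so the same compound can receive different inputs under different assay conditions.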
Subjects
Antimalarials/pharmacology , Drug Discovery/methods , Machine Learning , Plasmodium falciparum/genetics , Algorithms , Antimalarials/chemistry , Pharmaceutical Databases , Preclinical Drug Evaluation , Protozoan Genome , Markov Chains , Theoretical Models , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Reproducibility of Results
ABSTRACT
The theoretical prediction of drug-decorated nanoparticles (DDNPs) has become a very important task in medical applications. In this paper, Perturbation Theory Machine Learning (PTML) models were built to predict the probability of different pairs of drugs and nanoparticles creating DDNP complexes with anti-glioblastoma activity. PTML models use as inputs the perturbations of molecular descriptors of drugs and nanoparticles under the experimental conditions. The raw dataset was obtained by mixing the nanoparticle experimental data with drug assays from the ChEMBL database. Ten types of machine learning methods were tested. Only 41 features were selected for 855,129 drug-nanoparticle complexes. The best model was obtained with the Bagging classifier, an ensemble meta-estimator based on 20 decision trees, with an area under the receiver operating characteristic curve (AUROC) of 0.96 and an accuracy of 87% (test subset). This model could be useful for the virtual screening of nanoparticle-drug complexes in glioblastoma. All the calculations can be reproduced with the datasets and Python scripts, which are freely available as a GitHub repository from the authors.
Subjects
Antineoplastic Agents/administration & dosage , Brain Neoplasms/drug therapy , Drug Delivery Systems , Glioblastoma/drug therapy , Machine Learning , Nanoparticles , Chemical Databases , Pharmaceutical Databases , Drug Carriers/administration & dosage , Drug Design , Antitumor Drug Screening Assays , Humans , Nanoparticles/administration & dosage , User-Computer Interface
ABSTRACT
A method is presented to quantitatively analyze the degree of congenericity of claimed compounds in patent applications. The approach successfully differentiates patents exemplified with highly congeneric compounds of a structurally compact and well-defined chemical series from patents containing a more diverse set of compounds around a more vaguely described patent claim. An application to 750 common patents available in SureChEMBL, SureChEMBLccs and ChEMBL is presented, and the congenericity of patent compounds in those different sources is discussed.
ABSTRACT
Nanosystems are gaining momentum in pharmaceutical sciences because of the wide variety of possibilities for designing these systems to have specific functions. Specifically, studies of new cancer cotherapy drug-vitamin release nanosystems (DVRNs) including anticancer compounds and vitamins or vitamin derivatives have revealed encouraging results. However, the number of possible combinations of design and synthesis conditions is remarkably high. In addition, a large number of anticancer compounds and vitamin derivatives have already been assayed, but notably fewer DVRNs have been assayed as a whole (with the anticancer compound and the vitamin linked to them). Our approach combines perturbation theory and machine learning (PTML) to predict the probability of obtaining an interesting DVRN by changing the anticancer compound and/or the vitamin present in an already tested DVRN for other anticancer compounds or vitamins that have not yet been tested as part of a DVRN. In a previous work, we built a linear PTML model useful for the design of these nanosystems. In doing so, we used information fusion (IF) techniques to carry out data enrichment of DVRN data compiled from the literature with the data for preclinical assays of vitamins from the ChEMBL database. The design features of DVRNs and the assay conditions of nanoparticles (NPs) and vitamins were included as multiplicative PT operators (PTOs) in the system, which indicates the importance of these variables. However, the previous work omitted experiments with nonlinear ML techniques and different types of PTOs such as metric-based PTOs. More importantly, the previous work did not consider the structure of the anticancer drug to be included in the new DVRNs. In this work, we accomplish three main objectives (tasks). In the first task, we found a new model, alternative to the one published before, for the rational design of DVRNs using metric-based PTOs.
The most accurate PTML model was the artificial neural network model, which showed values of specificity, sensitivity, and accuracy in the range of 90-95% in training and external validation series for more than 130,000 cases (DVRNs vs ChEMBL assays). Furthermore, in the second task, we used IF techniques to carry out data enrichment of our previous data set. In doing so, we constructed a new working data set of >970,000 cases with the data of preclinical assays of DVRNs, vitamins, and anticancer compounds from the ChEMBL database. All these assays have multiple continuous variables or descriptors dk and categorical variables cj (conditions of the assays) for drugs (dack, cacj), vitamins (dvk, cvj), and NPs (dnk, cnj). These data include >20,000 potential anticancer compounds with >270 protein targets (cac1), >580 assay cell organisms (cac2), and so forth. Furthermore, we include >36,000 assay vitamin derivatives in >6200 types of cells (c2vit), >120 assay organisms (c3vit), >60 assay strains (c4vit), and so forth. The enriched data set also contains >20 types of DVRNs (c5n) with 9 NP core materials (c4n), 8 synthesis methods (c7n), and so forth. We expressed all this information with PTOs and developed a qualitatively new PTML model that incorporates information on the anticancer drugs. This new model presents an accuracy of 96-97% for training and external validation subsets. In the last task, we carried out a comparative study of published ML and/or PTML models and described how the models we present cover the knowledge gap in terms of drug delivery. In conclusion, we present here for the first time a multipurpose PTML model that is able to select NPs, anticancer compounds, and vitamins and their conditions of assay for DVRN design.
Subjects
Antineoplastic Agents/administration & dosage , Antineoplastic Combined Chemotherapy Protocols/administration & dosage , Drug Delivery Systems/methods , Nanoparticles/chemistry , Neoplasms/drug therapy , Vitamins/administration & dosage , Big Data , Computer Simulation , Factual Databases , Drug Liberation , Linear Models , Machine Learning
ABSTRACT
A great variety of computational approaches support drug design processes, helping in the selection of new potentially active compounds and the optimization of their physicochemical and ADMET properties. Machine learning is a group of methods that can evaluate enormous amounts of data in a relatively short time. However, the quality of machine-learning-based prediction depends on the data supplied for model training. In this study, we used deep neural networks for the task of compound activity prediction and developed dropout-based approaches for estimating prediction uncertainty. Several types of analyses were performed: the relationships between the prediction error, similarity to the training set, prediction uncertainty, and the number and standard deviation of activity values were examined. We tested whether incorporating information about prediction uncertainty influences compound ranking based on predicted activity, and prediction uncertainty was also used to search for potential errors in the ChEMBL database. The results indicate that incorporating information about the uncertainty of compound activity prediction can be of great help during virtual screening experiments.
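A common dropout-based way to estimate prediction uncertainty is Monte Carlo dropout: dropout stays active at inference time, the same input is passed through the model many times, and the spread of the outputs serves as the uncertainty. The abstract does not specify the authors' exact scheme, so the sketch below is only a generic illustration on a single linear layer with assumed, pre-trained weights; all names and numbers are hypothetical.

```python
import random
import statistics


def mc_dropout_predict(features, weights, bias, keep_prob=0.8, passes=200, seed=0):
    """Return (mean, standard deviation) over stochastic forward passes.

    Each pass randomly drops input units with probability 1 - keep_prob and
    rescales the kept ones by 1 / keep_prob (inverted dropout).
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(passes):
        total = bias
        for x, w in zip(features, weights):
            if rng.random() < keep_prob:
                total += w * x / keep_prob
        outputs.append(total)
    return statistics.mean(outputs), statistics.stdev(outputs)


# Hypothetical descriptor vector and weights for one compound.
mean_pred, uncertainty = mc_dropout_predict([1.0, 0.5, 2.0], [0.4, -1.2, 0.9], bias=5.0)
print(mean_pred, uncertainty)
```

Compounds with a large standard deviation could then be down-weighted in a ranking or flagged for manual inspection of their underlying activity records.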
Subjects
Chemical Databases , Deep Learning , Drug Design , Drug Discovery , Chemical Models
ABSTRACT
Retroviral infections, such as HIV, remain diseases with no cure. Defining the target proteins of new antiretroviral compounds is a major goal for medicine and pharmaceutical chemistry. ChEMBL handles Big Data in the form of a complex data set that is hard to organize. The large number of characteristics that must be described in order to predict new drug candidates for retroviral infections makes this information difficult to analyze. For this reason, we propose a new predictive model combining perturbation theory (PT) foundations and machine learning (ML) modeling to create a tool that can take advantage of all the available information. The PTML model proposed in this work for the ChEMBL data set of preclinical experimental assays for antiretroviral compounds consists of a linear equation with four variables. The PT operators used are founded on multicondition moving averages, combining different features and simplifying the management of all data. More than 140 000 preclinical assays for 56 105 compounds with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations with 359 biological activity parameters (c0), 55 protein accessions (c1), 83 cell lines (c2), 64 organisms of assay (c3), and 773 subtypes or strains. We have included 150 148 preclinical experimental assays for HIV, 1188 for HTLV, 84 for simian immunodeficiency virus, 370 for murine leukemia virus, 119 for Rous sarcoma virus, 1581 for MMTV, etc. We also included 5277 assays for hepatitis B virus. The developed PTML model reached considerable values of sensitivity (73.05% for training and 73.10% for validation), specificity (86.61% for training and 87.17% for validation), and accuracy (75.84% for training and 75.98% for validation). We also compared alternative PTML models with different PT operators such as covariance, moments, and exponential terms.
Finally, we compared our PTML model with ML models from the literature and with nonlinear artificial neural network (ANN) models. We conclude that this PTML model is the first to consider multiple characteristics of preclinical experimental antiretroviral assays in combination, generating a simple, useful, and adaptable instrument that could reduce time and costs in antiretroviral drug research.
Subjects
Anti-Retroviral Agents/chemistry , Pharmaceutical Chemistry/methods , Computer Simulation , Data Mining/methods , Factual Databases , Machine Learning , Theoretical Models , Humans , Neural Networks (Computer)
ABSTRACT
The previously reported procedure to generate "universal" Generative Topographic Maps (GTMs) of the drug-like chemical space is in practice a multi-task learning process, in which both operational GTM parameters (for example, map grid size) and hyperparameters (key example: the molecular descriptor space to be used) are chosen by an evolutionary process in order to fit/select "universal" GTM manifolds. After selection (a one-time task aimed at optimizing the compromise in terms of neighborhood behavior compliance over a large pool of various biological targets), the manifolds are ready to provide "fit-free" predictive models for any further use. Using any structure-activity set, irrespective of whether the associated target served at the map fitting stage or not, the generation (or "coloring") of a property landscape enables predicting the property for any external molecule, with zero additional fittable parameters involved. While previous works have signaled the excellent behavior of such models in aggressive three-fold cross-validation assessments of their predictive power, the present work explores their behavior in Virtual Screening (VS), here simulated using external DUD ligand and decoy series that are fully disjoint from the ChEMBL-extracted landscape coloring sets. Beyond the rather robust results of the universal GTM manifolds in this challenge, it could be shown that the descriptor spaces selected by the evolutionary multi-task learner were intrinsically able to serve as an excellent support for many other VS procedures, ranging from parameter-free similarity searching, to local (target-specific) GTM models, to parameter-rich, nonlinear Random Forest and Neural Network approaches.
Subjects
Molecular Models , Proteins/chemistry , Protein Databases , Ligands , Neural Networks (Computer) , Protein Binding , Protein Conformation , Structure-Activity Relationship
ABSTRACT
Metabolic stability is an important parameter to be optimized during the complex process of designing new active compounds. Tuning this parameter while maintaining a compound's desired activity is not an easy task due to the extreme complexity of metabolic pathways in living organisms. In this study, a platform for in silico qualitative evaluation of metabolic stability, expressed as half-lifetime and clearance, was developed. The platform is based on machine learning methods, and separate models for human, rat and mouse data were constructed. The compounds' evaluation is qualitative and two types of experiments can be performed: regression, in which the compound is assigned to one of the metabolic stability classes (low, medium, high) on the basis of the numerical value of the predicted half-lifetime, and classification, in which the molecule is directly assessed as having low, medium or high stability. The results show that the models have good predictive power, with accuracy values over 0.7 in all cases for the Sequential Minimal Optimization (SMO), k-nearest neighbor (IBk) and Random Forest algorithms. Additionally, for each of the analyzed compounds, the 10 most similar structures from the training set (in terms of Tanimoto similarity) are identified and made available for download as separate files for more detailed manual inspection. The predictive power of the models was compared against an external dataset containing metabolic stability assessments from the GUSAR software, leading to good consistency of results for SMOreg and Naïve Bayes (~0.8 on average). The tool is available online.
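The retrieval of the most similar training structures by Tanimoto similarity, mentioned above, can be sketched as follows. Fingerprints are represented here simply as sets of "on" bit indices; the fingerprint type used by the platform is not stated in the abstract, and the names and data below are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, training_fps, k=10):
    """Return the labels of the k training structures most similar to the query."""
    ranked = sorted(
        training_fps.items(),
        key=lambda item: tanimoto(query_fp, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy fingerprints (invented bit indices).
training = {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {7, 8}}
query = {1, 2, 3, 4}
print(most_similar(query, training, k=2))
```

In practice the fingerprints would come from a cheminformatics toolkit; the ranking logic stays the same.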
Subjects
Computer Simulation , Machine Learning , Software , Algorithms , Animals , Bayes Theorem , Factual Databases , Humans
ABSTRACT
BACKGROUND: Aiming to understand cellular responses to different perturbations, the NIH Common Fund Library of Integrated Network-based Cellular Signatures (LINCS) program involves many institutes and laboratories working on over a thousand cell lines. The community-based Cell Line Ontology (CLO) was selected as the default ontology for LINCS cell line representation and integration. RESULTS: CLO has consistently represented all 1097 LINCS cell lines and included information extracted from the LINCS Data Portal and ChEMBL. Using MCF 10A cell line cells as an example, we demonstrated how to ontologically model LINCS cellular signatures such as their non-tumorigenic epithelial cell type, three-dimensional growth, latrunculin-A-induced actin depolymerization and apoptosis, and cell line transfection. A CLO subset view of LINCS cell lines, named LINCS-CLOview, was generated to support systematic LINCS cell line analysis and queries. In summary, LINCS cell lines are currently associated with 43 cell types, 131 tissues and organs, and 121 cancer types. The LINCS-CLOview information can be queried using SPARQL scripts. CONCLUSIONS: CLO was used to support ontological representation, integration, and analysis of over a thousand LINCS cell line cells and their cellular responses.
Subjects
Breast/metabolism , Computational Biology/methods , Gene Expression Regulation , High-Throughput Screening Assays , Neoplasms/genetics , Apoptosis/drug effects , Breast/cytology , Breast/drug effects , Cell Line , Cultured Cells , Female , Gene Expression Profiling , Humans , Macrolides/pharmacology , Neoplasms/drug therapy , Neoplasms/pathology , Thiazolidines/pharmacology
ABSTRACT
Exponential growth in the number of compounds with experimentally verified activity towards particular targets has led to the emergence of various databases gathering data on biological activity. In this study, the ligands of family A G Protein-Coupled Receptors collected in the ChEMBL database were examined, with special attention given to serotonin receptors. Sets of compounds were examined in terms of their appearance over time, mapped to the chemical space of drugs deposited in DrugBank, and the emergence of structurally new clusters of compounds was indicated. In addition, a tool for detailed analysis of the obtained visualizations was prepared and made available online at http://chem.gmum.net/vischem; it enables the investigation of chemical structures with reference to particular data points depicted in the figures and to changes in compound datasets over time.
Subjects
Ligands , G-Protein-Coupled Receptors/metabolism , Chemical Databases , Internet , Protein Binding , G-Protein-Coupled Receptors/chemistry , Serotonin Receptors/chemistry , Serotonin Receptors/metabolism , User-Computer Interface
ABSTRACT
Molecular docking, 3D-QSAR CoMSIA and similarity search were combined in a multi-step framework with the ultimate goal to identify potent indole analogs, in the ChEMBL database, as inhibitors of HCV replication. The crystal structure of HCV RNA-dependent RNA polymerase (NS5B GT1b) was utilized and 41 known inhibitors were docked into the enzyme "Palm II" active site. In a second step, the docking pose of each compound was used in a receptor-based alignment for the generation of the CoMSIA fields. A validated 3D-QSAR CoMSIA model was subsequently built to accurately estimate the activity values. The proposed framework gives insight into the structural characteristics that affect the binding and the inhibitory activity of these analogs on HCV polymerase. The obtained in silico model was used to predict the activity of novel compounds prior to their synthesis and biological testing, within a Virtual Screening framework. The ChEMBL database was mined to afford compounds containing the indole scaffold that are predicted to possess high activity and thus can be prioritized for biological screening.
Subjects
Chemical Databases , Drug Discovery/methods , Hepacivirus/drug effects , Virus Replication/drug effects , Hepacivirus/physiology , Indoles/chemistry , Indoles/pharmacology , Molecular Docking Simulation , Quantitative Structure-Activity Relationship , RNA-Dependent RNA Polymerase/antagonists & inhibitors , RNA-Dependent RNA Polymerase/chemistry
ABSTRACT
During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it is feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed a simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed a hybrid selection strategy that unifies both exploration and exploitation AL strategies in a single acquisition function, defined by parameters n and c. We have also shown that the popular minimal margin and maximal variance selection approaches for exploration correspond to minimization of the hybrid acquisition function with n=1 and n=2, respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies did not succeed at ensuring high model performance; however, they were successful in selecting molecules with desired properties using a minimal number of tests.
In analogous experiments in classification tasks, the exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) identified molecules in fewer iterations than the random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
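The abstract does not state the hybrid acquisition function itself, so the sketch below covers only the exploration part and rests on an assumption: for a binary classifier with predicted probability p of the active class, the exploration score is taken as |p − 0.5|^n. With n=1 this is the margin to the decision boundary, and with n=2 it equals 0.25 − p(1 − p), so minimizing it is the same as maximizing the Bernoulli variance. The contribution coefficient c is deliberately not modeled, and all names and values are hypothetical.

```python
def exploration_score(p_active, n):
    """Assumed exploration score: distance from the decision boundary, to the power n."""
    return abs(p_active - 0.5) ** n


def pick_next(probabilities, n=1):
    """Pick the unlabeled molecule with the smallest exploration score."""
    return min(probabilities, key=lambda name: exploration_score(probabilities[name], n))


# Hypothetical predicted probabilities for an unlabeled pool.
pool = {"mol_a": 0.95, "mol_b": 0.55, "mol_c": 0.10}
print(pick_next(pool, n=1), pick_next(pool, n=2))
```

Taken alone, the two exponents rank a pool identically, because one score is a monotone function of the other; they can only lead to different selections once an exploitation term is mixed in, which is presumably where the reported c thresholds for n=1 and n=2 diverge.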
ABSTRACT
Membrane permeability is an in vitro parameter that represents the apparent permeability (Papp) of a compound, and is a key absorption, distribution, metabolism, and excretion parameter in drug development. Although Caco-2 cells are the most widely used cell line for measuring Papp, other cell lines, such as the Madin-Darby Canine Kidney (MDCK), LLC-Pig Kidney 1 (LLC-PK1), and Ralph Russ Canine Kidney (RRCK) cell lines, can also be used to estimate Papp. Constructing in silico models for Papp estimation using the MDCK, LLC-PK1, and RRCK cell lines therefore requires collecting extensive amounts of in vitro Papp data. Open databases offer extensive measurements of various compounds covering a vast chemical space; however, concerns have been reported about using data published in open databases without appropriate accuracy and quality checks. Ensuring the quality of datasets for training in silico models is critical because artificial intelligence (AI, including deep learning) is used to develop models that predict various pharmacokinetic properties, and data quality affects the performance of these models. Hence, careful curation of the collected data is imperative. Herein, we developed a new workflow, built in KNIME, that supports automatic curation of Papp data measured in the MDCK, LLC-PK1, and RRCK cell lines and collected from ChEMBL. The workflow consisted of four main phases. Data were extracted from ChEMBL and filtered to identify the target protocols. A total of 1661 high-quality entries were retained after checking 436 articles. The workflow is freely available, can be updated, and has high reusability. Our study provides a novel approach for data quality analysis and accelerates the development of helpful in silico models for effective drug discovery. Scientific Contribution: The cost of building highly accurate predictive models can be significantly reduced by automating the collection of reliable measurement data. Our tool reduces the time and effort required for data collection and will enable researchers to focus on constructing high-performance in silico models for other types of analysis. To the best of our knowledge, no such tool is available in the literature.
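The filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' KNIME workflow: the record fields (`compound`, `assay_description`, `value`, `units`), the unit strings, and the substring-matching rule are all assumptions made for illustration, and the example records are invented.

```python
# Illustrative sketch only; field names and unit strings are hypothetical,
# not the ChEMBL schema or the published KNIME workflow.
TARGET_CELL_LINES = ("MDCK", "LLC-PK1", "RRCK")

# Factors converting a reported value to units of 1e-6 cm/s.
UNIT_FACTORS = {"1e-6 cm/s": 1.0, "nm/s": 0.1, "cm/s": 1e6}


def find_cell_line(description):
    """Return the first target cell line named in the description, else None."""
    text = description.upper()
    for name in TARGET_CELL_LINES:
        if name in text:
            return name
    return None


def curate(records):
    """Keep positive Papp values from target cell lines in a convertible unit."""
    curated = []
    for rec in records:
        cell_line = find_cell_line(rec["assay_description"])
        factor = UNIT_FACTORS.get(rec["units"])
        if cell_line is None or factor is None or rec["value"] <= 0:
            continue  # wrong cell line, unknown unit, or non-physical value
        curated.append({
            "compound": rec["compound"],
            "cell_line": cell_line,
            "papp_1e6_cm_s": rec["value"] * factor,
        })
    return curated


# Invented example records.
sample = [
    {"compound": "cmpd_1", "assay_description": "Apparent permeability in MDCK cells",
     "value": 12.0, "units": "1e-6 cm/s"},
    {"compound": "cmpd_2", "assay_description": "Permeability across Caco-2 monolayer",
     "value": 8.0, "units": "1e-6 cm/s"},
    {"compound": "cmpd_3", "assay_description": "Papp in LLC-PK1 cells",
     "value": 50.0, "units": "nm/s"},
    {"compound": "cmpd_4", "assay_description": "Papp in RRCK cells",
     "value": 3.0, "units": "%"},
]
curated = curate(sample)
```

Substring matching is deliberately crude here (it cannot tell a wild-type line from a transfected variant), which is one reason a real curation workflow would likely still need protocol-level checks against the source articles.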
RESUMO
To analyze the Chimiothèque Nationale (CN), the French National Compound Library, in the context of screening and biologically relevant compounds, the library was compared with the ZINC in-stock collection and ChEMBL. The comparison covers chemical space coverage, physicochemical properties, and Bemis-Murcko (BM) scaffold populations. More than 5 K CN-unique scaffolds (relative to the ZINC and ChEMBL collections) were identified. Generative Topographic Maps (GTMs) accommodating those libraries were generated and used to compare the compound populations. Hierarchical GTM ("zooming") was applied to generate an ensemble of maps at various resolution levels, from a global overview to precise mapping of individual structures. The respective maps were added to the ChemSpace Atlas website. The analysis of synthetic accessibility in the context of combinatorial chemistry showed that only 29.7% of CN compounds can be fully synthesized using commercially available building blocks.
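The notion of a library-unique scaffold can be sketched as a set difference. This minimal Python example assumes the BM scaffolds have already been computed and written as canonical SMILES strings by the same toolkit for every library, so that string equality implies scaffold identity; the function name and the toy scaffold sets are illustrative, not data from the study.

```python
def unique_scaffolds(library, *references):
    """Return scaffolds present in `library` but in none of the reference sets."""
    unique = set(library)
    for ref in references:
        unique -= set(ref)
    return unique


# Toy scaffold sets (canonical SMILES assumed precomputed elsewhere).
cn_scaffolds = {"c1ccccc1", "c1ccncc1", "C1CCOC1", "c1ccc2ccccc2c1"}
zinc_scaffolds = {"c1ccccc1", "C1CCNCC1"}
chembl_scaffolds = {"c1ccncc1", "c1ccccc1"}

cn_only = unique_scaffolds(cn_scaffolds, zinc_scaffolds, chembl_scaffolds)
```

Here `cn_only` holds the two toy scaffolds that appear in neither reference set.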
Assuntos
Bases de Dados de Compostos Químicos
RESUMO
Diabetes is a chronic hyperglycemic disorder that leads to a group of metabolic diseases. This chronic hyperglycemia is caused by abnormal insulin levels. The impact of hyperglycemia on the human vascular tree is the leading cause of disease and death in type 1 and type 2 diabetes. People with type 2 diabetes mellitus (T2DM) have abnormal insulin secretion as well as abnormal insulin action. Type 2 (non-insulin-dependent) diabetes is caused by a combination of genetic factors associated with decreased insulin production, insulin resistance, and environmental conditions. These conditions include overeating, lack of exercise, obesity, and aging. Glucose transport limits the rate at which dietary glucose is used by fat and muscle. The glucose transporter GLUT4 is retained intracellularly and sorted dynamically, and GLUT4 translocation, or insulin-regulated vesicular traffic, distributes it to the plasma membrane. Different chemical compounds have antidiabetic properties. The complexity, metabolism, digestion, and interaction of these chemical compounds make it difficult to understand and apply them to reduce chronic inflammation and thus prevent chronic disease. In this study, we applied a virtual screening approach to identify the most suitable and druggable chemical compounds as potential drug candidates against T2DM. Of the 5000 chemical compounds analyzed, only two were found to be the most promising on the basis of molecular docking studies and virtual screening through Lipinski's rule and ADMET properties.
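As an illustration of the Lipinski filtering step mentioned above, the following Python sketch applies the rule of five (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) to precomputed descriptors. The `Descriptors` class, the compound names and their values are hypothetical, and tolerating one violation is a common convention assumed here, not a detail taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    name: str
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_lipinski(d, max_violations=1):
    """Assumed convention: accept compounds with at most one violation."""
    return lipinski_violations(d) <= max_violations


# Hypothetical compounds with invented descriptor values.
library = [
    Descriptors("cmpd_A", 320.4, 2.1, 2, 5),
    Descriptors("cmpd_B", 612.7, 6.3, 4, 9),
    Descriptors("cmpd_C", 510.0, 3.0, 1, 8),
]
hits = [d.name for d in library if passes_lipinski(d)]
```

In a screening pipeline of the kind described, a filter like this would typically run before docking and ADMET prediction to discard compounds unlikely to be orally bioavailable.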
RESUMO
Public repositories containing compound-bioactivity data for millions of small molecules offer a valuable resource for chemogenomic compound candidate search. Nonetheless, owing to nonuniform data mining, these databases are often incomplete, which argues for combining data from several repositories to increase target coverage and data accuracy. Here, we present a workflow to generate custom datasets from public databases for mining chemogenomic compound candidates. The compiled set provides flags for differences in structural and bioactivity data and enables rapid extraction of potent and selective bioactive compounds.
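A minimal Python sketch of how records from several repositories might be merged and flagged is given below. It is an assumption-laden illustration rather than the published workflow: the record fields (`compound_key`, `target`, `source`, `smiles`, `pactivity`), the repository names, and the one-log-unit threshold for an activity conflict are all hypothetical.

```python
from collections import defaultdict


def merge_with_flags(records, max_log_spread=1.0):
    """Group records by (compound, target) and flag cross-source disagreements."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["compound_key"], rec["target"])].append(rec)

    merged = []
    for (compound, target), recs in sorted(grouped.items()):
        values = [r["pactivity"] for r in recs]
        merged.append({
            "compound_key": compound,
            "target": target,
            "sources": sorted({r["source"] for r in recs}),
            "mean_pactivity": sum(values) / len(values),
            # Different structure strings for the same compound key.
            "structure_conflict": len({r["smiles"] for r in recs}) > 1,
            # Reported potencies differ by more than the allowed spread.
            "activity_conflict": max(values) - min(values) > max_log_spread,
        })
    return merged


# Invented records from two hypothetical repositories.
records = [
    {"compound_key": "K1", "target": "T1", "source": "repo_A", "smiles": "CCO", "pactivity": 7.0},
    {"compound_key": "K1", "target": "T1", "source": "repo_B", "smiles": "CCO", "pactivity": 9.0},
    {"compound_key": "K2", "target": "T1", "source": "repo_A", "smiles": "CCN", "pactivity": 6.5},
    {"compound_key": "K2", "target": "T1", "source": "repo_B", "smiles": "CC[NH3+]", "pactivity": 6.7},
]
merged = merge_with_flags(records)
```

Keeping the flags alongside the merged values, instead of silently discarding conflicting entries, leaves the decision about which records to trust to the user of the compiled set.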