RESUMO
BACKGROUND: Warfarin is a common oral anticoagulant, and its effects vary widely among individuals. Numerous dose-prediction algorithms have been reported based on cross-sectional data generated via multiple linear regression or machine learning. This study aimed to construct an information fusion perturbation theory and machine learning prediction model of warfarin blood levels based on clinical longitudinal data from cardiac surgery patients. METHODS AND MATERIAL: The data of 246 patients were obtained from electronic medical records. Continuous variables were processed by calculating the distance of the raw data with the moving average (MA ∆vki(sj)), and categorical variables in different attribute groups were processed using Euclidean distance (ED Ç∆vk(sj)Ç). Regression and classification analyses were performed on the raw data, MA ∆vki(sj), and ED Ç∆vk(sj)Ç. Different machine-learning algorithms were chosen for the STATISTICA and WEKA software. RESULTS: The random forest (RF) algorithm was the best for predicting continuous outputs using the raw data. The correlation coefficients of the RF algorithm were 0.978 and 0.595 for the training and validation sets, respectively, and the mean absolute errors were 0.135 and 0.362 for the training and validation sets, respectively. The proportion of ideal predictions of the RF algorithm was 59.0%. General discriminant analysis (GDA) was the best algorithm for predicting the categorical outputs using the MA ∆vki(sj) data. The GDA algorithm's total true positive rate (TPR) was 95.4% and 95.6% for the training and validation sets, respectively, with MA ∆vki(sj) data. CONCLUSIONS: An information fusion perturbation theory and machine learning model for predicting warfarin blood levels was established. A model based on the RF algorithm could be used to predict the target international normalized ratio (INR), and a model based on the GDA algorithm could be used to predict the probability of being within the target INR range under different clinical scenarios.
RESUMO
Familial hypercholesterolemia (FH) is an inherited metabolic disease affecting cholesterol metabolism, with 90% of cases caused by mutations in the LDL receptor gene (LDLR), primarily missense mutations. This study aims to integrate six commonly used predictive software to create a new model for predicting LDLR mutation pathogenicity and mapping hot spot residues. Six predictive-software are selected: Polyphen-2, SIFT, MutationTaster, REVEL, VARITY, and MLb-LDLr. Software accuracy is tested with the characterized variants annotated in ClinVar and, by bioinformatic and machine learning techniques all models are integrated into a more accurate one. The resulting optimized model presents a specificity of 96.71% and a sensitivity of 98.36%. Hot spot residues with high potential of pathogenicity appear across all domains except for the signal peptide and the O-linked domain. In addition, translating this information into 3D structure of the LDLr highlights potentially pathogenic clusters within the different domains, which may be related to specific biological function. The results of this work provide a powerful tool to classify LDLR pathogenic variants. Moreover, an open-access guide user interface (OptiMo-LDLr) is provided to the scientific community. This study shows that combination of several predictive software results in a more accurate prediction to help clinicians in FH diagnosis.
Assuntos
Hiperlipoproteinemia Tipo II , Humanos , Fenótipo , Mutação , Hiperlipoproteinemia Tipo II/diagnóstico , Hiperlipoproteinemia Tipo II/genética , Receptores de LDL/genética , Receptores de LDL/metabolismo , Simulação por ComputadorRESUMO
The theoretical prediction of drug-decorated nanoparticles (DDNPs) has become a very important task in medical applications. For the current paper, Perturbation Theory Machine Learning (PTML) models were built to predict the probability of different pairs of drugs and nanoparticles creating DDNP complexes with anti-glioblastoma activity. PTML models use the perturbations of molecular descriptors of drugs and nanoparticles as inputs in experimental conditions. The raw dataset was obtained by mixing the nanoparticle experimental data with drug assays from the ChEMBL database. Ten types of machine learning methods have been tested. Only 41 features have been selected for 855,129 drug-nanoparticle complexes. The best model was obtained with the Bagging classifier, an ensemble meta-estimator based on 20 decision trees, with an area under the receiver operating characteristic curve (AUROC) of 0.96, and an accuracy of 87% (test subset). This model could be useful for the virtual screening of nanoparticle-drug complexes in glioblastoma. All the calculations can be reproduced with the datasets and python scripts, which are freely available as a GitHub repository from authors.
Assuntos
Antineoplásicos/administração & dosagem , Neoplasias Encefálicas/tratamento farmacológico , Sistemas de Liberação de Medicamentos , Glioblastoma/tratamento farmacológico , Aprendizado de Máquina , Nanopartículas , Bases de Dados de Compostos Químicos , Bases de Dados de Produtos Farmacêuticos , Portadores de Fármacos/administração & dosagem , Desenho de Fármacos , Ensaios de Seleção de Medicamentos Antitumorais , Humanos , Nanopartículas/administração & dosagem , Interface Usuário-ComputadorRESUMO
Sarcomas are a group of malignant neoplasms of connective tissue with a different etiology than carcinomas. The efforts to discover new drugs with antisarcoma activity have generated large datasets of multiple preclinical assays with different experimental conditions. For instance, the ChEMBL database contains outcomes of 37,919 different antisarcoma assays with 34,955 different chemical compounds. Furthermore, the experimental conditions reported in this dataset include 157 types of biological activity parameters, 36 drug targets, 43 cell lines, and 17 assay organisms. Considering this information, we propose combining perturbation theory (PT) principles with machine learning (ML) to develop a PTML model to predict antisarcoma compounds. PTML models use one function of reference that measures the probability of a drug being active under certain conditions (protein, cell line, organism, etc.). In this paper, we used a linear discriminant analysis and neural network to train and compare PT and non-PT models. All the explored models have an accuracy of 89.19-95.25% for training and 89.22-95.46% in validation sets. PTML-based strategies have similar accuracy but generate simplest models. Therefore, they may become a versatile tool for predicting antisarcoma compounds.
RESUMO
Nanosystems are gaining momentum in pharmaceutical sciences because of the wide variety of possibilities for designing these systems to have specific functions. Specifically, studies of new cancer cotherapy drug-vitamin release nanosystems (DVRNs) including anticancer compounds and vitamins or vitamin derivatives have revealed encouraging results. However, the number of possible combinations of design and synthesis conditions is remarkably high. In addition, a large number of anticancer and vitamin derivatives have been already assayed, but a notably less number of cases of DVRNs were assayed as a whole (with the anticancer compound and the vitamin linked to them). Our approach combines with the perturbation theory and machine learning (PTML) model to predict the probability of obtaining an interesting DVRN by changing the anticancer compound and/or the vitamin present in a DVRN that is already tested for other anticancer compounds or vitamins that have not been tested yet as part of a DVRN. In a previous work, we built a linear PTML model useful for the design of these nanosystems. In doing so, we used information fusion (IF) techniques to carry out data enrichment of DVRN data compiled from the literature with the data for preclinical assays of vitamins from the ChEMBL database. The design features of DVRNs and the assay conditions of nanoparticles (NPs) and vitamins were included as multiplicative PT operators (PTOs) to the system, which indicates the importance of these variables. However, the previous work omitted experiments with nonlinear ML techniques and different types of PTOs such as metric-based PTOs. More importantly, the previous work does not consider the structure of the anticancer drug to be included in the new DVRNs. In this work, we are going to accomplish three main objectives (tasks). In the first task, we found a new model, alternative to the one published before, for the rational design of DVRNs using metric-based PTOs. The most accurate PTML model was the artificial neural network model, which showed values of specificity, sensitivity, and accuracy in the range of 90-95% in training and external validation series for more than 130,000 cases (DVRNs vs ChEMBL assays). Furthermore, in the second task, we used IF techniques to carry out data enrichment of our previous data set. In doing so, we constructed a new working data set of >970,000 cases with the data of preclinical assays of DVRNs, vitamins, and anticancer compounds from the ChEMBL database. All these assays have multiple continuous variables or descriptors dk and categorical variables cj (conditions of the assays) for drugs (dack, cacj), vitamins (dvk, cvj), and NPs (dnk, cnj). These data include >20,000 potential anticancer compounds with >270 protein targets (cac1), >580 assay cell organisms (cac2), and so forth. Furthermore, we include >36,000 assay vitamin derivatives in >6200 types of cells (c2vit), >120 assay organisms (c3vit), >60 assay strains (c4vit), and so forth. The enriched data set also contains >20 types of DVRNs (c5n) with 9 NP core materials (c4n), 8 synthesis methods (c7n), and so forth. We expressed all this information with PTOs and developed a qualitatively new PTML model that incorporates information of the anticancer drugs. This new model presents 96-97% of accuracy for training and external validation subsets. In the last task, we carried out a comparative study of ML and/or PTML models published and described how the models we are presenting cover the gap of knowledge in terms of drug delivery. In conclusion, we present here for the first time a multipurpose PTML model that is able to select NPs, anticancer compounds, and vitamins and their conditions of assay for DVRN design.
Assuntos
Antineoplásicos/administração & dosagem , Protocolos de Quimioterapia Combinada Antineoplásica/administração & dosagem , Sistemas de Liberação de Medicamentos/métodos , Nanopartículas/química , Neoplasias/tratamento farmacológico , Vitaminas/administração & dosagem , Big Data , Simulação por Computador , Bases de Dados Factuais , Liberação Controlada de Fármacos , Modelos Lineares , Aprendizado de MáquinaRESUMO
Determining the biological activity of vitamin derivatives is needed given that organic synthesis of analogs of vitamins is an active field of interest for medicinal chemistry, pharmaceuticals, and food additives. Accordingly, scientists from different disciplines perform preclinical assays (nij) with a considerable combination of assay conditions (cj). Indeed, the ChEMBL platform contains a database that includes results from 36â¯220 different biological activity bioassays of 21â¯240 different vitamins and vitamin derivatives. These assays present are heterogeneous in terms of assay combinations of cj. They are focused on >500 different biological activity parameters (c0), >340 different targets (c1), >6200 types of cell (c2), >120 organisms of assay (c3), and >60 assay strains (c4). It includes a total of >1850 niacin assays, >1580 tretinoin assays, >1580 retinol assays, 857 ascorbic acid assays, etc. Given the complexity of this combinatorial data in terms of being assimilated by researchers, we propose to build a model by combining perturbation theory (PT) and machine learning (ML). Through this study, we propose a PTML (PT + ML) combinatorial model for ChEMBL results on biological activity of vitamins and vitamins derivatives. The linear discriminant analysis (LDA) model presented the following results for training subset a: specificity (%) = 90.38, sensitivity (%) = 87.51, and accuracy (%) = 89.89. The model showed the following results for the external validation subset: specificity (%) = 90.58, sensitivity (%) = 87.72, and accuracy (%) = 90.09. Different types of linear and nonlinear PTML models, such as logistic regression (LR), classification tree (CT), näive Bayes (NB), and random Forest (RF), were applied to contrast the capacity of prediction. The PTML-LDA model predicts with more accuracy by applying combinatorial descriptors. In addition, a PCA experiment with chemical structure descriptors allowed us to characterize the high structural diversity of the chemical space studied. In any case, PTML models using chemical structure descriptors do not improve the performance of the PTML-LDA model based on ALOGP and PSA. We can conclude that the three variable PTML-LDA model is a simplified and adaptable tool for the prediction, for different experiment combinations, the biological activity of derivative vitamins.
Assuntos
Teorema de Bayes , Técnicas de Química Combinatória , Aprendizado de Máquina , Modelos Estatísticos , Vitaminas/química , Bases de Dados Factuais , Estrutura Molecular , Vitaminas/síntese químicaRESUMO
AIMS: Cheminformatics models are able to predict different outputs (activity, property, chemical reactivity) in single molecules or complex molecular systems (catalyzed organic synthesis, metabolic reactions, nanoparticles, etc.). BACKGROUND: Cheminformatics models are able to predict different outputs (activity, property, chemical reactivity) in single molecules or complex molecular systems (catalyzed organic synthesis, metabolic reactions, nanoparticles, etc.). OBJECTIVE: Cheminformatics prediction of complex catalytic enantioselective reactions is a major goal in organic synthesis research and chemical industry. Markov Chain Molecular Descriptors (MCDs) have been largely used to solve Cheminformatics problems. There are different types of Markov chain descriptors such as Markov-Shannon entropies (Shk), Markov Means (Mk), Markov Moments (πk), etc. However, there are other possible MCDs that have not been used before. In addition, the calculation of MCDs is done very often using specific software not always available for general users and there is not an R library public available for the calculation of MCDs. This fact, limits the availability of MCMDbased Cheminformatics procedures. METHODS: We studied the enantiomeric excess ee(%)[Rcat] for 324 α-amidoalkylation reactions. These reactions have a complex mechanism depending on various factors. The model includes MCDs of the substrate, solvent, chiral catalyst, product along with values of time of reaction, temperature, load of catalyst, etc. We tested several Machine Learning regression algorithms. The Random Forest regression model has R2 > 0.90 in training and test. Secondly, the biological activity of 5644 compounds against colorectal cancer was studied. RESULTS: We developed very interesting model able to predict with Specificity and Sensitivity 70-82% the cases of preclinical assays in both training and validation series. CONCLUSION: The work shows the potential of the new tool for computational studies in organic and medicinal chemistry.
Assuntos
Quimioinformática , Química Farmacêutica , Cadeias de Markov , Algoritmos , Humanos , Aprendizado de MáquinaRESUMO
Nano-systems for cancer co-therapy including vitamins or vitamin derivatives have showed adequate results to continue with further research studies to better understand them. However, the number of different combinations of drugs, vitamins, nanoparticle types, coating agents, synthesis conditions, and system types (nanocapsules, micelles, etc.) to be tested is very large generating a high cost in experimentations. In this context, there are reports of large datasets of preclinical assays of compounds (like in the ChEMBL database) and increasing but yet limited reports of experimental measurements of nano-systems per se. On the other hand, Machine Learning is gaining momentum in Nanotechnology and Pharmaceutical Sciences as a tool for rational design of new drugs and drug-release nano-systems. In this work, we propose to combine Perturbation Theory principles and Machine Learning to develop a PTML model for rational selection of the components of cancer co-therapy drug-vitamin release nano-systems (DVRNs). In doing so, we apply information fusion techniques with 2 data sets: (1) a large ChEMBL dataset of >36 000 preclinical assays of vitamin derivatives and a new dataset of >1000 outcomes of DVRNs, collected herein from the literature for the first time. The ChEMBL dataset used covers a considerable number of assay conditions (cjvit) each one with multiple levels. These conditions included >504 biological activity parameters (c0vit), >340 types of proteins (c1vit), >650 types of cells (c2vit), >120 assay organisms (c3vit), >60 assay strains (c4vit). Regarding the DVRNs, there are 25 different types of nano-systems (njn), with up to 16 conditions (cjn) including also different levels such as 8 biological activity parameters (c0n), 9 raw nanomaterials (c4n), 15 assay cells (c11n), etc. In the first stage, we used Moving Average operators to quantify the perturbations (deviations) in all input variables with respect to the conditions. After that, we used multiplicative PT operators to carry out data fusion, and dimension reduction, and Linear Discriminant Analysis (LDA) to seek the PTML model. The best PTML model found showed values of specificity, sensitivity, and accuracy in the range of 83-88% in training and external validation series for >130 000 cases (DVRNs vs. ChEMBL data pairs) formed after data fusion. To the best of our knowledge, this is the first general purpose model for the rational design of DVRNs for cancer co-therapy.
Assuntos
Sistemas de Liberação de Medicamentos , Aprendizado de Máquina , Modelos Biológicos , Nanopartículas , Neoplasias , Vitaminas , Humanos , Micelas , Nanopartículas/química , Nanopartículas/uso terapêutico , Neoplasias/tratamento farmacológico , Neoplasias/metabolismo , Neoplasias/patologia , Vitaminas/química , Vitaminas/farmacocinética , Vitaminas/farmacologiaRESUMO
Determining the target proteins of new anticancer compounds is a very important task in Medicinal Chemistry. In this sense, chemists carry out preclinical assays with a high number of combinations of experimental conditions (c j). In fact, ChEMBL database contains outcomes of 65â¯534 different anticancer activity preclinical assays for 35â¯565 different chemical compounds (1.84 assays per compound). These assays cover different combinations of c j formed from >70 different biological activity parameters ( c0), >300 different drug targets ( c1), >230 cell lines ( c2), and 5 organisms of assay ( c3) or organisms of the target ( c4). It include a total of 45â¯833 assays in leukemia, 6227 assays in breast cancer, 2499 assays in ovarian cancer, 3499 in colon cancer, 3159 in lung cancer, 2750 in prostate cancer, 601 in melanoma, etc. This is a very complex data set with multiple Big Data features. This data is hard to be rationalized by researchers to extract useful relationships and predict new compounds. In this context, we propose to combine perturbation theory (PT) ideas and machine learning (ML) modeling to solve this combinatorial-like problem. In this work, we report a PTML (PT + ML) model for ChEMBL data set of preclinical assays of anticancer compounds. This is a simple linear model with only three variables. The model presented values of area under receiver operating curve = AUROC = 0.872, specificity = Sp(%) = 90.2, sensitivity = Sn(%) = 70.6, and overall accuracy = Ac(%) = 87.7 in training series. The model also have Sp(%) = 90.1, Sn(%) = 71.4, and Ac(%) = 87.8 in external validation series. The model use PT operators based on multicondition moving averages to capture all the complexity of the data set. We also compared the model with nonlinear artificial neural network (ANN) models obtaining similar results. This confirms the hypothesis of a linear relationship between the PT operators and the classification as anticancer compounds in different combinations of assay conditions. Last, we compared the model with other PTML models reported in the literature concluding that this is the only one PTML model able to predict activity against multiple types of cancer. This model is a simple but versatile tool for the prediction of the targets of anticancer compounds taking into consideration multiple combinations of experimental conditions in preclinical assays.