ABSTRACT
BACKGROUND: In this study, we present an SVM-based ranking algorithm for concurrently learning compounds with different activity profiles and their varying prioritization. To this end, we elaborated a specific labeling of each compound in order to infer virtual screening models against multiple targets. We compared the method with several state-of-the-art SVM classification techniques capable of inferring multi-target screening models on three chemical data sets (cytochrome P450s, dehydrogenases, and a trypsin-like protease data set), each containing three different biological targets. RESULTS: The experiments show that the ranking-based algorithms achieve increased performance for single- and multi-target virtual screening. Moreover, compared with other multi-target SVM methods, compounds that do not completely fulfill the desired activity profile are still ranked higher than decoys or compounds with an entirely undesired profile. CONCLUSIONS: SVM-based ranking methods constitute a valuable approach for virtual screening in multi-target drug design. They are most helpful when dealing with compounds with various activity profiles and when many ligands with an already perfectly matching activity profile cannot be expected.
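The ranking idea can be illustrated with a pairwise transform: compounds receive rank labels according to how well they match the desired activity profile, and a linear SVM is trained on feature differences of unequally ranked pairs. The following Python sketch uses simulated fingerprints and scikit-learn; the labels, features, and solver are illustrative assumptions, not the authors' implementation.

```python
# Minimal RankSVM sketch: a pairwise transform followed by a linear SVM.
# Rank labels, random fingerprints, and the scikit-learn solver are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))        # toy compound fingerprints
rank = rng.integers(0, 3, size=60)   # 2 = full profile, 1 = partial, 0 = decoy

# One difference vector per pair of compounds with unequal ranks.
pairs, signs = [], []
for i, j in combinations(range(len(X)), 2):
    if rank[i] != rank[j]:
        pairs.append(X[i] - X[j])
        signs.append(1 if rank[i] > rank[j] else -1)

clf = LinearSVC(C=1.0).fit(np.asarray(pairs), np.asarray(signs))

# Screening: sort the library by the learned linear scoring function, so
# partial matches land above decoys even without perfect-profile hits.
library_order = np.argsort(-(X @ clf.coef_.ravel()))
print(rank[library_order][:10])
```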
ABSTRACT
In systems biology, combining multiple types of omics data, such as metabolomics, proteomics, transcriptomics, and genomics, yields more information about a biological process than the analysis of a single data type. Thus, data from different omics platforms are usually combined in one experimental setup to gain insight into a biological process or a disease state. In particular, high-accuracy metabolomics data from modern mass spectrometry instruments are increasingly being integrated into biological studies. Reflecting this trend, we extended InCroMAP, a data integration, analysis, and visualization tool for genomics, transcriptomics, and proteomics data. The tool is now able to perform an integrated enrichment analysis and pathway-based visualization of multi-omics data, making it suitable for the evaluation of comprehensive systems biology studies.
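A common building block of such integrated enrichment analyses is a per-pathway overrepresentation test. The sketch below shows a hypergeometric test with made-up counts via SciPy; it illustrates the general statistics only and is not InCroMAP's actual code.

```python
# Hypergeometric overrepresentation test for one pathway, with made-up
# counts; illustrates the statistics behind an integrated enrichment
# analysis, not InCroMAP's implementation.
from scipy.stats import hypergeom

universe = 20000      # all measured genes/proteins/metabolites
pathway_size = 150    # pathway members within the universe
hits = 300            # significantly altered entities in the experiment
overlap = 12          # altered entities that are pathway members

# P(X >= overlap) under random sampling from the universe
p_value = hypergeom.sf(overlap - 1, universe, pathway_size, hits)
print(f"enrichment p-value: {p_value:.3g}")
```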
Subject(s)
Computational Biology/methods, Database Management Systems, Genetic Databases, Software, Microarray Analysis, User-Computer Interface
ABSTRACT
BACKGROUND: A plethora of studies indicate that the development of multi-target drugs is beneficial for complex diseases like cancer. Accurate QSAR models for each of the desired targets assist the optimization of a lead candidate by predicting affinity profiles. Often, the targets of a multi-target drug are sufficiently similar that, in principle, knowledge can be transferred between the QSAR models to improve model accuracy. In this study, we present two different multi-task algorithms from the field of transfer learning that exploit the similarity between several targets to transfer knowledge between the target-specific QSAR models. RESULTS: We evaluated the two methods on simulated data and on a data set of 112 human kinases assembled from the public database ChEMBL. The relatedness between the kinase targets was derived from the taxonomy of the human kinome. The experiments show that, on both types of data, multi-task learning increases performance compared with training separate models, given sufficient similarity between the tasks. On the kinase data, the best multi-task approach improved the mean squared error of the QSAR models for 58 kinase targets. CONCLUSIONS: Multi-task learning is a valuable approach for inferring multi-target QSAR models for lead optimization. Its application is most beneficial if knowledge can be transferred from a similar task with abundant in-domain knowledge to a task with little in-domain knowledge. Furthermore, the benefit increases as the overlap between the chemical spaces spanned by the tasks decreases.
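One standard multi-task construction, shown below as a Python sketch, is feature augmentation in the style of Evgeniou and Pontil: each compound vector is copied into a shared block plus a task-specific block, so a single linear model learns common and target-specific effects at once. The simulated data and ridge regression are assumptions; the study's actual algorithms differ.

```python
# Multi-task sketch via feature augmentation (Evgeniou/Pontil style):
# a shared feature block plus one block per task lets one linear model
# transfer knowledge between related tasks. Simulated data and ridge
# regression are assumptions; not the study's algorithms.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_tasks, d = 3, 8
w_shared = rng.normal(size=d)

X_list, y_list, t_list = [], [], []
for t in range(n_tasks):
    Xt = rng.normal(size=(40, d))
    w_t = w_shared + 0.3 * rng.normal(size=d)   # tasks share a common core
    X_list.append(Xt); y_list.append(Xt @ w_t); t_list.append(np.full(40, t))

X, y, task = np.vstack(X_list), np.concatenate(y_list), np.concatenate(t_list)

def augment(X, task, n_tasks):
    """Lay features out as [shared | task 0 | ... | task T-1]."""
    Z = np.zeros((len(X), X.shape[1] * (1 + n_tasks)))
    Z[:, :X.shape[1]] = X
    for i, t in enumerate(task):
        Z[i, X.shape[1] * (1 + t): X.shape[1] * (2 + t)] = X[i]
    return Z

model = Ridge(alpha=1.0).fit(augment(X, task, n_tasks), y)
```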
ABSTRACT
OBJECTIVE: Nonalcoholic fatty liver (NAFL) is thought to contribute to insulin resistance and its metabolic complications. However, some individuals with NAFL remain insulin sensitive. The mechanisms underlying the susceptibility to develop insulin resistance in humans with NAFL are largely unknown. We investigated circulating markers and mechanisms of metabolically benign and malignant NAFL by applying a metabolomic approach. RESEARCH DESIGN AND METHODS: A total of 265 metabolites were analyzed before and after a 9-month lifestyle intervention in plasma from 20 insulin-sensitive and 20 insulin-resistant subjects with NAFL. The relevant plasma metabolites were then tested for relationships with insulin sensitivity in 17 subjects without NAFL and in plasma from 29 subjects with liver tissue samples. RESULTS: The best separation of the insulin-sensitive from the insulin-resistant NAFL group was achieved by a metabolite pattern including the branched-chain amino acids leucine and isoleucine, ornithine, the acylcarnitines C3:0-, C16:0-, and C18:0-carnitine, and lysophosphatidylcholine (lyso-PC) C16:0 (area under the ROC curve, 0.77 [P = 0.00023] at baseline and 0.80 [P = 0.000019] at follow-up). Among the individual metabolites, predominantly higher levels of lyso-PC C16:0, both at baseline (P = 0.0039) and at follow-up (P = 0.001), were found in the insulin-sensitive compared with the insulin-resistant subjects. In the non-NAFL groups, no differences in lyso-PC C16:0 levels were found between the insulin-sensitive and insulin-resistant subjects, and these relationships were replicated in plasma from subjects with liver tissue samples. CONCLUSIONS: Within a plasma metabolomic pattern, particularly the lyso-PCs are able to separate metabolically benign from malignant NAFL in humans and may highlight important pathways in the pathogenesis of fatty liver-induced insulin resistance.
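For illustration, the separation quality of such a metabolite panel can be quantified with an ROC analysis of a simple panel score, as in the Python sketch below. The data are simulated and the logistic panel score is an assumption, not the study's statistical pipeline.

```python
# Illustrative ROC analysis of a metabolite panel separating two groups.
# Simulated data; the logistic panel score is an assumption, and the
# reported AUC here is in-sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
# 7 hypothetical panel metabolites (e.g., BCAAs, acylcarnitines, lyso-PC)
X_sensitive = rng.normal(0.3, 1.0, size=(20, 7))
X_resistant = rng.normal(-0.3, 1.0, size=(20, 7))
X = np.vstack([X_sensitive, X_resistant])
y = np.array([1] * 20 + [0] * 20)

panel = LogisticRegression().fit(X, y)
print(f"panel AUC: {roc_auc_score(y, panel.predict_proba(X)[:, 1]):.2f}")
```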
Subject(s)
Biomarkers/blood, Fatty Liver/blood, Insulin Resistance, Lysophosphatidylcholines/blood, Adult, Fatty Liver/physiopathology, Female, Humans, Liver/metabolism, Male, Metabolomics, Middle Aged, Non-alcoholic Fatty Liver Disease
ABSTRACT
BACKGROUND: Ligand-based virtual screening plays a fundamental part in the early drug discovery stage. In a virtual screening, a chemical library is searched for molecules with properties similar to a query molecule by means of a similarity function. The optimal assignment of chemical graphs has proven to be a valuable similarity function for many cheminformatics tasks, such as virtual screening. However, the optimal assignment assumes all atoms of a query molecule to be equally important, which, depending on the binding mode of a ligand, is not realistic. The importance of a query molecule's atoms can be integrated into the optimal assignment by weighting the assignment edges. We optimized the edge weights with respect to the virtual screening performance by means of evolutionary algorithms. Furthermore, we propose a visualization approach for the interpretation of the edge weights. RESULTS: We evaluated two different evolutionary algorithms, differential evolution and particle swarm optimization, for their suitability for optimizing the assignment edge weights. The results showed that both optimization methods are suited to the task. Furthermore, we compared our approach with the optimal assignment using equal edge weights and with two literature similarity functions on a subset of the Directory of Useful Decoys, using sophisticated virtual screening performance metrics. Our approach achieved a considerably better overall and early enrichment performance. The visualization of the edge weights enables the identification of substructures that are important for a good retrieval of ligands and for binding to the protein target. CONCLUSIONS: The optimization of the edge weights in optimal assignment methods is a valuable approach for ligand-based virtual screening experiments. It can be applied to any similarity function that employs the optimal assignment method, which includes a variety of similarity measures that have proven valuable in various cheminformatics tasks. The proposed visualization helps to gain a better understanding of the binding mode of the analyzed query molecule.
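A minimal version of the weight optimization can be sketched as follows: a weighted optimal assignment similarity is computed with the Hungarian algorithm, and differential evolution searches for per-atom weights that separate toy actives from decoys. The atom kernel, fitness, and data below are assumptions; the paper's similarity function and screening-based fitness are more elaborate.

```python
# Toy sketch: tune per-atom assignment edge weights with differential
# evolution. Random atom-feature "molecules", an RBF atom kernel, and a
# simple active/decoy separation fitness are assumptions here.
import numpy as np
from scipy.optimize import differential_evolution, linear_sum_assignment

rng = np.random.default_rng(3)
query = rng.normal(size=(6, 4))                     # 6 query atoms, 4 features
actives = [query + 0.2 * rng.normal(size=query.shape) for _ in range(10)]
decoys = [rng.normal(size=(6, 4)) for _ in range(10)]

def oa_similarity(q, m, w):
    """Weighted optimal assignment of query atoms onto molecule atoms."""
    k = np.exp(-((q[:, None, :] - m[None, :, :]) ** 2).sum(-1))
    rows, cols = linear_sum_assignment(-(w[:, None] * k))  # maximize
    return (w[rows] * k[rows, cols]).sum()

def fitness(w):
    # negative separation of actives from decoys (DE minimizes)
    s_act = np.mean([oa_similarity(query, m, w) for m in actives])
    s_dec = np.mean([oa_similarity(query, m, w) for m in decoys])
    return -(s_act - s_dec)

result = differential_evolution(fitness, bounds=[(0.0, 1.0)] * 6,
                                maxiter=50, seed=3)
print("optimized atom weights:", np.round(result.x, 2))
```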
ABSTRACT
BACKGROUND: Metabolomics is a powerful tool that is increasingly used in clinical research. Although excellent sample quality is essential, it can easily be compromised by undetected preanalytical errors. We set out to identify critical preanalytical steps and biomarkers that reflect preanalytical inaccuracies. METHODS: We systematically investigated the effects of preanalytical variables (blood collection tubes, hemolysis, temperature and time before further processing, and number of freeze-thaw cycles) on metabolomics studies of clinical blood and plasma samples using a nontargeted LC-MS approach. RESULTS: Serum and heparinate blood collection tubes led to chemical noise in the mass spectra. Distinct, significant changes in 64 features of the EDTA-plasma metabolome were detected when blood was exposed to room temperature for 2, 4, 8, and 24 h. The resulting pattern was characterized by increases in hypoxanthine and sphingosine 1-phosphate (800% and 380%, respectively, at 2 h). In contrast, the plasma metabolome was stable for up to 4 h when EDTA blood samples were immediately placed in iced water. Hemolysis also caused numerous changes in the metabolic profile. Unexpectedly, up to 4 freeze-thaw cycles changed the EDTA-plasma metabolome only slightly but increased the individual variability. CONCLUSIONS: Nontargeted metabolomics investigations led to the following recommendations for the preanalytical phase: test the blood collection tubes, avoid hemolysis, place whole blood immediately in ice water, use EDTA plasma, and preferably use nonrefrozen biobank samples. To exclude outliers due to preanalytical errors, inspect the biomarker signal intensities that reflect systematic as well as accidental preanalytical inaccuracies before the bioinformatic processing of the data.
Subject(s)
Blood Chemical Analysis/methods, Metabolomics/methods, Specimen Handling/standards, Blood Chemical Analysis/standards, Liquid Chromatography, Hemolysis, Humans, Metabolome, Metabolomics/standards, Principal Component Analysis, Quality Control, Specimen Handling/methods, Tandem Mass Spectrometry
ABSTRACT
BACKGROUND: The performance of 3D-based virtual screening similarity functions is affected by the conformations used for the compounds. Therefore, the results of 3D approaches are often less robust than those of 2D approaches. Applying 3D methods to multiple-conformer data sets usually mitigates this weakness but entails a significant computational overhead. We therefore developed a special conformational space encoding by means of Gaussian mixture models, together with a similarity function that operates on these models. The model-based encoding allows an efficient comparison of the conformational space of compounds. RESULTS: Comparisons of our 4D flexible atom-pair approach with over 15 state-of-the-art 2D- and 3D-based virtual screening similarity functions on the 40 data sets of the Directory of Useful Decoys show a robust performance of our approach. Even 3D-based approaches that operate on multiple conformers yield inferior results. The 4D flexible atom-pair method achieves an average AUC of 0.78 on the filtered Directory of Useful Decoys data sets, whereas the best 2D- and 3D-based approaches in this study yield AUC values of 0.74 and 0.72, respectively. Overall, the 4D flexible atom-pair approach achieves an average rank of 1.25 with respect to 15 other state-of-the-art similarity functions and four different evaluation metrics. CONCLUSIONS: Our 4D method yields a robust performance on 40 pharmaceutically relevant targets. The conformational space encoding enables an efficient comparison of conformational spaces, circumventing the weakness of 3D-based approaches that operate on single conformations. With over 100,000 similarity calculations on a single desktop CPU, the use of the 4D flexible atom-pair approach in real-world applications is feasible.
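The model-based encoding idea can be sketched by fitting a Gaussian mixture to atom-pair distances pooled over a conformer ensemble and comparing molecules through their mixture models, as below. The descriptor, the symmetrized log-likelihood similarity, and the random toy conformers are assumptions and differ from the paper's 4D flexible atom-pair function.

```python
# Sketch of a model-based conformational space encoding: pool atom-pair
# distances over a conformer ensemble, fit one Gaussian mixture per
# molecule, and compare molecules by a symmetrized log-likelihood.
# Random "conformers" and this similarity are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

def pair_distances(conformers):
    """All atom-pair distances, flattened across the ensemble."""
    out = []
    for xyz in conformers:                          # xyz: (n_atoms, 3)
        d = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)
        out.append(d[np.triu_indices(len(xyz), k=1)])
    return np.concatenate(out)[:, None]             # samples x 1 feature

def encode(conformers, k=4):
    return GaussianMixture(n_components=k, random_state=0).fit(
        pair_distances(conformers))

def similarity(gm_a, d_a, gm_b, d_b):
    # higher = each model explains the other's conformational space better
    return 0.5 * (gm_a.score(d_b) + gm_b.score(d_a))

confs_a = [rng.normal(size=(10, 3)) for _ in range(25)]
confs_b = [rng.normal(size=(10, 3)) for _ in range(25)]
gm_a, gm_b = encode(confs_a), encode(confs_b)
print(similarity(gm_a, pair_distances(confs_a), gm_b, pair_distances(confs_b)))
```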
ABSTRACT
BACKGROUND: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source for machine learning algorithms to infer such models. Besides strong performance, interpretability of a machine learning model is a desired property for guiding the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to deliver convincing performance on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique for interpreting linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity. RESULTS: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor. CONCLUSIONS: In combination with heat map coloring, linear support vector machine models can help guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method may be a starting point for the optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to gain a better understanding of the binding mode of an inhibitor.
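The core of the coloring is to project the weights of a linear model back onto atoms through the fingerprint features they occur in. The sketch below does this for Morgan/ECFP-style bits using RDKit; both the fingerprint type and RDKit itself are assumptions made for illustration, not the paper's exact setup.

```python
# Sketch of mapping linear-model weights back onto atoms: each atom
# collects a share of the weight of every fingerprint bit it occurs in.
# RDKit and Morgan bits are assumptions; the paper used its own features.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def atom_heat(mol, w, n_bits=2048, radius=2):
    bit_info = {}
    AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits,
                                          bitInfo=bit_info)
    heat = np.zeros(mol.GetNumAtoms())
    for bit, envs in bit_info.items():
        for center, rad in envs:
            atoms = {center}
            for b in Chem.FindAtomEnvironmentOfRadiusN(mol, rad, center):
                bond = mol.GetBondWithIdx(b)
                atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
            for a in atoms:             # share the bit weight over its atoms
                heat[a] += w[bit] / len(atoms)
    return heat                         # >0: activity-supporting substructure

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as an example
w = np.random.default_rng(5).normal(size=2048)      # stand-in SVM weights
print(np.round(atom_heat(mol, w), 2))
```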
ABSTRACT
BACKGROUND: The decomposition of a chemical graph is a convenient approach to encoding information about the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options, such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints into several formats. RESULTS: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g., CATS2D), radial fingerprints (e.g., Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest-path fingerprint, which includes only the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and describe the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint, a result comparable to the performance achieved by a nonlinear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published by Sutherland et al. CONCLUSIONS: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
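To convey what a depth-first search fingerprint computes, the sketch below enumerates labeled paths in a toy molecular graph and hashes them into a fixed-length bit vector. jCompoundMapper itself is a Java library; this Python sketch is purely conceptual and does not use its API.

```python
# Conceptual depth-first search path fingerprint on a toy labeled graph,
# mirroring the kind of decomposition jCompoundMapper implements
# (path canonicalization is omitted for brevity).
from hashlib import blake2b

def dfs_paths(graph, labels, max_depth):
    """Label strings of all simple paths with up to max_depth atoms."""
    paths = set()
    def walk(node, path, visited):
        paths.add("-".join(labels[n] for n in path))
        if len(path) == max_depth:
            return
        for nxt in graph[node]:
            if nxt not in visited:
                walk(nxt, path + [nxt], visited | {nxt})
    for start in graph:
        walk(start, [start], {start})
    return paths

def fingerprint(paths, n_bits=1024):
    fp = [0] * n_bits
    for p in paths:
        h = int.from_bytes(blake2b(p.encode(), digest_size=4).digest(), "big")
        fp[h % n_bits] = 1              # hash each path into one bit
    return fp

# Ethanol heavy atoms as an adjacency list: C-C-O
graph = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(sum(fingerprint(dfs_paths(graph, labels, max_depth=3))))
```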
ABSTRACT
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine performs excellently on high-dimensional sparse feature vectors. An additional advantage is that prediction time is, on average, linear in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmark to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach; LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison with literature results showed that the LIBLINEAR performance is competitive, although without matching the top-ranked nonlinear machines on these benchmarks. However, considering its overall convincing performance and computation time, the large-scale support vector machine provides an excellent alternative to established large-scale classification approaches.
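A quick way to reproduce the general setup is scikit-learn's LinearSVC, which is backed by LIBLINEAR, trained on an unbalanced toy problem and scored with ROC AUC, as sketched below. The chemotype clustering, BEDROC score, and leave-cluster-out validation from the study are not reproduced here.

```python
# Sketch: scikit-learn's LinearSVC (LIBLINEAR backend) on an unbalanced
# toy screening problem, scored with ROC AUC. The study's chemotype-aware
# splits and BEDROC scoring are not shown.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

# heavily unbalanced toy data: ~2% "actives"
X, y = make_classification(n_samples=20000, n_features=500,
                           weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counteracts the skewed active/inactive ratio
clf = LinearSVC(C=1.0, class_weight="balanced").fit(X_tr, y_tr)
print("test AUC:", round(roc_auc_score(y_te, clf.decision_function(X_te)), 3))
```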