1.
Nucleic Acids Res ; 47(W1): W605-W609, 2019 Jul 02.
Article in English | MEDLINE | ID: mdl-31114892

ABSTRACT

Increasingly affordable high-throughput techniques for measuring molecular features of biomedical samples have led to a huge increase in the availability and size of different types of multi-omic datasets, containing, for example, genetic or histone modification data. Due to the multi-view characteristic of the data, established approaches for exploratory analysis are not directly applicable. Here we present web-rMKL, a web server that provides an integrative dimensionality reduction with subsequent clustering of samples based on data from multiple inputs. The underlying machine learning method rMKL-LPP performed best for clinical enrichment in a recent benchmark of state-of-the-art multi-view clustering algorithms. The method was introduced for a multi-omic cancer subtype discovery setting; however, it is not limited to this application scenario, as exemplified by a presented use case for stem cell differentiation. web-rMKL offers an intuitive interface for uploading data and setting the parameters. rMKL-LPP runs on the back end and the user may receive notifications once the results are available. We also introduce a preprocessing tool for generating kernel matrices from tables containing numerical feature values. This program can be used to generate admissible input if no precomputed kernel matrices are available. The web server is freely available at web-rMKL.org.
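
The preprocessing step mentioned in the abstract (turning a table of numerical feature values into a kernel matrix) can be illustrated with a minimal sketch. This is not the web-rMKL preprocessing tool itself: the choice of a Gaussian (RBF) kernel, the function name `gaussian_kernel_matrix`, the `gamma` parameter, and the toy table are all assumptions made for illustration.

```python
import math


def gaussian_kernel_matrix(rows, gamma):
    """Return the n x n matrix K with K[i][j] = exp(-gamma * ||x_i - x_j||^2).

    `rows` is a list of samples, each a list of numerical feature values.
    The Gaussian kernel is an assumed choice, not taken from the abstract.
    """
    n = len(rows)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between sample i and sample j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            value = math.exp(-gamma * sq_dist)
            # The kernel matrix is symmetric, so fill both halves at once.
            kernel[i][j] = value
            kernel[j][i] = value
    return kernel


# Hypothetical toy table: three samples with two numerical features each.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gaussian_kernel_matrix(samples, gamma=0.5)
```

Each data type (for example, expression or methylation) would yield its own matrix of this form, with one row and column per sample.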


Subject(s)
Machine Learning , Software , Cell Differentiation , Cluster Analysis , DNA Methylation , Gene Expression Profiling , Genomics , Humans , Internet , MicroRNAs/metabolism , Squamous Cell Carcinoma of Head and Neck/mortality , Stem Cells/cytology , Survival Analysis
2.
Bioinformatics ; 31(12): i268-75, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072491

ABSTRACT

MOTIVATION: Despite ongoing cancer research, available therapies are still limited in quantity and effectiveness, and making treatment decisions for individual patients remains a hard problem. Established subtypes, which help guide these decisions, are mainly based on individual data types. However, the analysis of multidimensional patient data involving the measurements of various molecular features could reveal intrinsic characteristics of the tumor. Large-scale projects accumulate this kind of data for various cancer types, but we still lack the computational methods to reliably integrate this information in a meaningful manner. Therefore, we apply and extend current multiple kernel learning for dimensionality reduction approaches. On the one hand, we add a regularization term to avoid overfitting during the optimization procedure, and on the other hand, we show that one can even use several kernels per data type and thereby relieve the user of having to choose the best kernel functions and kernel parameters for each data type beforehand. RESULTS: We have identified biologically meaningful subgroups for five different cancer types. Survival analysis has revealed significant differences between the survival times of the identified subtypes, with P values comparable to or even better than those of state-of-the-art methods. Moreover, our resulting subtypes reflect combined patterns from the different data sources, and we demonstrate that input kernel matrices with only little information have less impact on the integrated kernel matrix. Our subtypes show different responses to specific therapies, which could eventually assist in treatment decision making. AVAILABILITY AND IMPLEMENTATION: An executable is available upon request.
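
The "integrated kernel matrix" referred to in the abstract can be pictured with a small sketch. It assumes the standard multiple kernel learning formulation, in which the integrated kernel is a non-negative weighted sum of the input kernels. The weights below are fixed by hand purely for illustration; the method in the abstract learns them through a regularized optimization that is not reproduced here. The function name `combine_kernels` and the toy matrices are hypothetical.

```python
def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized kernel matrices.

    Weights must be non-negative; they are rescaled to sum to 1.
    This is an illustrative combination step only, not the learning
    procedure described in the abstract.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")

    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        share = weight / total  # normalised contribution of this kernel
        for i in range(n):
            for j in range(n):
                combined[i][j] += share * kernel[i][j]
    return combined


# Hypothetical 2 x 2 kernels for two data types measured on the same samples.
k_expression = [[1.0, 0.2], [0.2, 1.0]]
k_methylation = [[1.0, 0.8], [0.8, 1.0]]
integrated = combine_kernels([k_expression, k_methylation], [0.75, 0.25])
```

Under this formulation, a kernel that receives a small weight contributes little to the combined matrix, which matches the abstract's remark that uninformative input kernels have less impact on the integrated kernel.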


Subject(s)
Machine Learning , Neoplasms/classification , Algorithms , Cluster Analysis , Humans , Neoplasms/mortality , Neoplasms/therapy , Survival Analysis
3.
Nucleic Acids Res ; 42(9): e78, 2014 May.
Article in English | MEDLINE | ID: mdl-24682815

ABSTRACT

The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic, 'Bi-Force', which is based on the weighted bicluster editing model and performs biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. Bi-Force thus outperformed existing tools, at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.


Subject(s)
Gene Expression Profiling , Algorithms , Animals , Cluster Analysis , Computer Simulation , Databases, Genetic , Gene Ontology , Humans , Models, Genetic , Oligonucleotide Array Sequence Analysis , Principal Component Analysis , Software
4.
Sci Total Environ ; 802: 149798, 2022 Jan 01.
Article in English | MEDLINE | ID: mdl-34454142

ABSTRACT

Rapid changes in microbial water quality in surface waters pose challenges for production of safe drinking water. If not treated to an acceptable level, microbial pathogens present in the drinking water can result in severe consequences for public health. The aim of this paper was to evaluate the suitability of data-driven models of different complexity for predicting the concentrations of E. coli in the river Göta älv at the water intake of the drinking water treatment plant in Gothenburg, Sweden. The objectives were to (i) assess how the complexity of the model affects the model performance; and (ii) identify relevant factors and assess their effect as predictors of E. coli levels. To forecast E. coli levels one day ahead, the data on laboratory measurements of E. coli and total coliforms, Colifast measurements of E. coli, water temperature, turbidity, precipitation, and water flow were used. The baseline approaches included Exponential Smoothing and ARIMA (Autoregressive Integrated Moving Average), which are commonly used univariate methods, and a naive baseline that used the previous observed value as its next prediction. Also, models common in the machine learning domain were included: LASSO (Least Absolute Shrinkage and Selection Operator) Regression and Random Forest, and a tool for optimising machine learning pipelines - TPOT (Tree-based Pipeline Optimization Tool). Also, a multivariate autoregressive model VAR (Vector Autoregression) was included. The models that included multiple predictors performed better than univariate models. Random Forest and TPOT resulted in higher performance but showed a tendency to overfit. Water temperature, microbial concentrations upstream and at the water intake, and precipitation upstream were shown to be important predictors. Data-driven modelling enables water producers to interpret the measurements in the context of what concentrations can be expected based on the recent historic data, and thus identify unexplained deviations warranting further investigation of their origin.
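
Two of the univariate baselines named in the abstract, the naive "previous value" forecast and exponential smoothing, can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: it uses simple exponential smoothing with a hand-picked `alpha`, the series is made up, and mean absolute error is an assumed evaluation metric. None of these details come from the study.

```python
def naive_forecast(series):
    """One-day-ahead forecasts for series[1:]: yesterday's value is today's guess."""
    return list(series[:-1])


def exponential_smoothing_forecast(series, alpha):
    """One-day-ahead forecasts for series[1:] using simple exponential smoothing.

    The level starts at the first observation and is updated after each
    new observation as alpha * observed + (1 - alpha) * level.
    """
    level = series[0]
    forecasts = []
    for observed in series[1:]:
        forecasts.append(level)  # forecast is made before seeing `observed`
        level = alpha * observed + (1 - alpha) * level
    return forecasts


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up daily counts with one sudden peak; not data from the study.
counts = [10.0, 12.0, 11.0, 30.0, 14.0]
actual = counts[1:]
naive_mae = mean_absolute_error(actual, naive_forecast(counts))
smoothed_mae = mean_absolute_error(
    actual, exponential_smoothing_forecast(counts, alpha=0.5)
)
```

Both baselines see only the target series itself, which is why they cannot anticipate a peak driven by an external factor such as upstream precipitation.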


Subject(s)
Drinking Water , Water Quality , Environmental Monitoring , Escherichia coli , Water Microbiology
5.
Nat Commun ; 13(1): 5099, 2022 Aug 30.
Article in English | MEDLINE | ID: mdl-36042233

ABSTRACT

Design of de novo synthetic regulatory DNA is a promising avenue to control gene expression in biotechnology and medicine. Using mutagenesis typically requires screening sizable random DNA libraries, which limits the designs to span merely a short section of the promoter and restricts their control of gene expression. Here, we prototype a deep learning strategy based on generative adversarial networks (GAN) by learning directly from genomic and transcriptomic data. Our ExpressionGAN can traverse the entire regulatory sequence-expression landscape in a gene-specific manner, generating regulatory DNA with prespecified target mRNA levels spanning the whole gene regulatory structure including coding and adjacent non-coding regions. Despite high sequence divergence from natural DNA, in vivo measurements show that 57% of the highly-expressed synthetic sequences surpass the expression levels of highly-expressed natural controls. This demonstrates the applicability and relevance of deep generative design to expand our knowledge and control of gene expression regulation in any desired organism, condition or tissue.


Subject(s)
Genome , Genomics , DNA/genetics , Gene Expression , Gene Expression Regulation
6.
J Integr Bioinform ; 14(2), 2017 Jul 08.
Article in English | MEDLINE | ID: mdl-28688226

ABSTRACT

Personalized treatment of patients based on tissue-specific cancer subtypes has strongly increased the efficacy of the chosen therapies. Even though the amount of data measured for cancer patients has increased in recent years, most cancer subtypes are still diagnosed based on individual data sources (e.g. gene expression data). We propose an unsupervised data integration method based on kernel principal component analysis. Principal component analysis is one of the most widely used techniques in data analysis. Unfortunately, the straightforward multiple kernel extension of this method leads to the use of only one of the input matrices, which does not fit the goal of gaining information from all data sources. Therefore, we present a scoring function to determine the impact of each input matrix. The approach enables visualizing the integrated data and subsequent clustering for cancer subtype identification. Due to the nature of the method, no hyperparameters have to be set. We apply the methodology to five different cancer data sets and demonstrate its advantages in terms of results and usability.
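
Kernel principal component analysis on a single kernel matrix, the building block the abstract starts from, can be sketched as follows. The sketch centers the kernel in feature space and extracts the leading eigenpair by power iteration. It does not reproduce the scoring function for weighting several input matrices that the abstract proposes; the function names, the toy matrix, and the use of power iteration are assumptions for illustration.

```python
import math


def center_kernel(kernel):
    """Center a symmetric kernel matrix in feature space (double centering)."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    total_mean = sum(row_means) / n
    return [
        [kernel[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def multiply(matrix, vector):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_component(kernel, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration."""
    n = len(kernel)
    # Deterministic, non-constant start vector (constant vectors are
    # annihilated by a centered kernel), normalised to unit length.
    vector = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in vector))
    vector = [v / norm for v in vector]
    for _ in range(iterations):
        product = multiply(kernel, vector)
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0.0:
            break
        vector = [v / norm for v in product]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(kernel, vector)))
    return eigenvalue, vector


# Hypothetical kernel: samples 0 and 1 are similar, sample 2 is not.
toy_kernel = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
centered = center_kernel(toy_kernel)
eigenvalue, direction = leading_component(centered)
# Coordinates of each sample along the first kernel principal component.
projection = [math.sqrt(eigenvalue) * d for d in direction]
```

Plotting or clustering the samples by such projections is the kind of visualization and subtype identification the abstract refers to, here shown for one input matrix only.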


Subject(s)
Neoplasms/classification , Principal Component Analysis , Cluster Analysis , Humans , Machine Learning , Neoplasms/diagnosis , Neoplasms/genetics , Precision Medicine/methods , Survival Analysis