ABSTRACT
Recent magnetoencephalography (MEG) studies have reported that functional connectivity (FC) and power spectra can be used as neural fingerprints for differentiating individuals. Such studies have mainly used correlations between measurement sessions to distinguish individuals from each other. However, it has remained unclear whether such correlations reflect a more generalizable principle of individually distinctive brain patterns. Here, we evaluated a machine-learning-based approach, termed latent-noise Bayesian reduced-rank regression (BRRR), as a means of modelling individual differences in the resting-state MEG data of the Human Connectome Project (HCP), using FC and power spectra as neural features. First, we verified that BRRR could model and reproduce the individual differences that correlation-based fingerprinting yields. We trained BRRR models to distinguish individuals based on data from one measurement session and used the models to identify subsequent measurement sessions of those same individuals. The best-performing BRRR models, using only 20 spatiospectral components, identified subjects across measurement sessions with over 90% accuracy, approaching the highest correlation-based accuracies. Using cross-validation, we then determined whether the BRRR approach could generalize to unseen subjects, successfully classifying the measurement sessions of novel individuals with over 80% accuracy. The results demonstrate that individual neurofunctional differences can be reliably extracted from MEG data with a low-dimensional predictive model and that the model can classify novel subjects.
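For readers unfamiliar with reduced-rank regression, the classical non-Bayesian variant can be sketched in a few lines. This is only an illustrative analogue of the latent-noise BRRR model described above; the simulated data, dimensions and rank below are arbitrary assumptions, not the HCP setup:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, q, r = 200, 30, 10, 3   # samples, input features, outputs, target rank

# Simulated data with a low-rank ground-truth coefficient matrix
U = rng.normal(size=(p, r))
V = rng.normal(size=(q, r))
B_true = U @ V.T
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

# Classical reduced-rank regression: fit ordinary least squares, then project
# the fit onto the top-r right singular subspace of the fitted values
B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
_, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
P = Vt[:r].T @ Vt[:r]          # rank-r projector in outcome space
B_rr = B_ols @ P               # rank-constrained coefficient estimate
```

Projecting the OLS fit onto the leading singular subspace of the fitted values yields the classical rank-constrained point estimate; BRRR instead places a Bayesian posterior over such low-rank components.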
Subject(s)
Bayes Theorem, Brain, Connectome, Magnetoencephalography, Humans, Magnetoencephalography/methods, Connectome/methods, Brain/physiology, Machine Learning, Male, Female, Adult, Neurological Models
ABSTRACT
MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating polygenic risk score methods are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through a comparison of seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open-source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
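As a toy illustration of the simulation problem HAPNEST addresses, the sketch below draws independent variants from allele frequencies and builds an additive phenotype with a target heritability. All parameters are invented, and it deliberately ignores linkage disequilibrium, which HAPNEST preserves by resampling haplotype segments from reference panels:

```python
import numpy as np

rng = np.random.default_rng(1)

n_ind, n_snp, h2 = 1000, 500, 0.5   # individuals, variants, heritability (toy)

# Independent variants drawn from per-SNP allele frequencies; real simulators
# must also model correlations (LD) between nearby variants
freqs = rng.uniform(0.05, 0.5, size=n_snp)
G = rng.binomial(2, freqs, size=(n_ind, n_snp)).astype(float)

# Additive polygenic phenotype with genetic variance scaled to h2
beta = rng.normal(size=n_snp)
g = (G - G.mean(axis=0)) @ beta
g *= np.sqrt(h2 / g.var())          # scale genetic component to variance h2
y = g + rng.normal(scale=np.sqrt(1.0 - h2), size=n_ind)
```

Varying `h2` and the number of causal variants is the simplest way to mimic the "varying degrees of heritability and polygenicity" mentioned above.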
Subject(s)
Benchmarking, Data Accuracy, Humans, Genotype, Phenotype, Multifactorial Inheritance
ABSTRACT
BACKGROUND: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and the parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy-preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. METHODS: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis to the combined synthetic data sets and evaluate the effects of 1) the size of the local data set, 2) the number of participating parties, and 3) local shifts in distributions on the obtained likelihood scores. RESULTS: We discover that parties engaging in collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters than they would using only their local data. This finding extends to the difficult case of small, heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become, up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analyses for those groups. CONCLUSIONS: Based on our results, we conclude that sharing synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, and our study shows that it can be alleviated with privacy-preserving collaborative learning methods.
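The local-versus-pooled comparison can be mimicked with a minimal sketch: several simulated parties fit a Poisson regression locally and on the pooled data. Here raw simulated draws stand in for the shared synthetic data sets (the differentially private generation step of the study is omitted), and the cohort sizes and coefficients are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_poisson(X, y, iters=25):
    """Newton-Raphson for Poisson regression with a log link."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ w)
        grad = X.T @ (y - mu)
        hess = X.T @ (mu[:, None] * X)
        w += np.linalg.solve(hess, grad)
    return w

# Three simulated "parties", each with a small local cohort from one model
w_true = np.array([0.2, 0.5, -0.3])
parties = []
for _ in range(3):
    X = np.column_stack([np.ones(150), rng.normal(size=(150, 2))])
    y = rng.poisson(np.exp(X @ w_true)).astype(float)
    parties.append((X, y))

# Pooling across parties gives estimates closer to the truth than a single
# party's local fit, mirroring the collaborative-learning result
X_all = np.vstack([X for X, _ in parties])
y_all = np.concatenate([y for _, y in parties])
w_local = fit_poisson(*parties[0])
w_pooled = fit_poisson(X_all, y_all)
```

The study's point is that this pooling benefit survives even when each party contributes a privacy-preserving synthetic stand-in rather than its raw data.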
Subject(s)
Information Dissemination, Humans, United Kingdom, Cooperative Behavior, Confidentiality/standards, Privacy, Biological Specimen Banks, Prospective Studies
ABSTRACT
Drug combination therapy is a promising strategy for treating complex diseases such as cancer and infectious diseases. However, current knowledge of drug combination therapies, especially in cancer patients, is limited because of adverse drug effects, toxicity and cell line heterogeneity. Screening new drug combinations requires substantial effort, since testing all possible combinations of drugs is infeasible and expensive. Computational approaches, particularly machine learning methods, could therefore provide an effective strategy for overcoming drug resistance and improving therapeutic efficacy. In this review, we group the state-of-the-art machine learning approaches for analyzing personalized drug combination therapies into three categories and discuss each method in each category. We also give a short description of the relevant databases used as benchmarks in drug combination therapy studies and provide a list of well-known, publicly available interactive data analysis portals. We highlight the importance of data integration for the identification of drug combinations. Finally, we address the advantages of combining multiple data sources for drug combination analysis by presenting an experimental comparison.
Subject(s)
Machine Learning, Antineoplastic Combined Chemotherapy Protocols/administration & dosage, Computational Biology/methods, Humans, Neoplasms/drug therapy, Precision Medicine
ABSTRACT
Predicting the response of cancer cell lines to specific drugs is one of the central problems in personalized medicine, where the cell lines show diverse characteristics. Researchers have developed a variety of computational methods to discover associations between drugs and cell lines, and have improved drug sensitivity analyses by integrating heterogeneous biological data. However, choosing informative data sources and methods that can incorporate multiple sources efficiently is the challenging part of a successful analysis in personalized medicine, because it is difficult both to identify the decisive factors of cancer and to develop methods that can overcome the problems of integrating data, such as differences in data structures and data complexity. In this review, we summarize recent advances in data integration-based machine learning for drug response prediction, categorizing methods as matrix factorization-based, kernel-based and network-based. We also give a short description of the relevant databases used as benchmarks in drug response prediction analyses, followed by a brief discussion of the challenges faced in integrating and interpreting data from multiple sources. Finally, we address the advantages of combining multiple heterogeneous data sources for drug sensitivity analysis by presenting an experimental comparison. CONTACT: betul.guvenc@aalto.fi.
Subject(s)
Antineoplastic Drug Resistance, Genomics/methods, Precision Medicine/methods, Humans, Machine Learning, Pharmacogenomic Variants
ABSTRACT
Drug-induced liver injury (DILI) is an important safety concern and a major reason for removing a drug from the market. Advances in machine learning have led to a wide range of in silico DILI prediction models based on molecular chemical structures (fingerprints). Existing publicly available DILI data sets used for model building are based on the interpretation of drug labels or patient case reports, resulting in a typically binary clinical DILI annotation. We developed a novel phenotype-based annotation to process hepatotoxicity information extracted from repeated-dose in vivo preclinical toxicology studies, using INHAND annotation, to provide a more informative and reliable data set for machine learning algorithms. This work resulted in a data set of 430 unique compounds covering diverse liver pathology findings, which we used to develop multiple DILI prediction models trained on the publicly available data (TG-GATEs) using the compounds' fingerprints. We demonstrate that DILI labels for the TG-GATEs compounds can be predicted well, and show how the differences between TG-GATEs and the external test compounds (Johnson & Johnson) affect the model's generalization performance.
Subject(s)
Chemical and Drug Induced Liver Injury, Drug-Related Side Effects and Adverse Reactions, Humans, Algorithms, Machine Learning, Computer Simulation
ABSTRACT
BACKGROUND: A deep understanding of carcinogenesis at the DNA level underpins many advances in cancer prevention and treatment. Mutational signatures provide a breakthrough conceptualisation, as well as an analysis framework, that can be used to build such understanding. They capture somatic mutation patterns and, at best, identify their causes. Most studies in this context have focused on an inherently additive analysis, e.g. by non-negative matrix factorization, where the mutations within a cancer sample are explained by a linear combination of independent mutational signatures. However, other recent studies show that mutational signatures exhibit non-additive interactions. RESULTS: We carefully analysed such additive model fits from the PCAWG study cataloguing mutational signatures and their activities across thousands of cancers. Our analysis identified a systematic and non-random structure of residuals that is left unexplained by the additive model. We used hierarchical clustering to identify cancer subsets with similar residual profiles and show that both systematic overestimation and underestimation of mutation counts take place. We propose extending the additive mutational signature model with multiplicatively acting modulatory processes and develop a maximum-likelihood framework to identify such modulatory mutational signatures. The augmented model is expressive enough to almost fully remove the observed systematic residual patterns. CONCLUSION: We suggest that the modulatory processes relate biologically to sample-specific DNA repair propensities with cancer- or tissue-type-specific profiles. Overall, our results identify an interesting direction in which to expand signature analysis.
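The additive baseline that this analysis starts from is non-negative matrix factorization of a mutation-count catalogue. A minimal sketch on an invented toy catalogue, using the classical Lee-Seung multiplicative updates rather than the PCAWG pipeline, shows where the residual matrix that motivates the modulatory extension comes from:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy catalogue: 100 samples x 96 trinucleotide contexts, generated from
# 4 additive signatures with Poisson noise (invented numbers, not PCAWG data)
n, m, k = 100, 96, 4
S_true = rng.gamma(1.0, size=(k, m))
S_true /= S_true.sum(axis=1, keepdims=True)     # each signature sums to 1
E_true = rng.gamma(1.0, 200.0, size=(n, k))     # per-sample exposures
V = rng.poisson(E_true @ S_true).astype(float)  # observed mutation counts

# Lee-Seung multiplicative updates for the Frobenius-norm NMF objective
W = rng.uniform(0.1, 1.0, size=(n, k))          # fitted exposures
H = rng.uniform(0.1, 1.0, size=(k, m))          # fitted signatures
for _ in range(300):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# The residual matrix left unexplained by the additive model; the paper's
# analysis starts from structure found in exactly such residuals
R = V - W @ H
rel_err = np.linalg.norm(R) / np.linalg.norm(V)
```

In this toy the data really are additive, so the residuals are just noise; the paper's observation is that on real catalogues these residuals show systematic, clusterable structure.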
Subject(s)
Neoplasms, Humans, Mutation, Neoplasms/genetics
ABSTRACT
MOTIVATION: Interaction between the genotype and the environment (G×E) has a strong impact on the yield of major crop plants. Although influential, taking G×E explicitly into account in plant breeding has remained difficult. Recently, G×E has been predicted from environmental and genomic covariates, but existing works have not shown that generalization to new environments and years without access to in-season data is possible, and practical applicability has remained unclear. Using data from a barley breeding programme in Finland, we construct an in silico experiment to study the viability of G×E prediction under practical constraints. RESULTS: We show that the response to the environment of a new generation of untested barley cultivars can be predicted in new locations and years using genomic data, machine learning and historical weather observations for the new locations. Our results highlight the need for models of G×E: non-linear effects clearly dominate linear ones, and the interaction between soil type and daily rain is identified as the main driver of G×E for barley in Finland. Our study implies that genomic selection can be used to capture the yield potential in G×E effects for future growing seasons, providing a possible means to achieve the yield improvements needed for feeding the growing population. AVAILABILITY AND IMPLEMENTATION: The data, accompanied by the method code (http://research.cs.aalto.fi/pml/software/gxe/bioinformatics_codes.zip), are available in the form of kernels to allow reproduction of the results. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genomics, Genetic Models, Gene-Environment Interaction, Genotype, Phenotype, Weather
ABSTRACT
MOTIVATION: Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymization strategies fail to provide a sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) must also avoid leaking private information. RESULTS: We study an approach that uses a large public dataset of a similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification and drug sensitivity prediction. The experiments demonstrate a significant benefit from all representation learning methods, with variational autoencoders providing the most accurate predictions most often. Our results significantly improve on the previous state of the art in the accuracy of differentially private drug sensitivity prediction. AVAILABILITY AND IMPLEMENTATION: Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer.
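Of the three representation learning methods compared, random projection is the simplest to illustrate: the mapping is generated without looking at any data, so the mapping itself consumes no privacy budget. A minimal sketch with arbitrary dimensions (not the paper's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(4)

d, k, n = 5000, 300, 50   # input genes, target dimension, samples (invented)

X = rng.normal(size=(n, d))        # stand-in for private expression profiles

# Gaussian random projection: drawn independently of the data, so publishing
# or reusing it costs no privacy budget (unlike PCA or a VAE fitted to the
# private data, which would themselves need to be made differentially private)
P = rng.normal(size=(d, k)) / np.sqrt(k)
Z = X @ P

# Johnson-Lindenstrauss: pairwise distances are approximately preserved
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Z[0] - Z[1])
ratio = proj / orig
```

The downstream private learner then operates on `Z` instead of `X`, paying its privacy cost only in the much lower dimension `k`.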
Subject(s)
Machine Learning, Humans, Neoplasms
ABSTRACT
MOTIVATION: Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including, among others, lack of parsimony, non-convexity and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that does not suffer from these drawbacks. RESULTS: We compare block HSIC Lasso to other state-of-the-art feature selection techniques on both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing and genome-wide association studies. In all cases, we observe that the features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes previously linked to brain development and function are involved in the biological differences between the types of neurons. AVAILABILITY AND IMPLEMENTATION: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
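The dependence measure underlying HSIC Lasso can be sketched directly: the empirical Hilbert-Schmidt independence criterion between a candidate feature and the outcome. This toy example (the dependence measure only, not the block-wise Lasso optimization itself, and with invented data) shows a non-linearly relevant feature scoring higher than an irrelevant one:

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_kernel(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC with Gaussian kernels: the kernel dependence
    measure that HSIC Lasso uses between features and the outcome."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = gaussian_kernel(x), gaussian_kernel(y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

n = 200
y = rng.normal(size=n)
x_rel = np.tanh(2 * y) + 0.1 * rng.normal(size=n)   # non-linearly relevant
x_irr = rng.normal(size=n)                           # independent noise

score_rel = hsic(x_rel, y)   # large: strong non-linear dependence
score_irr = hsic(x_irr, y)   # near zero: independent of the outcome
```

HSIC Lasso then selects a sparse set of features whose kernels jointly reconstruct the outcome kernel, which is what yields the parsimony the abstract refers to.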
Subject(s)
Biomarkers, Genome-Wide Association Study, Software, Animals, Genome, Genomics, Mice
ABSTRACT
MOTIVATION: Metabolic flux balance analysis (FBA) is a standard tool for analyzing metabolic reaction rates compatible with measurements, steady state and the metabolic reaction network stoichiometry. Flux analysis methods commonly place model assumptions on fluxes for the convenience of formulating the problem as a linear programming model, and many methods do not consider the inherent uncertainty in flux estimates. RESULTS: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. 13C) flux measurements, steady-state assumptions and objective function assumptions. The Bayesian model couples all fluxes jointly in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as FBA. Our experiments indicate that we can characterize genome-scale flux covariances, reveal flux couplings, and determine more unobserved intracellular fluxes in Clostridium acetobutylicum from 13C data than flux variability analysis can. AVAILABILITY AND IMPLEMENTATION: The COBRA-compatible software is available at github.com/markusheinonen/bamfa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
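The idea of treating feasible fluxes as a probability distribution rather than a single optimum can be illustrated on a toy three-reaction network. The sketch below samples flux vectors from a Gaussian prior restricted to the steady-state null space and truncated to bounds; it is a rejection-sampling caricature, not the paper's truncated multivariate posterior machinery:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy network: A -> B (v1), B -> C (v2), B -> D (v3); steady state for the
# internal metabolite B requires v1 - v2 - v3 = 0
S = np.array([[1.0, -1.0, -1.0]])   # stoichiometric matrix (1 metabolite x 3 fluxes)
lb, ub = 0.0, 10.0                  # flux bounds
p = np.array([4.0, 2.0, 2.0])       # one feasible steady-state flux vector

# Orthonormal basis of the steady-state null space, null(S)
_, _, Vt = np.linalg.svd(S)
N = Vt[1:].T                        # shape (3, 2)

# Gaussian prior on the null-space coordinates, truncated to the flux bounds
samples = []
while len(samples) < 2000:
    v = p + N @ rng.normal(scale=4.0, size=2)
    if np.all(v >= lb) and np.all(v <= ub):
        samples.append(v)
samples = np.array(samples)

# Flux couplings appear as posterior correlations: here v1 is fully coupled
# to v2 + v3 by the steady-state constraint
corr = np.corrcoef(samples[:, 0], samples[:, 1] + samples[:, 2])[0, 1]
```

Where plain FBA would report one optimal vertex of this feasible set, the distributional view summarizes the whole set, which is what makes flux covariances and couplings visible.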
Subject(s)
Clostridium acetobutylicum, Metabolic Flux Analysis, Bayes Theorem, Metabolic Networks and Pathways, Biological Models
ABSTRACT
In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in its close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied in the third mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability, with concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as its non-phosphorylated equivalent (C3). The Ec DERA variants were also tested in the aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the variants with improved acetaldehyde addition turned out to also have improved activity on formaldehyde. KEY POINTS: • DERA aldolases are promiscuous enzymes. • The synthetic utility of DERA aldolase was improved by protein engineering approaches. • Machine learning methods aid the protein engineering of DERA.
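A generic sketch of the Gaussian-process step: GP regression over binary mutated-or-not encodings of candidate positions, predicting the activity of untested mutant combinations. The positions, kernel, and activity landscape below are invented for illustration, not the DERA measurements or the paper's feature-learning model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 40 variants over 6 candidate positions, encoded as
# binary mutated-or-not vectors, with an invented activity landscape that
# contains a pairwise interaction (NOT the actual DERA data)
X = rng.integers(0, 2, size=(40, 6)).astype(float)
w = np.array([1.0, -0.5, 0.8, 0.0, 0.3, -0.2])
y = X @ w + 0.4 * X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=40)

def rbf(A, B, ell=1.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * ell ** 2))

# Standard GP regression: posterior mean and variance for unseen combinations
X_new = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
K = rbf(X, X) + 0.01 * np.eye(len(X))       # kernel matrix + noise jitter
alpha = np.linalg.solve(K, y)
mean = rbf(X_new, X) @ alpha                # predicted activities
var = rbf(X_new, X_new).diagonal() - np.sum(
    rbf(X_new, X) * np.linalg.solve(K, rbf(X, X_new)).T, axis=1)
```

The posterior variance is what makes GPs attractive for guiding mutagenesis rounds: candidates can be ranked by predicted activity while accounting for how uncertain each prediction is.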
Subject(s)
Escherichia coli, Fructose-Bisphosphate Aldolase, Aldehyde-Lyases/genetics, Aldehyde-Lyases/metabolism, Escherichia coli/genetics, Escherichia coli/metabolism, Fructose-Bisphosphate Aldolase/genetics, Machine Learning, Protein Engineering, Substrate Specificity
ABSTRACT
Brain structure and many brain functions are known to be genetically controlled, but direct links between neuroimaging measures and their underlying cellular-level determinants remain largely undiscovered. Here, we adopt a novel computational method for examining potential similarities in high-dimensional brain imaging data between siblings. We examine oscillatory brain activity measured with magnetoencephalography (MEG) in 201 healthy siblings and apply Bayesian reduced-rank regression to extract a low-dimensional representation of familial features in the participants' spectral power structure. Our results show that the structure of the overall spectral power at 1-90 Hz is a highly conspicuous feature that not only relates siblings to each other but also has very high consistency within participants' own data, irrespective of the exact experimental state of the participant. The analysis is extended by seeking genetic associations for low-dimensional descriptions of the oscillatory brain activity. The observed variability in the MEG spectral power structure was associated with SDK1 (sidekick cell adhesion molecule 1) and suggestively with several other genes that function, for example, in brain development. The current results highlight the potential of sophisticated computational methods in combining molecular and neuroimaging levels for exploring brain functions, even for high-dimensional data limited to a few hundred participants.
Subject(s)
Brain Mapping/methods, Magnetoencephalography/statistics & numerical data, Adult, Algorithms, Bayes Theorem, Brain/growth & development, Cell Adhesion Molecules/genetics, Family, Female, Genome-Wide Association Study, Genotype, Humans, Computer-Assisted Image Processing, Magnetic Resonance Imaging, Male, Neurological Models, Neuroimaging/methods, Neuroimaging/statistics & numerical data, Single Nucleotide Polymorphism/genetics
ABSTRACT
MOTIVATION: Precision medicine requires the ability to predict the efficacies of different treatments for a given individual using high-dimensional genomic measurements. However, identifying predictive features remains a challenge when the sample size is small. Incorporating expert knowledge offers a promising approach to improve predictions, but collecting such knowledge is laborious if the number of candidate features is very large. RESULTS: We introduce a probabilistic framework to incorporate expert feedback about the impact of genomic measurements on the outcome of interest and present a novel approach to collect the feedback efficiently, based on Bayesian experimental design. The new approach outperformed other recent alternatives in two medical applications: prediction of metabolic traits and prediction of sensitivity of cancer cells to different drugs, both using genomic features as predictors. Furthermore, the intelligent approach to collect feedback reduced the workload of the expert to approximately 11%, compared to a baseline approach. AVAILABILITY AND IMPLEMENTATION: Source code implementing the introduced computational methods is freely available at https://github.com/AaltoPML/knowledge-elicitation-for-precision-medicine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genomics/methods, Precision Medicine/methods, Software, Bayes Theorem, Humans, DNA Sequence Analysis/methods
ABSTRACT
Bayesian inference plays an important role in phylogenetics, evolutionary biology, and in many other branches of science. It provides a principled framework for dealing with uncertainty and quantifying how it changes in the light of new evidence. For many complex models and inference problems, however, only approximate quantitative answers are obtainable. Approximate Bayesian computation (ABC) refers to a family of algorithms for approximate inference that makes a minimal set of assumptions by only requiring that sampling from a model is possible. We explain here the fundamentals of ABC, review the classical algorithms, and highlight recent developments. [ABC; approximate Bayesian computation; Bayesian inference; likelihood-free inference; phylogenetics; simulator-based models; stochastic simulation models; tree-based models.]
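The simplest member of the ABC family, rejection ABC, fits in a few lines: draw parameters from the prior, simulate data, and keep the parameters whose simulated summary statistic lands within a tolerance of the observed one. A minimal sketch for the mean of a Gaussian simulator (a textbook toy, not any specific model from the review):

```python
import numpy as np

rng = np.random.default_rng(8)

# "Observed" data from a Gaussian simulator with unknown mean, known scale
obs = rng.normal(3.0, 1.0, size=100)
s_obs = obs.mean()                 # summary statistic

def simulate(theta, n=100):
    # Forward simulation is the ONLY model access ABC needs;
    # the likelihood is never evaluated
    return rng.normal(theta, 1.0, size=n)

# Rejection ABC: accept prior draws whose simulated summary is close enough
eps, accepted = 0.2, []
while len(accepted) < 500:
    theta = rng.uniform(-10.0, 10.0)          # flat prior over the mean
    if abs(simulate(theta).mean() - s_obs) < eps:
        accepted.append(theta)

posterior = np.array(accepted)    # approximate posterior sample for the mean
```

Shrinking `eps` tightens the approximation at the cost of more rejected simulations, which is the trade-off the more sophisticated ABC algorithms in the review are designed to manage.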
Subject(s)
Classification, Biological Models, Phylogeny, Algorithms, Bayes Theorem
ABSTRACT
BACKGROUND: Dispersed biomedical databases limit users' ability to explore and generate structured knowledge. Linked Data unifies data structures and makes the dispersed data easy to search across resources, but it offers little support for human cognition in achieving insights. In addition, potential errors in the data are difficult to detect in their free formats. Devising a visualization that synthesizes multiple sources in such a way that links between data sources are transparent, and uncertainties, such as data conflicts, are salient, is challenging. RESULTS: To investigate the requirements and challenges of uncertainty-aware visualizations of linked data, we developed MediSyn, a system that synthesizes medical datasets to support drug treatment selection. It uses a matrix-based layout to visually link drugs, targets (e.g., mutations) and tumor types. Data uncertainties are salient in MediSyn; for example, (i) missing data are exposed in the matrix view of drug-target relations; (ii) inconsistencies between datasets are shown via overlaid layers; and (iii) data credibility is conveyed through links to data provenance. CONCLUSIONS: Through the synthesis of two manually curated datasets, cancer treatment biomarkers and drug-target bioactivities, a use case shows how MediSyn effectively supports the discovery of drug-repurposing opportunities. A study with six domain experts indicated that MediSyn benefited drug selection and the discovery of data inconsistencies. Although linked publication sources supported user exploration for further information, the causes of inconsistencies were not easy to find. Additionally, MediSyn could incorporate more patient data to increase its informativeness. We derive design implications from these findings.
Subject(s)
Factual Databases, Drug Therapy, Software, Uncertainty, Adult, Female, Humans, Surveys and Questionnaires
ABSTRACT
MOTIVATION: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case versus control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. RESULTS: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Empirical results finally suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. AVAILABILITY AND IMPLEMENTATION: The method can be implemented using standard clustering algorithms and normalized information distance, available in many statistical software packages. CONTACT: paul.blomstedt@aalto.fi or samuel.kaski@aalto.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
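The suggested retrieval metric, the normalized information distance between clusterings, can be computed directly from cluster label vectors. A minimal sketch using one common variant of the definition, 1 - I(A;B) / max(H(A), H(B)), on hand-made partitions:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def mutual_info(a, b):
    # Plug-in estimate of I(A;B) from the joint label distribution
    mi = 0.0
    for u in np.unique(a):
        for v in np.unique(b):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(a == u) * np.mean(b == v)))
    return mi

def nid(a, b):
    """Normalized information distance between two clusterings:
    0 for identical partitions (up to relabelling), 1 for independent ones."""
    h = max(entropy(a), entropy(b))
    return 1.0 - mutual_info(a, b) / h if h > 0 else 0.0

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])   # same partition as a, relabelled
c = np.array([0, 1, 0, 1, 0, 1])   # partition independent of a

d_same = nid(a, b)   # → 0.0
d_diff = nid(a, c)   # → 1.0
```

In the retrieval setting, each experiment contributes the gene clustering induced by its fitted model, and experiments are ranked by this distance to the query's clustering.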
Subject(s)
Gene Expression, Genetic Models, Algorithms, Cluster Analysis, Gene Expression Profiling
ABSTRACT
MOTIVATION: Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple data sources, extending the recent Group Factor Analysis method to have a biclustering interpretation with additional sparsity assumptions. The resulting method enables data-driven detection of linear structure present in parts of the data sources. RESULTS: Our simulation studies show that the proposed method reliably infers biclusters from heterogeneous data sources. We tested the method on data from the NCI-DREAM drug sensitivity prediction challenge, resulting in excellent prediction accuracy. Moreover, the predictions are based on several biclusters which provide insight into the data sources, in this case gene expression, DNA methylation, protein abundance, exome sequence, functional connectivity fingerprints and drug sensitivity. AVAILABILITY AND IMPLEMENTATION: http://research.cs.aalto.fi/pml/software/GFAsparse/ CONTACTS: kerstin.bunte@googlemail.com or samuel.kaski@aalto.fi.
Subject(s)
Algorithms, Cluster Analysis, Datasets as Topic, Gene Expression Profiling, Bayes Theorem, Factor Analysis, Information Storage and Retrieval, Oligonucleotide Array Sequence Analysis
ABSTRACT
MOTIVATION: A key goal of computational personalized medicine is to systematically utilize genomic and other molecular features of samples to predict drug responses for a previously unseen sample. Such predictions are valuable for developing hypotheses for selecting therapies tailored for individual patients. This is especially valuable in oncology, where molecular and genetic heterogeneity of the cells has a major impact on the response. However, the prediction task is extremely challenging, raising the need for methods that can effectively model and predict drug responses. RESULTS: In this study, we propose a novel formulation of multi-task matrix factorization that allows selective data integration for predicting drug responses. To solve the modeling task, we extend the state-of-the-art kernelized Bayesian matrix factorization (KBMF) method with component-wise multiple kernel learning. In addition, our approach exploits the known pathway information in a novel and biologically meaningful fashion to learn the drug response associations. Our method quantitatively outperforms the state of the art on predicting drug responses in two publicly available cancer datasets as well as on a synthetic dataset. In addition, we validated our model predictions with lab experiments using an in-house cancer cell line panel. We finally show the practical applicability of the proposed method by utilizing prior knowledge to infer pathway-drug response associations, opening up the opportunity for elucidating drug action mechanisms. We demonstrate that pathway-response associations can be learned by the proposed model for the well-known EGFR and MEK inhibitors. AVAILABILITY AND IMPLEMENTATION: The source code implementing the method is available at http://research.cs.aalto.fi/pml/software/cwkbmf/ CONTACTS: muhammad.ammad-ud-din@aalto.fi or samuel.kaski@aalto.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genomics, Neoplasms, Algorithms, Bayes Theorem, Drug Delivery Systems, Drug Discovery, Humans, Metabolic Networks and Pathways, Software
ABSTRACT
We hypothesize that brain activity can be used to control future information retrieval systems. To this end, we conducted a feasibility study on predicting the relevance of visual objects from brain activity. We analyze both magnetoencephalographic (MEG) and gaze signals from nine subjects who were viewing image collages, a subset of which was relevant to a predetermined task. We report three findings: i) the relevance of an image a subject looks at can be decoded from MEG signals with performance significantly better than chance, ii) fusion of gaze-based and MEG-based classifiers significantly improves the prediction performance compared to using either signal alone, and iii) non-linear classification of the MEG signals using Gaussian process classifiers outperforms linear classification. These findings break new ground for building brain-activity-based interactive image retrieval systems, as well as for systems utilizing feedback both from brain activity and eye movements.