An omics-based machine learning approach to predict diabetes progression: a RHAPSODY study.

Slieker, Roderick C; Münch, Magnus; Donnelly, Louise A; Bouland, Gerard A; Dragan, Iulian; Kuznetsov, Dmitry; Elders, Petra J M; Rutter, Guy A; Ibberson, Mark; Pearson, Ewan R; 't Hart, Leen M; van de Wiel, Mark A; Beulens, Joline W J.

Diabetologia ; 67(5): 885-894, 2024 May.

Article En | MEDLINE | ID: mdl-38374450

AIMS/HYPOTHESIS: People with type 2 diabetes are heterogeneous in their disease trajectory, with some progressing more quickly to insulin initiation than others. Although classical biomarkers such as age, HbA1c and diabetes duration are associated with glycaemic progression, it is unclear how well such variables predict insulin initiation or requirement and whether newly identified markers have added predictive value. METHODS: In two prospective cohort studies as part of IMI-RHAPSODY, we investigated whether clinical variables and three types of molecular markers (metabolites, lipids, proteins) can predict time to insulin requirement using different machine learning approaches (lasso, ridge, GRridge, random forest). Clinical variables included age, sex, HbA1c, HDL-cholesterol and C-peptide. Models were run with unpenalised clinical variables (i.e. always included in the model without weights) or penalised clinical variables, or without clinical variables. Model development was performed in one cohort and the model was applied in a second cohort. Model performance was evaluated using Harrell's C statistic. RESULTS: Of the 585 individuals from the Hoorn Diabetes Care System (DCS) cohort, 69 required insulin during follow-up (1.0-11.4 years); of the 571 individuals in the Genetics of Diabetes Audit and Research in Tayside Scotland (GoDARTS) cohort, 175 required insulin during follow-up (0.3-11.8 years). Overall, the clinical variables and proteins were selected in the different models most often, followed by the metabolites. The most frequently selected clinical variables were HbA1c (18 of the 36 models, 50%), age (15 models, 41.2%) and C-peptide (15 models, 41.2%). Base models (age, sex, BMI, HbA1c) including only clinical variables performed moderately in both the DCS discovery cohort (C statistic 0.71 [95% CI 0.64, 0.79]) and the GoDARTS replication cohort (C 0.71 [95% CI 0.69, 0.75]). A more extensive model including HDL-cholesterol and C-peptide performed better in both cohorts (DCS, C 0.74 [95% CI 0.67, 0.81]; GoDARTS, C 0.73 [95% CI 0.69, 0.77]). Two proteins, lactadherin and proto-oncogene tyrosine-protein kinase receptor, were most consistently selected and slightly improved model performance. CONCLUSIONS/INTERPRETATION: Using machine learning approaches, we show that insulin requirement risk can be modestly well predicted by predominantly clinical variables. Inclusion of molecular markers improves the prognostic performance beyond that of clinical variables by up to 5%. Such prognostic models could be useful for identifying people with diabetes at high risk of progressing quickly to treatment intensification. DATA AVAILABILITY: Summary statistics of lipidomic, proteomic and metabolomic data are available from a Shiny dashboard at https://rhapdata-app.vital-it.ch .
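
The abstract above reports model performance as Harrell's C statistic. As a rough illustration of what that number measures, the following minimal sketch computes it for right-censored time-to-event data. The function name `harrell_c` and the toy data are hypothetical and not taken from the study; tied observation times are simply skipped, which is a simplification compared with established survival-analysis packages.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times       -- observed follow-up time per individual
    events      -- 1 if the event (e.g. insulin requirement) occurred, 0 if censored
    risk_scores -- model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification (assumption): ignore tied times
        # a is the individual with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four individuals, the third is censored.
toy_c = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1])
print(toy_c)  # 4 of 5 comparable pairs are concordant -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.71 to 0.74 should be read.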

Diabetes Mellitus, Type 2 , Humans , Diabetes Mellitus, Type 2/metabolism , Prospective Studies , C-Peptide , Proteomics , Insulin/therapeutic use , Biomarkers , Machine Learning , Cholesterol

Semi-supervised empirical Bayes group-regularized factor regression.

Münch, Magnus M; van de Wiel, Mark A; van der Vaart, Aad W; Peeters, Carel F W.

Biom J ; 64(7): 1289-1306, 2022 Oct.

Article En | MEDLINE | ID: mdl-35730912

The features in a high-dimensional biomedical prediction problem are often well described by low-dimensional latent variables (or factors). We use this to include unlabeled features and additional information on the features when building a prediction model. Such additional feature information is often available in biomedical applications. Examples are annotation of genes, metabolites, or p-values from a previous study. We employ a Bayesian factor regression model that jointly models the features and the outcome using Gaussian latent variables. We fit the model using a computationally efficient variational Bayes method, which scales to high dimensions. We use the extra information to set up a prior model for the features in terms of hyperparameters, which are then estimated through empirical Bayes. The method is demonstrated in simulations and two applications. One application considers influenza vaccine efficacy prediction based on microarray data. The second application predicts oral cancer metastasis from RNAseq data.

Algorithms , Research Design , Bayes Theorem , Normal Distribution

Adaptive group-regularized logistic elastic net regression.

Münch, Magnus M; Peeters, Carel F W; Van Der Vaart, Aad W; Van De Wiel, Mark A.

Biostatistics ; 22(4): 723-737, 2021 Oct 13.

Article En | MEDLINE | ID: mdl-31886488

In high-dimensional data settings, additional information on the features is often available. Examples of such external information in omics research are: (i) p-values from a previous study and (ii) omics annotation. The inclusion of this information in the analysis may enhance classification performance and feature selection but is not straightforward. We propose a group-regularized (logistic) elastic net regression method, where each penalty parameter corresponds to a group of features based on the external information. The method, termed gren, makes use of the Bayesian formulation of logistic elastic net regression to estimate both the model and penalty parameters in an approximate empirical-variational Bayes framework. Simulations and applications to three cancer genomics studies and one Alzheimer metabolomics study show that, if the partitioning of the features is informative, classification performance and feature selection are indeed enhanced.

Genomics , Neoplasms , Bayes Theorem , Humans , Logistic Models , Regression Analysis

Drug sensitivity prediction with normal inverse Gaussian shrinkage informed by external data.

Münch, Magnus M; van de Wiel, Mark A; Richardson, Sylvia; Leday, Gwenaël G R.

Biom J ; 63(2): 289-304, 2021 Feb.

Article En | MEDLINE | ID: mdl-33155717

In precision medicine, a common problem is drug sensitivity prediction from cancer tissue cell lines. These types of problems entail modelling multivariate drug responses on high-dimensional molecular feature sets in typically >1000 cell lines. The dimensions of the problem require specialised models and estimation methods. In addition, external information on both the drugs and the features is often available. We propose to model the drug responses through a linear regression with shrinkage enforced through a normal inverse Gaussian prior. We let the prior depend on the external information, and estimate the model and external information dependence in an empirical-variational Bayes framework. We demonstrate the usefulness of this model in both a simulated setting and in the publicly available Genomics of Drug Sensitivity in Cancer data.

Genomics , Pharmaceutical Preparations , Bayes Theorem , Normal Distribution , Precision Medicine

Infection and RNA-seq analysis of a zebrafish tlr2 mutant shows a broad function of this toll-like receptor in transcriptional and metabolic control and defense to Mycobacterium marinum infection.

Hu, Wanbin; Yang, Shuxin; Shimada, Yasuhito; Münch, Magnus; Marín-Juez, Rubén; Meijer, Annemarie H; Spaink, Herman P.

BMC Genomics ; 20(1): 878, 2019 Nov 20.

Article En | MEDLINE | ID: mdl-31747871

BACKGROUND: The function of Toll-like receptor 2 (TLR2) in host defense against pathogens, especially Mycobacterium tuberculosis (Mtb), is poorly understood. To investigate the role of TLR2 during mycobacterial infection, we analyzed the response of tlr2 zebrafish mutant larvae to infection with Mycobacterium marinum (Mm), a close relative of Mtb, as a model for tuberculosis. We measured infection phenotypes and transcriptome responses using RNA deep sequencing in mutant and control larvae. RESULTS: tlr2 mutant embryos at 2 dpf do not show differences in numbers of macrophages and neutrophils compared to control embryos. However, we found substantial changes in gene expression in these mutants, particularly in metabolic pathways, when compared with the heterozygote tlr2+/- control. At 4 days after Mm infection, the total bacterial burden and the presence of extracellular bacteria were higher in tlr2-/- larvae than in tlr2+/- or tlr2+/+ larvae, whereas granuloma numbers were reduced, showing a function of Tlr2 in zebrafish host defense. RNAseq analysis of infected tlr2-/- versus tlr2+/- shows that the number of up-regulated and down-regulated genes in response to infection was greatly diminished in tlr2 mutants by at least 2 fold and 10 fold, respectively. Analysis of the transcriptome data and qPCR validation shows that Mm infection of tlr2 mutants leads to decreased mRNA levels of genes involved in inflammation and immune responses, including il1b, tnfb, cxcl11aa/ac, fosl1a, and cebpb. Furthermore, RNAseq analyses revealed that the expression of genes for Maf family transcription factors, vitamin D receptors, and Dicps proteins is altered in tlr2 mutants with or without infection. In addition, the data indicate a function of Tlr2 in the control of induction of cytokines and chemokines, such as the CXCR3-CXCL11 signaling axis. CONCLUSION: The transcriptome and infection burden analyses show a function of Tlr2 as a protective factor against mycobacteria. Transcriptome analysis revealed tlr2-specific pathways involved in Mm infection, which are related to responses to Mtb infection in human macrophages. Considering its dominant function in control of transcriptional processes that govern defense responses and metabolism, the TLR2 protein can be expected to be also of importance for other infectious diseases and interactions with the microbiome.

Fish Diseases/genetics , Gene Expression Regulation, Developmental , Mycobacterium Infections, Nontuberculous/genetics , Mycobacterium Infections, Nontuberculous/veterinary , Toll-Like Receptor 2/genetics , Zebrafish/genetics , Animals , CCAAT-Enhancer-Binding Protein-beta/genetics , CCAAT-Enhancer-Binding Protein-beta/immunology , Chemokine CXCL11/genetics , Chemokine CXCL11/immunology , Disease Resistance/genetics , Embryo, Nonmammalian , Fish Diseases/immunology , Fish Diseases/microbiology , Host-Pathogen Interactions/genetics , Host-Pathogen Interactions/immunology , Immunity, Innate , Interleukin-1beta/genetics , Interleukin-1beta/immunology , Larva/genetics , Larva/growth & development , Larva/immunology , Larva/microbiology , Lymphotoxin-alpha/genetics , Lymphotoxin-alpha/immunology , Macrophages/immunology , Macrophages/microbiology , Maf Transcription Factors/genetics , Maf Transcription Factors/immunology , Metabolic Networks and Pathways/genetics , Metabolic Networks and Pathways/immunology , Mycobacterium Infections, Nontuberculous/immunology , Mycobacterium Infections, Nontuberculous/microbiology , Mycobacterium marinum/immunology , Mycobacterium marinum/pathogenicity , Neutrophils/immunology , Neutrophils/microbiology , Proto-Oncogene Proteins c-fos/genetics , Proto-Oncogene Proteins c-fos/immunology , Receptors, CXCR3/genetics , Receptors, CXCR3/immunology , Receptors, Immunologic/genetics , Receptors, Immunologic/immunology , Toll-Like Receptor 2/deficiency , Toll-Like Receptor 2/immunology , Transcriptome/immunology , Zebrafish/growth & development , Zebrafish/immunology , Zebrafish/microbiology , Zebrafish Proteins/genetics , Zebrafish Proteins/immunology

Learning from a lot: Empirical Bayes for high-dimensional model-based prediction.

van de Wiel, Mark A; Te Beest, Dennis E; Münch, Magnus M.

Scand Stat Theory Appl ; 46(1): 2-25, 2019 Mar.

Article En | MEDLINE | ID: mdl-31007342

Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
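
To make the idea of a simple empirical Bayes estimator in a linear model concrete, here is a minimal sketch under stated assumptions; it is not necessarily the estimator studied in the article. Assume y = Xβ + ε with β_j ~ N(0, τ²) and ε_i ~ N(0, σ²), where σ² is known. Then E[yᵀy] = τ² tr(XᵀX) + nσ², which suggests the moment estimator τ̂² = max(0, (yᵀy − nσ²) / tr(XᵀX)). The function names `eb_prior_variance` and `simulate` are hypothetical.

```python
import random


def eb_prior_variance(X, y, sigma2):
    """Moment-based empirical Bayes estimate of the prior variance tau^2.

    Assumes y = X beta + eps, beta_j ~ N(0, tau^2), eps_i ~ N(0, sigma2),
    with sigma2 known. Uses E[y'y] = tau^2 * tr(X'X) + n * sigma2.
    """
    n = len(y)
    yty = sum(v * v for v in y)
    trace_xtx = sum(x * x for row in X for x in row)  # tr(X'X) = sum of squared entries
    return max(0.0, (yty - n * sigma2) / trace_xtx)   # truncate at zero


def simulate(n, p, tau2, sigma2, seed):
    """Draw one data set (X, y) from the assumed model; illustration only."""
    rng = random.Random(seed)
    beta = [rng.gauss(0.0, tau2 ** 0.5) for _ in range(p)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma2 ** 0.5)
         for row in X]
    return X, y


# Usage: estimate tau^2 for a growing number of variables p.
for p in (10, 100, 1000):
    X, y = simulate(n=50, p=p, tau2=0.1, sigma2=1.0, seed=1)
    print(p, eb_prior_variance(X, y, sigma2=1.0))
```

Running the loop for increasing p gives a rough feel for how such an estimator behaves as the number of variables grows, which is the relation the abstract says the article examines.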