Variable selection and validation in multivariate modelling.
Shi, Lin; Westerhuis, Johan A; Rosén, Johan; Landberg, Rikard; Brunius, Carl.
Affiliation
  • Shi L; Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
  • Westerhuis JA; Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.
  • Rosén J; Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam XH, The Netherlands.
  • Landberg R; Metabolomics Center, North-West University, X6001, Potchefstroom, South Africa.
  • Brunius C; Swedish National Food Agency, Uppsala, Sweden.
Bioinformatics; 35(6): 972-980, 2019-03-15.
Article in En | MEDLINE | ID: mdl-30165467
ABSTRACT
MOTIVATION:

Validation of variable selection and predictive performance is crucial in construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection leads instead to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms combining identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed.

RESULTS:

We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination in a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. Using three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including a simultaneous variable selection and validation scheme and wider applicability.

AVAILABILITY AND IMPLEMENTATION:

Algorithms, data, scripts and tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
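The general idea named in RESULTS, recursive variable elimination nested inside repeated double cross-validation, can be illustrated with a small sketch. This is not the MUVR implementation: it is a toy in plain Python, and every name in it (`rdcv_select`, `fit`, `folds`, the `keep` fraction) is hypothetical. It assumes a correlation-weighted regression as a stand-in for partial least squares or random forest, uses the absolute weights as a stand-in for variable importance, and only covers a minimal-optimal style selection, not the all-relevant set.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den if den else 0.0


def fit(X, y, rows, cols):
    """Toy model (assumption): regress y on a correlation-weighted sum of cols."""
    ys = [y[i] for i in rows]
    w = {j: pearson([X[i][j] for i in rows], ys) for j in cols}
    s = [sum(w[j] * X[i][j] for j in cols) for i in rows]
    ms, my = statistics.fmean(s), statistics.fmean(ys)
    var = sum((v - ms) ** 2 for v in s)
    slope = sum((v - ms) * (t - my) for v, t in zip(s, ys)) / var if var else 0.0
    return w, slope, my - slope * ms


def sse(model, X, y, rows):
    """Sum of squared prediction errors on the given rows."""
    w, slope, intercept = model
    preds = (intercept + slope * sum(wj * X[i][j] for j, wj in w.items()) for i in rows)
    return sum((p - y[i]) ** 2 for p, i in zip(preds, rows))


def folds(rows, k, rng):
    """Shuffle rows and deal them into k disjoint segments."""
    rows = list(rows)
    rng.shuffle(rows)
    return [rows[f::k] for f in range(k)]


def rdcv_select(X, y, n_rep=3, k_outer=4, k_inner=3, keep=0.7, seed=0):
    """Return (outer-loop RMSE, how often each variable was selected)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total_sse, counts = 0.0, [0] * p
    for _ in range(n_rep):                          # repetitions
        for test in folds(range(n), k_outer, rng):  # outer loop: held-out test segment
            held_out = set(test)
            calib = [i for i in range(n) if i not in held_out]
            inner = folds(calib, k_inner, rng)
            cols, best_cols, best_err = list(range(p)), None, float("inf")
            while True:                             # recursive variable elimination
                err, importance = 0.0, dict.fromkeys(cols, 0.0)
                for val in inner:                   # inner loop: validation segments
                    val_set = set(val)
                    train = [i for i in calib if i not in val_set]
                    model = fit(X, y, train, cols)
                    err += sse(model, X, y, val)
                    for j in cols:
                        importance[j] += abs(model[0][j])
                if err <= best_err:                 # ties favour the smaller set
                    best_cols, best_err = cols, err
                if len(cols) == 1:
                    break
                n_keep = max(1, min(len(cols) - 1, int(len(cols) * keep)))
                cols = sorted(cols, key=importance.get, reverse=True)[:n_keep]
            # The test segment is touched only after selection is finished.
            model = fit(X, y, calib, best_cols)
            total_sse += sse(model, X, y, test)
            for j in best_cols:
                counts[j] += 1
    return (total_sse / (n * n_rep)) ** 0.5, counts


# Synthetic demo: only columns 0 and 1 carry signal; the other eight are noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [row[0] + row[1] + rng.gauss(0, 0.1) for row in X]
rmse, counts = rdcv_select(X, y)
print(f"outer-loop RMSE: {rmse:.3f}")
print("selection counts per variable:", counts)
```

The point of the nesting is that the outer test segment never influences which variables are kept, so the reported error is not inflated by selection bias. Selection happens entirely within the inner loop on the calibration data.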
Subjects

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms Type of study: Prognostic_studies Limit: Humans Language: En Year of publication: 2019 Document type: Article
