A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization.
Hornung, Roman; Bernau, Christoph; Truntzer, Caroline; Wilson, Rory; Stadler, Thomas; Boulesteix, Anne-Laure.
Affiliation
  • Hornung R; Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany. hornung@ibe.med.uni-muenchen.de.
  • Bernau C; Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany. Christoph.Bernau@lrz.de.
  • Truntzer C; Leibniz Supercomputing Center, Boltzmannstr. 1, Garching, D-85748, Germany. Christoph.Bernau@lrz.de.
  • Wilson R; Clinical and Innovation Proteomic Platform, Pôle de Recherche Université de Bourgogne, 15 Bd Maréchal de Lattre de Tassigny, Dijon, F-21000, France. Caroline.TRUNTZER@clipproteomic.fr.
  • Stadler T; Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany. wilson.rory@gmail.com.
  • Boulesteix AL; Department of Urology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany. praxis@urologie-im-tal.de.
BMC Med Res Methodol; 15: 95, 2015 Nov 04.
Article in English | MEDLINE | ID: mdl-26537575
BACKGROUND: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.
METHODS: We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.
RESULTS: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.
CONCLUSIONS: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
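
The distinction between incomplete and full CV described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the preparation step is plain per-feature standardization (a stand-in for the normalization methods and PCA studied by the authors), the classifier is a simple nearest-centroid rule, the data are synthetic, and the function names (`fit_scaler`, `cv_error`, `make_toy_data`) are hypothetical. With `full=False` the preparation step is fitted once on all samples before splitting; with `full=True` it is refitted on the training part of every split and only applied to the test part.

```python
import random
import statistics


def fit_scaler(rows):
    """Estimate per-feature mean and standard deviation from the given rows."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Guard against constant features: fall back to a divisor of 1.0.
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, sds


def apply_scaler(rows, scaler):
    """Standardize rows with previously estimated means and deviations."""
    means, sds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def nearest_centroid_error(train_x, train_y, test_x, test_y):
    """Misclassification rate of a nearest-centroid rule on the test samples."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    wrong = sum(
        min(centroids, key=lambda lab: sq_dist(x, centroids[lab])) != y
        for x, y in zip(test_x, test_y)
    )
    return wrong / len(test_y)


def cv_error(x, y, k=5, full=True, seed=0):
    """K-fold CV error; `full` controls where the preparation step is fitted."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    if not full:
        # Incomplete CV: the step sees every sample, test folds included.
        x = apply_scaler(x, fit_scaler(x))
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        tr_x, tr_y = [x[i] for i in train_idx], [y[i] for i in train_idx]
        te_x, te_y = [x[i] for i in test_idx], [y[i] for i in test_idx]
        if full:
            # Full CV: fit on the training part only, then apply to both parts.
            scaler = fit_scaler(tr_x)
            tr_x, te_x = apply_scaler(tr_x, scaler), apply_scaler(te_x, scaler)
        fold_errors.append(nearest_centroid_error(tr_x, tr_y, te_x, te_y))
    return statistics.fmean(fold_errors)


def make_toy_data(n=60, p=5, shift=0.8, seed=1):
    """Synthetic two-class data (assumption: Gaussian features, mean shift)."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    x = [[rng.gauss(shift * label, 1.0) for _ in range(p)] for label in y]
    return x, y


x, y = make_toy_data()
print("incomplete CV error:", cv_error(x, y, full=False))
print("full CV error:      ", cv_error(x, y, full=True))
```

A single run on toy data like this says nothing about the size or direction of any bias; the sketch only shows where the preparation step sits relative to the train/test split in the two procedures.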

Full text: 1
Collection: 01-international
Database: MEDLINE
Main subject: Selection Bias / Regression Analysis / Data Interpretation, Statistical / Principal Component Analysis
Type of study: Diagnostic_studies / Prognostic_studies / Risk_factors_studies
Limits: Humans
Language: En
Journal: BMC Med Res Methodol
Journal subject: MEDICINE
Year: 2015
Document type: Article
Affiliation country: Germany
