Detecting the impact of subject characteristics on machine learning-based diagnostic applications.
Chaibub Neto, Elias; Pratap, Abhishek; Perumal, Thanneer M; Tummalacherla, Meghasyam; Snyder, Phil; Bot, Brian M; Trister, Andrew D; Friend, Stephen H; Mangravite, Lara; Omberg, Larsson.
Affiliation
  • Chaibub Neto E; Sage Bionetworks, Seattle, USA.
  • Pratap A; Sage Bionetworks, Seattle, USA.
  • Perumal TM; Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, USA.
  • Tummalacherla M; Sage Bionetworks, Seattle, USA.
  • Snyder P; Sage Bionetworks, Seattle, USA.
  • Bot BM; Sage Bionetworks, Seattle, USA.
  • Trister AD; Sage Bionetworks, Seattle, USA.
  • Friend SH; Sage Bionetworks, Seattle, USA.
  • Mangravite L; Sage Bionetworks, Seattle, USA.
  • Omberg L; YouandMe, Seattle, USA.
NPJ Digit Med ; 2: 99, 2019.
Article in English | MEDLINE | ID: mdl-31633058
ABSTRACT
Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets ("record-wise" data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of "identity confounding." In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
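
The difference between a record-wise split and a split that keeps each individual on one side only (here called "subject-wise") can be illustrated with a minimal sketch. It is not the method presented in the paper, which quantifies identity confounding; it only shows how subjects end up in both sets under a record-wise split. The function names, the toy data, and the 30% test fraction are all assumptions made for illustration.

```python
import random


def record_wise_split(records, test_fraction=0.3, seed=0):
    """Shuffle individual records, ignoring which subject each came from."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction=0.3, seed=0):
    """Assign every record of a given subject to exactly one side of the split."""
    rng = random.Random(seed)
    subjects = sorted({r["subject"] for r in records})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject"] not in test_subjects]
    test = [r for r in records if r["subject"] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Return the subjects that appear in both the training and the test set."""
    return {r["subject"] for r in train} & {r["subject"] for r in test}


# Hypothetical toy data: 10 subjects with 20 repeated records each.
records = [{"subject": s, "record": i} for s in range(10) for i in range(20)]

rw_train, rw_test = record_wise_split(records)
sw_train, sw_test = subject_wise_split(records)

print("record-wise, subjects in both sets:", len(shared_subjects(rw_train, rw_test)))
print("subject-wise, subjects in both sets:", len(shared_subjects(sw_train, sw_test)))
```

When subjects are shared between the two sets, a classifier can score well on the test set partly by recognizing individuals it has already seen, which is the situation the abstract calls identity confounding.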

Full text: 1 Collection: 01-international Database: MEDLINE Study type: Diagnostic_studies / Prognostic_studies Language: En Journal: NPJ Digit Med Year: 2019 Document type: Article Country of affiliation: United States
