Confound-leakage: confound removal in machine learning leads to leakage.

Hamdan, Sami; Love, Bradley C; von Polier, Georg G; Weis, Susanne; Schwender, Holger; Eickhoff, Simon B; Patil, Kaustubh R

Affiliation

Hamdan S; Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.
Love BC; Institute of Systems Neuroscience, Medical Faculty, Heinrich-Heine University Düsseldorf, 40225 Düsseldorf, Germany.
von Polier GG; Department of Experimental Psychology, University College London, WC1H 0AP London, UK.
Weis S; The Alan Turing Institute, London NW1 2DB, UK.
Schwender H; European Lab for Learning & Intelligent Systems (ELLIS), WC1E 6BT, London, UK.
Eickhoff SB; Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.
Patil KR; Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, University Hospital Frankfurt, 60528 Frankfurt, Germany.

Gigascience ; 12: 2022 12 28.

Article in English | MEDLINE | ID: mdl-37776368

ABSTRACT

BACKGROUND:

Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.

RESULTS:

We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. In a simple framework that treats the target as a confound, we show that information leaked via CR can inflate null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not to CR revealing information already present in the features. We then demonstrate the danger of confound-leakage in a real-world clinical application, where the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated when depression is used as a confound.
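The target-as-confound framework can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: all function names are hypothetical, the integer-valued noise feature stands in for a feature of limited precision, and the lookup classifier is a simple stand-in for a nonlinear learner such as a fully grown decision tree. The feature is generated independently of the target, so it plays the same role as a shuffled feature: any accuracy above chance after CR can only come from leakage. CR is fitted on the training split only and then applied to both splits.

```python
import random
from collections import Counter, defaultdict

def fit_confound_regression(x, c):
    """Least-squares fit of one feature: x ~ intercept + slope * c."""
    n = len(x)
    mean_x, mean_c = sum(x) / n, sum(c) / n
    var_c = sum((ci - mean_c) ** 2 for ci in c)
    cov = sum((ci - mean_c) * (xi - mean_x) for xi, ci in zip(x, c))
    slope = cov / var_c
    return mean_x - slope * mean_c, slope

def remove_confound(x, c, intercept, slope):
    """Return residuals of x after subtracting the fitted confound effect."""
    return [xi - (intercept + slope * ci) for xi, ci in zip(x, c)]

def fit_lookup(x, y):
    """Memorise the majority label for every distinct feature value."""
    votes = defaultdict(Counter)
    for xi, yi in zip(x, y):
        votes[xi][yi] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def predict_lookup(table, x):
    """Predict with the label stored for the nearest memorised value."""
    keys = list(table)
    return [table[min(keys, key=lambda k: abs(k - xi))] for xi in x]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def run_once(seed, n_train=200, n_test=100):
    rng = random.Random(seed)
    n = n_train + n_test
    y = [i % 2 for i in range(n)]              # balanced binary target
    x = [rng.randint(0, 9) for _ in range(n)]  # noise, independent of y
    x_tr, x_te = x[:n_train], x[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]

    # Baseline: nonlinear learner on the raw noise feature.
    acc_raw = accuracy(y_te, predict_lookup(fit_lookup(x_tr, y_tr), x_te))

    # Assumption of this sketch: the target itself is used as the confound.
    intercept, slope = fit_confound_regression(x_tr, y_tr)
    r_tr = remove_confound(x_tr, y_tr, intercept, slope)
    r_te = remove_confound(x_te, y_te, intercept, slope)
    acc_cr = accuracy(y_te, predict_lookup(fit_lookup(r_tr, y_tr), r_te))
    return acc_raw, acc_cr

# Average over several seeds so a single unlucky draw does not dominate.
results = [run_once(seed) for seed in range(20)]
mean_raw = sum(r[0] for r in results) / len(results)
mean_cr = sum(r[1] for r in results) / len(results)
print(f"accuracy without CR:                  {mean_raw:.2f}")
print(f"accuracy with target-as-confound CR:  {mean_cr:.2f}")
```

In this toy setting the leak arises because the two classes almost always have slightly different sample means, so the residuals of the two classes fall on two shifted grids of values. A learner that can memorise individual values can then read the class off the residual, even though the raw feature carries no information about the target.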

CONCLUSIONS:

As shown, confound-leakage can cause confounding effects to be mishandled or even amplified when building ML models, which can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall, together with the guidelines we provide for dealing with it, can help create more robust and trustworthy ML models.

Subjects

Machine Learning

Keywords

clinical applications; confounding; data-leakage; machine-learning

Full text: 1 | Database: MEDLINE | Main subject: Machine Learning | Study type: Guideline / Prognostic_studies | Language: En | Journal: Gigascience | Publication year: 2022 | Document type: Article | Country of affiliation: Germany
