An evaluation of synthetic data augmentation for mitigating covariate bias in health data.

Juwara, Lamin; El-Hussuna, Alaa; El Emam, Khaled

Affiliation

Juwara L; School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.
El-Hussuna A; Research Institute, Children's Hospital of Eastern Ontario, Ottawa, ON, Canada.
El Emam K; Open Source Research Collaboration, Aalborg, Denmark.

Patterns (N Y); 5(4): 100946, 2024 Apr 12.

Article in En | MEDLINE | ID: mdl-38645766

ABSTRACT

Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against a synthetic data-augmentation method, synthetic minority augmentation (SMA), which uses sequential boosted decision trees to synthesize under-represented groups. Through simulations and analysis of real health datasets on a logistic regression workload, we evaluate the approaches across bias scenarios of varying type and severity. Performance is assessed on area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces the results closest to the ground truth under low to medium bias (a missing proportion of 50% or less). Under high bias (a missing proportion of 80% or more), SMA shows no clear advantage, and no single method consistently outperforms the others.
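As a rough illustration of the evaluation setup described above, the sketch below shows one way to induce covariate bias by dropping a proportion of one subgroup, and how the two predictive metrics named in the abstract (area under the curve and Brier score) can be computed. It does not implement SMA or any of the compared mitigation methods. The function names (`induce_covariate_bias`, `brier_score`, `auc`) and the record layout are hypothetical choices for this example, and the study's actual procedure for inducing bias may differ.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Chance that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative outcome")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def induce_covariate_bias(records, in_group, missing_proportion, seed=0):
    """Drop `missing_proportion` of the records for which in_group(record) is true.

    Assumption for this sketch: bias is induced by uniform random removal
    within the chosen subgroup; all other records are kept unchanged.
    """
    rng = random.Random(seed)
    group = [r for r in records if in_group(r)]
    rest = [r for r in records if not in_group(r)]
    n_keep = round(len(group) * (1 - missing_proportion))
    return rest + rng.sample(group, n_keep)


# Hypothetical toy data: 50 records per level of a binary covariate.
records = [{"sex": "F" if i % 2 else "M", "id": i} for i in range(100)]
biased = induce_covariate_bias(records, lambda r: r["sex"] == "F", 0.8)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(len(biased), auc(y_true, y_prob), brier_score(y_true, y_prob))
```

In a full comparison, a model fitted on the biased (or augmented) sample would be scored with these metrics against a model fitted on the unbiased ground-truth data.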

Key words

classification; covariate imbalance; data bias; fairness; generative model; synthetic data generation

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: Patterns (N Y) Year: 2024 Document type: Article Affiliation country: Canada
