Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data.
Pereira, Mayana; Kshirsagar, Meghana; Mukherjee, Sumit; Dodhia, Rahul; Lavista Ferres, Juan; de Sousa, Rafael.
Affiliations
  • Pereira M; AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America.
  • Kshirsagar M; Department of Electrical Engineering, University of Brasilia, Brasilia, Brazil.
  • Mukherjee S; AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America.
  • Dodhia R; INSITRO, San Francisco, CA, United States of America.
  • Lavista Ferres J; AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America.
  • de Sousa R; AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America.
PLoS One; 19(2): e0297271, 2024.
Article in English | MEDLINE | ID: mdl-38315667
ABSTRACT
Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines is particularly important in areas such as health care and humanitarian action, where data are scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones in terms of model-training utility for tabular data. Indeed, we show that models trained on data generated by marginal-based algorithms can exhibit utility similar to that of models trained on real data. Our analysis also reveals that marginal-based synthetic data generated with the AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained on real data.
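The pipeline the abstract describes, training a classifier on synthetic data and scoring it for both utility and fairness, can be illustrated with a rough sketch. The code below is not the authors' code: the toy dataset, the bootstrap-resampling stand-in for a DP generator, and the single fairness definition (demographic parity gap) are assumptions for illustration, whereas the paper evaluates real DP generators such as AIM, MWEM PGM, and GAN-based methods across several fairness definitions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy tabular data: binary protected attribute A, two numeric features, binary label.
n = 5000
a = rng.integers(0, 2, n)
x1 = rng.normal(0.5 * a, 1.0, n)
x2 = rng.normal(0.0, 1.0, n)
y = (x1 + x2 + 0.3 * a + rng.normal(0.0, 1.0, n) > 0).astype(int)
X = np.column_stack([a, x1, x2])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Placeholder for a DP synthetic generator: bootstrap-resample the training split.
# A real pipeline would fit a DP generator (e.g. marginal-based or GAN-based)
# on X_train/y_train and sample synthetic records from it instead.
idx = rng.integers(0, len(X_train), len(X_train))
X_synth, y_synth = X_train[idx], y_train[idx]

# Train the downstream classifier on synthetic data only.
model = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)

# Utility: accuracy on held-out data (the paper also studies the setting where
# no real test data is assumed to be available).
accuracy = model.score(X_test, y_test)

# Fairness: demographic parity gap |P(pred=1 | A=0) - P(pred=1 | A=1)|.
pred = model.predict(X_test)
group = X_test[:, 0]
dp_gap = abs(pred[group == 0].mean() - pred[group == 1].mean())

print(f"accuracy={accuracy:.3f}  demographic_parity_gap={dp_gap:.3f}")

Running the sketch prints one utility number and one fairness gap for a single generator; the paper's analysis repeats this kind of measurement across generators, privacy budgets, and fairness definitions.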
Subject(s)

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Algorithms / Health Facilities Study type: Prognostic_studies Language: En Journal: PLoS One Journal subject: SCIENCE / MEDICINE Year: 2024 Document type: Article Country of affiliation: United States

...