Exploring the tradeoff between data privacy and utility with a clinical data analysis use case.

Im, Eunyoung; Kim, Hyeoneui; Lee, Hyungbok; Jiang, Xiaoqian; Kim, Ju Han

Affiliation

Im E; College of Nursing, Seoul National University, Seoul, South Korea.
Kim H; Center for World-leading Human-care Nurse Leaders for the Future by Brain Korea 21 (BK 21) four project, College of Nursing, Seoul National University, Seoul, South Korea.
Lee H; College of Nursing, Seoul National University, Seoul, South Korea. ifilgood@snu.ac.kr.
Jiang X; Center for World-leading Human-care Nurse Leaders for the Future by Brain Korea 21 (BK 21) four project, College of Nursing, Seoul National University, Seoul, South Korea. ifilgood@snu.ac.kr.
Kim JH; The Research Institute of Nursing Science, Seoul National University, Seoul, South Korea. ifilgood@snu.ac.kr.

BMC Med Inform Decis Mak ; 24(1): 147, 2024 May 30.

Article in En | MEDLINE | ID: mdl-38816848

ABSTRACT

BACKGROUND:

Securing adequate data privacy is critical for the productive utilization of data. However, de-identification, which involves masking or replacing specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies have investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility.

METHODS:

Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software tool for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two.
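The kind of transformation a de-identification configuration applies, and how it changes re-identification risk, can be illustrated with a minimal sketch. This is not ARX's API (ARX is a Java tool) and not the study's configuration: the toy records, the `generalize` and `suppress_small_classes` helpers, the 10-year age bands, the 3-digit postal prefix, and the use of prosecutor risk (1 divided by the size of a record's equivalence class) are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy records: (age, sex, postal code). Not data from the study.
records = [
    (34, "F", "03080"),
    (36, "F", "03081"),
    (38, "F", "03087"),
    (52, "M", "04510"),
    (57, "M", "04515"),
    (71, "M", "06236"),
]

def generalize(record):
    """Coarsen quasi-identifiers: 10-year age bands, 3-digit postal prefix."""
    age, sex, postal = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", sex, postal[:3] + "**")

def suppress_small_classes(rows, k):
    """Drop rows whose equivalence class has fewer than k members."""
    class_sizes = Counter(rows)
    return [row for row in rows if class_sizes[row] >= k]

def reidentification_risk(rows):
    """Return (highest, average) prosecutor risk over all rows.

    A row's risk is 1 / size of its equivalence class, i.e. the number of
    rows sharing exactly the same quasi-identifier values.
    """
    class_sizes = Counter(rows)
    risks = [1 / class_sizes[row] for row in rows]
    return max(risks), sum(risks) / len(risks)

original_risk = reidentification_risk(records)
generalized_rows = [generalize(r) for r in records]
generalized_risk = reidentification_risk(generalized_rows)
k_anonymous_rows = suppress_small_classes(generalized_rows, k=2)
k_anonymous_risk = reidentification_risk(k_anonymous_rows)

print("original    :", original_risk)
print("generalized :", generalized_risk)
print("2-anonymous :", k_anonymous_risk, "rows kept:", len(k_anonymous_rows))
```

In this toy example every original record is unique, generalization merges most records into shared classes, and enforcing a minimum class size removes the one record that remains unique. That last step is the record suppression that lowers risk at the cost of losing cases for analysis.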

RESULTS:

All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores.
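A correlation of this kind could be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how the reduction rate was defined or which correlation coefficient was used, so the relative-drop definition in `reduction_rate` and the choice of Pearson's r are assumptions, and the numeric values are placeholders that imply nothing about the study's findings or the direction of the reported association.

```python
import math

def reduction_rate(risk_before, risk_after):
    """Relative drop in re-identification risk (assumed definition)."""
    return (risk_before - risk_after) / risk_before

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for four hypothetical scenarios, not study results.
baseline_risk = 1.0
risks_after = [0.50, 0.25, 0.20, 0.10]
utility_scores = [0.90, 0.70, 0.65, 0.40]

rates = [reduction_rate(baseline_risk, r) for r in risks_after]
r_value = pearson_r(rates, utility_scores)
print(f"Pearson r between reduction rate and utility score: {r_value:.3f}")
```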

CONCLUSIONS:

As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.

Subjects

Confidentiality; Data Anonymization; Humans; Confidentiality/standards; Emergency Service, Hospital; Length of Stay; Republic of Korea; Male

Keywords

ARX tool; Clinical data analysis; Data de-identification; Data privacy; Data utility

Full text: 1 Collections: 01-international Database: MEDLINE Limit: Humans / Male Country/Region as subject: Asia Language: En Publication year: 2024 Document type: Article
