DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes.

Zhou, Nina; Wu, Qiucheng; Wu, Zewen; Marino, Simeone; Dinov, Ivo D

Affiliation

Zhou N; Statistics Online Computational Resource, Health Behavior and Biological Sciences, and Department of Biostatistics, University of Michigan, Ann Arbor, USA.
Wu Q; Statistics Online Computational Resource, Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, USA.
Wu Z; Statistics Online Computational Resource, Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, USA.
Marino S; Statistics Online Computational Resource, Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, USA.
Dinov ID; Statistics Online Computational Resource, Health Behavior and Biological Sciences, Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA. statistics@umich.edu.

J Med Syst ; 46(12): 96, 2022 Nov 16.

Article in English | MEDLINE | ID: mdl-36380246

ABSTRACT

Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free-text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data, which inherits the joint population distribution of the original data, and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, using privacy protection, 60.9-86.5% of the synthetic descriptions belong to the same cluster as the original descriptions, demonstrating better utility preservation than the naïve content suppressing method (45.8-85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).
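The cluster-based utility figures quoted in the abstract can be read as a simple agreement rate. The sketch below is illustrative only and is not the authors' code: it assumes each original description and its synthetic counterpart have already been assigned cluster labels by the same clustering model, and the function name `same_cluster_rate` and the toy labels are hypothetical.

```python
from typing import Sequence


def same_cluster_rate(original_labels: Sequence[int],
                      synthetic_labels: Sequence[int]) -> float:
    """Fraction of records whose synthetic version keeps the original cluster label.

    Assumption: position i in both sequences refers to the same record, and
    both label sets come from the same fitted clustering model.
    """
    if len(original_labels) != len(synthetic_labels):
        raise ValueError("label sequences must have equal length")
    if len(original_labels) == 0:
        raise ValueError("no records to compare")
    matches = sum(o == s for o, s in zip(original_labels, synthetic_labels))
    return matches / len(original_labels)


# Hypothetical cluster assignments for eight records (not data from the paper).
original = [0, 0, 1, 1, 2, 2, 2, 0]
synthetic = [0, 1, 1, 1, 2, 0, 2, 0]

rate = same_cluster_rate(original, synthetic)
print(f"{rate:.1%}")  # 6 of 8 labels agree -> 75.0%
```

Under this reading, a higher rate means the obfuscated text still groups with the records it was derived from, which is the sense in which the abstract compares DataSifterText with naive content suppression.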

Subject(s)

Electronic Health Records; Privacy; Humans

Keywords

AI; Clinical notes; Data science; ML; PHI; Synthetic data

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Privacy / Electronic Health Records Study type: Diagnostic_studies / Prognostic_studies Limits: Humans Language: En Journal: J Med Syst Year: 2022 Document type: Article Country of affiliation: United States
