Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis.

Isasa, Imanol; Hernandez, Mikel; Epelde, Gorka; Londoño, Francisco; Beristain, Andoni; Larrea, Xabat; Alberdi, Ane; Bamidis, Panagiotis; Konstantinidis, Evdokimos

Isasa I; Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.
Hernandez M; Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.
Epelde G; Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia - San Sebastian, Spain.
Londoño F; Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain. gepelde@vicomtech.org.
Beristain A; eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastian, Spain. gepelde@vicomtech.org.
Larrea X; Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.
Alberdi A; Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain.
Bamidis P; Computer Science and Artificial Intelligence Department, Computer Science Faculty, University of the Basque Country (UPV/EHU), Donostia - San Sebastian, Spain.
Konstantinidis E; eHealth Group, Biogipuzkoa Health Research Institute, Donostia-San Sebastian, Spain.

BMC Med Inform Decis Mak ; 24(1): 27, 2024 Jan 30.

Article in English | MEDLINE | ID: mdl-38291386

ABSTRACT

BACKGROUND:

Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether used on its own or in combination with other privacy-enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while complying with the pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects' metadata (static) and a longitudinal component (a set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts.

METHODS:

Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3).
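The difference between A1, A2, and A3 lies in which parts of each record are synthetic and whether the metadata and the time series are produced together. The following Python sketch illustrates only that data flow. It is not the authors' implementation: the two `sample_*` functions are hypothetical stand-ins for trained generators such as WGAN-GP or DGAN, the positional pairing used in A1 and A2 is an assumption, and the toy records are invented for illustration and do not come from any of the three datasets.

```python
import random

# Invented toy records (age, sex) and short series; NOT taken from TMET, MIMIC-III or PMData.
TOY_META = [(25, "F"), (40, "M"), (63, "F")]
TOY_SERIES = [[1.0, 1.2, 1.1], [0.8, 0.9, 1.0], [1.5, 1.4, 1.6]]


def sample_metadata(real_meta, rng):
    """Hypothetical stand-in for a metadata generator: resamples each column independently."""
    columns = list(zip(*real_meta))
    return [tuple(rng.choice(col) for col in columns) for _ in real_meta]


def sample_series(real_series, rng):
    """Hypothetical stand-in for a time series generator: a random real series plus noise."""
    return [[x + rng.gauss(0.0, 0.1) for x in rng.choice(real_series)]
            for _ in real_series]


def approach_a1(real_meta, real_series, rng):
    # A1: synthetic metadata coupled with the real time series
    # (paired by position here, which is an assumption).
    return list(zip(sample_metadata(real_meta, rng), real_series))


def approach_a2(real_meta, real_series, rng):
    # A2: metadata and time series generated separately, then joined afterwards.
    return list(zip(sample_metadata(real_meta, rng),
                    sample_series(real_series, rng)))


def approach_a3(real_meta, real_series, rng):
    # A3: joint generation. The stand-in draws a whole (metadata, series) record,
    # so the link between a subject's metadata and series is kept.
    records = list(zip(real_meta, real_series))
    synthetic = []
    for _ in records:
        meta, series = rng.choice(records)
        synthetic.append((meta, [x + rng.gauss(0.0, 0.1) for x in series]))
    return synthetic
```

In this sketch only A3 draws metadata and series from the same record, which is one way to picture why separate generation (A2) can lose the coherence between the static and longitudinal components.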

RESULTS:

Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach varies depending on the assessed dimension and metric.

CONCLUSION:

The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.

Subject(s)

Metadata; Privacy; Humans; Time Factors; Databases, Factual

Keywords

Health data; Privacy-preserving data sharing; Synthetic data; Time series

Full text: 1 | Database: MEDLINE | Main subject: Privacy / Metadata | Study type: Prognostic_studies | Limits: Humans | Language: En | Year: 2024 | Document type: Article