Generating synthetic clinical data that capture class imbalanced distributions with generative adversarial networks: Example using antiretroviral therapy for HIV.
Kuo, Nicholas I-Hsien; Garcia, Federico; Sönnerborg, Anders; Böhm, Michael; Kaiser, Rolf; Zazzi, Maurizio; Polizzotto, Mark; Jorm, Louisa; Barbieri, Sebastiano.
Affiliation
  • Kuo NI; Centre for Big Data Research in Health, the University of New South Wales, Sydney, Australia. Electronic address: n.kuo@unsw.edu.au.
  • Garcia F; Instituto de Investigación Ibs.Granada, Spain; Hospital Universitario San Cecilio, Spain; CIBER de Enfermedades Infecciosas, Spain.
  • Sönnerborg A; Hospital Karolinska Institutet, Sweden.
  • Böhm M; Uniklinik Köln, Universität zu Köln, Germany.
  • Kaiser R; Uniklinik Köln, Universität zu Köln, Germany.
  • Zazzi M; Università degli Studi di Siena, Italy.
  • Polizzotto M; Australian National University, Canberra, Australia.
  • Jorm L; Centre for Big Data Research in Health, the University of New South Wales, Sydney, Australia.
  • Barbieri S; Centre for Big Data Research in Health, the University of New South Wales, Sydney, Australia.
J Biomed Inform; 144: 104436, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37451495
ABSTRACT

OBJECTIVE:

Clinical data's confidential nature often limits the development of machine learning models in healthcare. Generative adversarial networks (GANs) can synthesise realistic datasets, but suffer from mode collapse, resulting in low diversity and bias towards majority demographics and common clinical practices. This work proposes an extension to the classic GAN framework that includes a variational autoencoder (VAE) and an external memory mechanism to overcome these limitations and generate synthetic data accurately describing imbalanced class distributions commonly found in clinical variables.

METHODS:

The proposed method was used to generate a synthetic dataset related to antiretroviral therapy for human immunodeficiency virus (ART for HIV). We evaluated the dataset on five criteria: (1) how accurately it represents imbalanced class distributions; (2) the realism of the individual variables; (3) the realism of the relationships among variables; (4) patient disclosure risk; and (5) the utility of the generated dataset for developing downstream machine learning models.
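
As an illustration of criterion (1), the following minimal sketch compares the category proportions of one categorical variable in a real and a synthetic dataset. It is not the authors' evaluation procedure: the function names, the choice of summary values (missing categories and largest proportion gap), and the toy data are all assumptions made for this example.

```python
from collections import Counter


def category_proportions(values):
    """Return each category's share of the column as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def compare_distributions(real, synthetic):
    """Compare per-category proportions of a real and a synthetic column.

    Returns (missing, max_gap): the categories present in the real data but
    absent from the synthetic data, and the largest absolute difference in
    proportion over all categories. (Hypothetical helper, not from the paper.)
    """
    p_real = category_proportions(real)
    p_syn = category_proportions(synthetic)
    missing = sorted(set(p_real) - set(p_syn))
    categories = set(p_real) | set(p_syn)
    max_gap = max(abs(p_real.get(c, 0.0) - p_syn.get(c, 0.0)) for c in categories)
    return missing, max_gap


# Toy data (made up): a heavily imbalanced categorical variable.
real = ["A"] * 90 + ["B"] * 8 + ["C"] * 2
collapsed = ["A"] * 100                       # minority classes lost entirely
faithful = ["A"] * 89 + ["B"] * 9 + ["C"] * 2  # minority classes preserved

print(compare_distributions(real, collapsed))
print(compare_distributions(real, faithful))
```

A synthetic column that has lost its minority categories, as happens under mode collapse, shows up as a non-empty list of missing categories, whereas a faithful column has none and only a small proportion gap.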

RESULTS:

The proposed method overcomes the issue of mode collapse and generates a synthetic dataset that accurately describes imbalanced class distributions commonly found in clinical variables. The generated data has a patient disclosure risk of 0.095%, lower than the 9% threshold stated by Health Canada and the European Medicines Agency, which makes it suitable for secure distribution to the research community. The generated data also has high utility, indicating that the proposed method could enable the development of downstream machine learning algorithms for healthcare applications using synthetic data.

CONCLUSION:

Our proposed extension to the classic GAN framework, which includes a VAE and an external memory mechanism, represents a promising approach towards generating synthetic data that accurately describe imbalanced class distributions commonly found in clinical variables. This method overcomes the limitations of GANs and creates more realistic datasets with higher patient cohort diversity, facilitating the development of downstream machine learning algorithms for healthcare applications.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: HIV Infections / HIV Type of study: Prognostic_studies Limits: Humans Language: En Publication year: 2023 Document type: Article
