Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 11 de 11
Filtrar
Mais filtros

Base de dados
País como assunto
Ano de publicação
Tipo de documento
Intervalo de ano de publicação
1.
BMC Med Res Methodol ; 24(1): 136, 2024 Jun 22.
Artigo em Inglês | MEDLINE | ID: mdl-38909216

RESUMO

BACKGROUND: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. METHODS: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. RESULTS: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. CONCLUSION: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.


Assuntos
Algoritmos , Registros Eletrônicos de Saúde , Humanos , Registros Eletrônicos de Saúde/estatística & dados numéricos , Registros Eletrônicos de Saúde/normas , Cadeias de Markov , Informática Médica/métodos , Informática Médica/estatística & dados numéricos
2.
Sensors (Basel) ; 24(1)2024 Jan 02.
Artigo em Inglês | MEDLINE | ID: mdl-38203126

RESUMO

Synthetic data generation addresses the challenges of obtaining extensive empirical datasets, offering benefits such as cost-effectiveness, time efficiency, and robust model development. Nonetheless, synthetic data-generation methodologies still encounter significant difficulties, including a lack of standardized metrics for modeling different data types and comparing generated results. This study introduces PVS-GEN, an automated, general-purpose process for synthetic data generation and verification. The PVS-GEN method parameterizes time-series data with minimal human intervention and verifies model construction using a specific metric derived from extracted parameters. For complex data, the process iteratively segments the empirical dataset until an extracted parameter can reproduce synthetic data that reflects the empirical characteristics, irrespective of the sensor data type. Moreover, we introduce the PoR metric to quantify the quality of the generated data by evaluating its time-series characteristics. Consequently, the proposed method can automatically generate diverse time-series data that covers a wide range of sensor types. We compared PVS-GEN with existing synthetic data-generation methodologies, and PVS-GEN demonstrated a superior performance. It generated data with a similarity of up to 37.1% across multiple data types and by 19.6% on average using the proposed metric, irrespective of the data type.

3.
Sensors (Basel) ; 24(1)2023 Dec 26.
Artigo em Inglês | MEDLINE | ID: mdl-38202989

RESUMO

Data scarcity is a significant obstacle for modern data science and artificial intelligence research communities. The fact that abundant data are a key element of a powerful prediction model is well known through various past studies. However, industrial control systems (ICS) are operated in a closed environment due to security and privacy issues, so collected data are generally not disclosed. In this environment, synthetic data generation can be a good alternative. However, ICS datasets have time-series characteristics and include features with short- and long-term temporal dependencies. In this paper, we propose the attention-based variational recurrent autoencoder (AVRAE) for generating time-series ICS data. We first extend the evidence lower bound of the variational inference to time-series data. Then, a recurrent neural-network-based autoencoder is designed to take this as the objective. AVRAE employs the attention mechanism to effectively learn the long-term and short-term temporal dependencies ICS data implies. Finally, we present an algorithm for generating synthetic ICS time-series data using learned AVRAE. In a comprehensive evaluation using the ICS dataset HAI and various performance indicators, AVRAE successfully generated visually and statistically plausible synthetic ICS data.

4.
Stud Health Technol Inform ; 316: 963-967, 2024 Aug 22.
Artigo em Inglês | MEDLINE | ID: mdl-39176952

RESUMO

Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLM models of the Pythia LLM Scaling Suite with varying model sizes ranging from 14M to 1B, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks to make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLM models outperform the reference GAN model. Even the smallest 14M parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLM models for synthetic tabular data generation.


Assuntos
Benchmarking , Humanos
5.
Patterns (N Y) ; 5(4): 100946, 2024 Apr 12.
Artigo em Inglês | MEDLINE | ID: mdl-38645766

RESUMO

Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against a synthetic data-augmentation method that utilizes sequential boosted decision trees to synthesize under-represented groups. The approach is called synthetic minority augmentation (SMA). Through simulations and analysis of real health datasets on a logistic regression workload, the approaches are evaluated across various bias scenarios (types and severity levels). Performance was assessed based on area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces the closest results to the ground truth in low to medium bias (50% or less missing proportion). In high bias (80% or more missing proportion), the advantage of SMA is not obvious, with no specific method consistently outperforming others.

6.
J Imaging Inform Med ; 37(3): 1217-1227, 2024 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-38351224

RESUMO

To generate synthetic medical data incorporating image-tabular hybrid data by merging an image encoding/decoding model with a table-compatible generative model and assess their utility. We used 1342 cases from the Stony Brook University Covid-19-positive cases, comprising chest X-ray radiographs (CXRs) and tabular clinical data as a private dataset (pDS). We generated a synthetic dataset (sDS) through the following steps: (I) dimensionally reducing CXRs in the pDS using a pretrained encoder of the auto-encoding generative adversarial networks (αGAN) and integrating them with the correspondent tabular clinical data; (II) training the conditional tabular GAN (CTGAN) on this combined data to generate synthetic records, encompassing encoded image features and clinical data; and (III) reconstructing synthetic images from these encoded image features in the sDS using a pretrained decoder of the αGAN. The utility of sDS was assessed by the performance of the prediction models for patient outcomes (deceased or discharged). For the pDS test set, the area under the receiver operating characteristic (AUC) curve was calculated to compare the performance of prediction models trained separately with pDS, sDS, or a combination of both. We created an sDS comprising CXRs with a resolution of 256 × 256 pixels and tabular data containing 13 variables. The AUC for the outcome was 0.83 when the model was trained with the pDS, 0.74 with the sDS, and 0.87 when combining pDS and sDS for training. Our method is effective for generating synthetic records consisting of both images and tabular clinical data.


Assuntos
COVID-19 , Radiografia Torácica , SARS-CoV-2 , Humanos , COVID-19/diagnóstico por imagem , Radiografia Torácica/métodos , Feminino , Masculino , Pessoa de Meia-Idade , Idoso , Curva ROC , Adulto
7.
JMIR AI ; 3: e52615, 2024 Apr 22.
Artigo em Inglês | MEDLINE | ID: mdl-38875595

RESUMO

Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.

8.
Stud Health Technol Inform ; 316: 705-709, 2024 Aug 22.
Artigo em Inglês | MEDLINE | ID: mdl-39176892

RESUMO

To address privacy and ethical issues in using health data for machine learning, we evaluate the scalability of advanced synthetic data generation methods like GANs, VAEs, copulaGAN, and transformer models specifically for patient service utilization data. Our study examines five models on data from a Canadian health authority, focusing on training and generation efficiency, data resemblance, and practical utility. Our findings indicate that statistical models excel in efficiency, while most models produce synthetic data that closely mirrors real data, and is also useful for real-world applications.


Assuntos
Aprendizado de Máquina , Canadá , Humanos , Modelos Estatísticos , Registros Eletrônicos de Saúde
9.
Comput Biol Med ; 174: 108389, 2024 May.
Artigo em Inglês | MEDLINE | ID: mdl-38593640

RESUMO

PURPOSE: To evaluate the potential of synthetic radiomic data generation in addressing data scarcity in radiomics/radiogenomics models. METHODS: This study was conducted on a retrospectively collected cohort of 386 colorectal cancer patients (n = 2570 lesions) for whom matched contrast-enhanced CT images and gene TP53 mutational status were available. The full cohort data was divided into a training cohort (n = 2055 lesions) and an independent and fixed test set (n = 515 lesions). Differently sized training sets were subsampled from the training cohort to measure the impact of sample size on model performance and assess the added value of synthetic radiomic augmentation at different sizes. Five different tabular synthetic data generation models were used to generate synthetic radiomic data based on "real-world" radiomics data extracted from this cohort. The quality and reproducibility of the generated synthetic radiomic data were assessed. Synthetic radiomics were then combined with "real-world" radiomic training data to evaluate their impact on the predictive model's performance. RESULTS: A prediction model was generated using only "real-world" radiomic data, revealing the impact of data scarcity in this particular data set through a lack of predictive performance at low training sample numbers (n = 200, 400, 1000 lesions with average AUC = 0.52, 0.53, and 0.56 respectively, compared to 0.64 when using 2055 training lesions). Synthetic tabular data generation models created reproducible synthetic radiomic data with properties highly similar to "real-world" data (for n = 1000 lesions, average Chi-square = 0.932, average basic statistical correlation = 0.844). The integration of synthetic radiomic data consistently enhanced the performance of predictive models trained with small sample size sets (AUC enhanced by 9.6%, 11.3%, and 16.7% for models trained on n_samples = 200, 400, and 1000 lesions, respectively). In contrast, synthetic data generated from randomised/noisy radiomic data failed to enhance predictive performance underlining the requirement of true signal data to do so. CONCLUSION: Synthetic radiomic data, when combined with real radiomics, could enhance the performance of predictive models. Tabular synthetic data generation might help to overcome limitations in medical AI stemming from data scarcity.


Assuntos
Neoplasias Colorretais , Tomografia Computadorizada por Raios X , Humanos , Neoplasias Colorretais/diagnóstico por imagem , Neoplasias Colorretais/genética , Feminino , Masculino , Tomografia Computadorizada por Raios X/métodos , Estudos Retrospectivos , Pessoa de Meia-Idade , Idoso , Genômica , Proteína Supressora de Tumor p53/genética , Radiômica
10.
Heliyon ; 10(1): e20173, 2024 Jan 15.
Artigo em Inglês | MEDLINE | ID: mdl-38173493

RESUMO

Detection of volatile organic compounds in exhaled air is a promising approach to non-invasive and scalable gastric cancer screening. This work proposes a new approach for the detection of volatile organic compounds by analyzing odor-evoked calcium responses in the rat olfactory bulb. We estimate the feasibility of gastric cancer biomarker detection added to the exhaled air of healthy participants. Our detector consists of a convolutional encoder and a similarity-based classifier over encoder outputs. To minimize overfitting on a small available training set, we involve a pre-training where the encoder is trained on synthetic data representing spatiotemporal patterns similar to real calcium responses in the olfactory bulb. We estimate the classification accuracy of exhaled air samples by matching their encodings with encodings of calibration samples of two classes: 1) exhaled air and 2) a mixture of exhaled air with the cancer biomarker. On our data, the accuracy increased from 0.68 on real data up to 0.74 if pre-training on synthetic data is involved. Our work is focused on proving the feasibility of proposed new approach rather than on comparing its efficiency with existing methods. Such detection is often performed with an electronic nose, but its output becomes unstable over time due to a sensor drift. In contrast to the electronic nose, rats can robustly detect low concentrations of biomarkers over lifetime. The feasibility of gastric cancer biomarker detection in exhaled air by bio-hybrid system is shown. Pre-training of neural models for images analysis increases the accuracy of detection.

11.
Comput Struct Biotechnol J ; 23: 2892-2910, 2024 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-39108677

RESUMO

Synthetic data generation has emerged as a promising solution to overcome the challenges which are posed by data scarcity and privacy concerns, as well as, to address the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases with a great focus on tabular, imaging, radiomics, time-series, and omics data. Studies involving multi-modal synthetic data generation were also explored. The type of method used for the synthetic data generation process was identified in each study and was categorized into statistical, probabilistic, machine learning, and deep learning. Emphasis was given to the programming languages used for the implementation of each method. Our evaluation revealed that the majority of the studies utilize synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multimodal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning based synthetic data generators in 72.6 % of the included studies, with 75.3 % of the generators being implemented in Python. A thorough documentation of open-source repositories is finally provided to accelerate research in the field.

SELEÇÃO DE REFERÊNCIAS
Detalhe da pesquisa