ABSTRACT
The number of papers presenting machine learning (ML) models that are being submitted to and published in the Journal of Medical Internet Research and other JMIR Publications journals has steadily increased. Editors and peer reviewers involved in the review process for such manuscripts often go through multiple review cycles to enhance the quality and completeness of reporting. The use of reporting guidelines or checklists can help ensure consistency in the quality of submitted (and published) scientific manuscripts and, for example, avoid instances of missing information. In this Editorial, the editors of JMIR Publications journals discuss the general JMIR Publications policy regarding authors' application of reporting guidelines and specifically focus on the reporting of ML studies in JMIR Publications journals, using the Consolidated Reporting of Machine Learning Studies (CREMLS) guidelines, with an example of how authors and other journals could use the CREMLS checklist to ensure transparency and rigor in reporting.
Subjects
Machine Learning, Humans, Guidelines as Topic, Prognosis, Checklist
ABSTRACT
OBJECTIVE: To provide a brief overview of artificial intelligence (AI) applications within the field of eating disorders (EDs) and propose focused solutions for research. METHOD: An overview and summary of AI applications pertinent to EDs is provided, with a focus on AI's ability to address issues relating to data sharing and pooling (and associated privacy concerns), data augmentation, and bias within datasets. RESULTS: In addition to clinical applications, AI provides useful tools to help address commonly encountered challenges in ED research, including the low prevalence of specific subpopulations of patients, small overall sample sizes, and bias within datasets. DISCUSSION: There is tremendous potential to embed and utilize various facets of AI to help improve our understanding of EDs and to evaluate and investigate questions that ultimately seek to improve outcomes. Beyond the technology, regulation of AI, ethical guidelines for its application, and the trust of providers and patients are all needed for its ultimate adoption and acceptance into ED practice. PUBLIC SIGNIFICANCE: Artificial intelligence (AI) holds significant potential within the realm of eating disorders (EDs) and encompasses a broad set of techniques with utility in many facets of ED research and, by extension, the delivery of clinical care. Beyond the technology, regulation, ethical guidelines for application, and the trust of providers and patients are needed for the ultimate adoption and acceptance of AI into ED practice.
Subjects
Artificial Intelligence, Feeding and Eating Disorders, Humans, Feeding and Eating Disorders/therapy, Biomedical Research
ABSTRACT
Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against a synthetic data-augmentation method that utilizes sequential boosted decision trees to synthesize under-represented groups. The approach is called synthetic minority augmentation (SMA). Through simulations and analysis of real health datasets on a logistic regression workload, the approaches are evaluated across various bias scenarios (types and severity levels). Performance was assessed based on area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces the closest results to the ground truth in low to medium bias (50% or less missing proportion). In high bias (80% or more missing proportion), the advantage of SMA is not obvious, with no specific method consistently outperforming others.
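To make the SMA idea concrete, the following is a minimal sketch of sequential synthesis with boosted trees for augmenting an under-represented group. It is a simplification under assumed conventions (numeric columns, a fixed synthesis order, a cardinality heuristic for treating columns as categorical), not the authors' implementation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def synthesize_minority(df_minority: pd.DataFrame, n_new: int, seed: int = 0) -> pd.DataFrame:
    """Synthesize n_new records resembling the minority subgroup.

    The first column is bootstrapped; each subsequent column is drawn from a
    boosted tree fitted on the real minority data, conditional on the columns
    already synthesized.
    """
    rng = np.random.default_rng(seed)
    cols = list(df_minority.columns)
    synth = pd.DataFrame({cols[0]: rng.choice(df_minority[cols[0]].to_numpy(), size=n_new)})
    for i, col in enumerate(cols[1:], start=1):
        X_real, y_real = df_minority[cols[:i]], df_minority[col]
        if y_real.nunique() <= 10:  # heuristic: treat low-cardinality columns as categorical
            clf = GradientBoostingClassifier(random_state=seed).fit(X_real, y_real)
            proba = clf.predict_proba(synth[cols[:i]])
            synth[col] = [rng.choice(clf.classes_, p=p) for p in proba]
        else:
            reg = GradientBoostingRegressor(random_state=seed).fit(X_real, y_real)
            resid = (y_real - reg.predict(X_real)).to_numpy()  # resampled residuals add variability
            synth[col] = reg.predict(synth[cols[:i]]) + rng.choice(resid, size=n_new)
    return synth
```

The augmented training set would then be the real data plus the rows returned by `synthesize_minority`, after which the logistic regression workload is fit as usual.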
ABSTRACT
BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. Synthetic data, by contrast, are subject to far fewer restrictions on use and sharing than real data, allowing researchers to access data more rapidly and with fewer privacy constraints. There has therefore been growing interest in methods that generate synthetic data protecting patients' privacy while faithfully reflecting the real data. OBJECTIVE: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. METHODS: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from the latent factor matrix of the GCP decomposition that contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments that evaluate the structure and general patterns present in the data, such as the dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating any one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of the GCP decomposition associated with patients. RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that are sufficiently similar to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records: we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
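As an illustration of the decompose-resample-reconstruct pattern described above, the following sketch substitutes tensorly's plain CP decomposition (parafac) for GCP, and a simple bootstrap-plus-jitter of the patient factor matrix for the paper's decision-tree, copula, and Hamiltonian Monte Carlo samplers:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def synthesize_from_cp(tensor: np.ndarray, rank: int, n_patients: int, seed: int = 0) -> np.ndarray:
    """Decompose a patients x time x features tensor, simulate new patient
    factors, and reconstruct a synthetic tensor with n_patients records."""
    rng = np.random.default_rng(seed)
    weights, factors = parafac(tl.tensor(tensor), rank=rank, init="random", random_state=seed)
    patient_factors = tl.to_numpy(factors[0])
    # Simulate new patients by resampling rows of the patient factor matrix
    # and adding small Gaussian jitter (a stand-in for the paper's samplers).
    idx = rng.integers(0, patient_factors.shape[0], size=n_patients)
    jitter = rng.normal(0.0, 0.1 * patient_factors.std(axis=0) + 1e-9, size=(n_patients, rank))
    factors[0] = tl.tensor(patient_factors[idx] + jitter)
    return tl.to_numpy(tl.cp_to_tensor((weights, factors)))
```

Note that `n_patients` can differ from the original patient count, mirroring the sampling property highlighted in the results.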
ABSTRACT
Patients, families, healthcare providers and funders face multiple comparable treatment options without knowing which provides the best quality of care. As a step towards improving this, the REthinking Clinical Trials (REaCT) pragmatic trials program started in 2014 to break down many of the traditional barriers to performing clinical trials. However, until other innovative methodologies become widely used, the impact of this program will remain limited. These innovations include the incorporation of near-equivalence analyses and of artificial intelligence (AI) into clinical trial design. Near-equivalence analyses allow for the comparison of different treatments (drug and non-drug) using quality of life, toxicity, cost-effectiveness, and pharmacokinetic/pharmacodynamic data. AI offers unique opportunities to maximize the information gleaned from clinical trials, reduce sample size estimates, and potentially "rescue" poorly accruing trials. On 2 May 2023, the first REaCT international symposium took place to connect clinicians and scientists, set goals and identify future avenues for investigator-led clinical trials. Here, we summarize the topics presented at this meeting to promote sharing and to support other similarly motivated groups in learning from and sharing their own experiences.
Subjects
Neoplasms, Quality of Life, Humans, Artificial Intelligence, Health Personnel, Neoplasms/therapy, Quality of Health Care, Clinical Trials as Topic
ABSTRACT
Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data have high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and statistical power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
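The combining step is the crux of this workflow. Below is a minimal sketch of combining rules for fully synthetic datasets in the style of Raghunathan, Reiter, and Rubin, applied to one regression coefficient; the nonnegativity fallback shown is one common choice and may differ from the paper's exact variant:

```python
import numpy as np
from scipy import stats

def combine_fully_synthetic(q, u, alpha=0.05):
    """Combine point estimates q and variances u from m fully synthetic datasets."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    q_bar = q.mean()                    # combined point estimate
    b = q.var(ddof=1)                   # between-synthesis variance
    u_bar = u.mean()                    # mean within-synthesis variance
    T = (1 + 1 / m) * b - u_bar         # total variance for fully synthetic data
    if T <= 0:
        T = (1 + 1 / m) * b             # one common nonnegativity adjustment
    df = (m - 1) * (1 - u_bar / ((1 + 1 / m) * b)) ** 2
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(T)
    return q_bar, T, (q_bar - half, q_bar + half)
```

With m = 10 synthetic datasets, `q` would hold the ten fitted log-odds ratios for one covariate and `u` their squared standard errors.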
Subjects
Benchmarking, Disclosure, Computer Simulation, Logistic Models, Mental Processes
ABSTRACT
PURPOSE: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns, but there is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of this study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing the concordance of effect estimates and CIs between the real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS: Utility was highest using the sequential synthesis method, for which all results were replicable and the CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risk were low across all three types of generative models. DISCUSSION: Synthetic data generated using sequential synthesis methods can act as a proxy for real clinical trial data sets while maintaining low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.
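For reference, here is a minimal sketch of the confidence interval overlap metric used in this kind of utility evaluation, following one common definition (the average proportion of each interval covered by their intersection); the paper's exact formula may differ:

```python
def ci_overlap(real_lo: float, real_hi: float, syn_lo: float, syn_hi: float) -> float:
    """Average fraction of the real and synthetic CIs covered by their intersection."""
    inter = max(0.0, min(real_hi, syn_hi) - max(real_lo, syn_lo))
    return 0.5 * (inter / (real_hi - real_lo) + inter / (syn_hi - syn_lo))
```

A value of 1 indicates identical intervals; 0 indicates disjoint intervals.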
Subjects
Breast Neoplasms, Privacy, Humans, Female, Breast Neoplasms/diagnosis, Breast Neoplasms/therapy, Medical Oncology, Researchers
ABSTRACT
BACKGROUND: It is evident that COVID-19 will remain a public health concern in the coming years, largely driven by variants of concern (VOC). It is critical to continuously monitor vaccine effectiveness as new variants emerge and new vaccines and/or boosters are developed. Systematic surveillance of the scientific evidence base is necessary to inform public health action and identify key uncertainties. Evidence syntheses may also be used to populate models to fill in research gaps and help to prepare for future public health crises. This protocol outlines the rationale and methods for a living evidence synthesis of the effectiveness of COVID-19 vaccines in reducing the morbidity and mortality associated with, and transmission of, VOC of SARS-CoV-2. METHODS: Living evidence syntheses of vaccine effectiveness will be carried out over one year for (1) a range of potential outcomes in the index individual associated with VOC (pathogenesis); and (2) transmission of VOC. The literature search will be conducted up to May 2023. Observational and database-linkage primary studies will be included, as well as RCTs. Information sources include electronic databases (MEDLINE; Embase; Cochrane, L*OVE; the CNKI and Wangfang platforms), pre-print servers (medRxiv, BiorXiv), and online repositories of grey literature. Title and abstract and full-text screening will be performed by two reviewers using a liberal accelerated method. Data extraction and risk of bias assessment will be completed by one reviewer with verification of the assessment by a second reviewer. Results from included studies will be pooled via random effects meta-analysis when appropriate, or otherwise summarized narratively. DISCUSSION: Evidence generated from our living evidence synthesis will be used to inform policy making, modelling, and prioritization of future research on the effectiveness of COVID-19 vaccines against VOC.
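Where pooling is appropriate, a standard random-effects step such as DerSimonian-Laird could be used. The sketch below is illustrative only and is not taken from the protocol:

```python
import numpy as np

def dersimonian_laird(y, v):
    """Pool effect estimates y (e.g., log risk ratios) with variances v."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1 / v
    y_fixed = (w * y).sum() / w.sum()
    Q = (w * (y - y_fixed) ** 2).sum()                  # heterogeneity statistic
    c = w.sum() - (w ** 2).sum() / w.sum()
    tau2 = max(0.0, (Q - (len(y) - 1)) / c)             # between-study variance
    w_star = 1 / (v + tau2)                             # random-effects weights
    y_re = (w_star * y).sum() / w_star.sum()
    se = np.sqrt(1 / w_star.sum())
    return y_re, (y_re - 1.96 * se, y_re + 1.96 * se), tau2
```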
Subjects
COVID-19, Humans, COVID-19/prevention & control, COVID-19 Vaccines, SARS-CoV-2, Vaccine Efficacy, Bias, Meta-Analysis as Topic
ABSTRACT
BACKGROUND: The reporting of machine learning (ML) prognostic and diagnostic modeling studies is often inadequate, making it difficult to understand and replicate such studies. To address this issue, multiple consensus and expert reporting guidelines for ML studies have been published. However, these guidelines cover different parts of the analytics lifecycle, and individually, none of them provides a complete set of reporting requirements. OBJECTIVE: We aimed to consolidate the ML reporting guidelines and checklists in the literature to provide reporting items for prognostic and diagnostic ML in in-silico and shadow mode studies. METHODS: We conducted a literature search that identified 192 unique peer-reviewed English articles providing guidance and checklists for reporting ML studies. The articles were screened by title and abstract against a set of 9 inclusion and exclusion criteria. Articles that passed screening had their quality evaluated by 2 raters using a 9-point checklist constructed from guideline development good practices; the average κ across all quality criteria was 0.71. The 17 articles with a quality score at or above the median were retained as high-quality source papers. The reporting items in these 17 articles were consolidated and screened against a set of 6 inclusion and exclusion criteria. The resulting reporting items were sent to an external group of 11 ML experts for review and updated accordingly. The updated checklist was used to assess the reporting in 6 recent modeling papers in JMIR AI. Feedback from the external review and initial validation efforts was used to improve the reporting items. RESULTS: In total, 37 reporting items were identified and grouped into 5 categories based on the stage of the ML project: defining the study details, defining and collecting the data, modeling methodology, model evaluation, and explainability. None of the 17 source articles covered all the reporting items. The study details and data description reporting items were the most common in the source literature, while explainability and methodology guidance (ie, data preparation and model training) had the least coverage. For instance, a median of 75% of the data description reporting items appeared in each of the 17 high-quality source guidelines, but only a median of 33% of the explainability reporting items appeared. The highest-quality source articles tended to have more items on reporting study details; other categories of reporting items were not related to source article quality. We converted the reporting items into a checklist to support more complete reporting. CONCLUSIONS: Our findings support the need for a consolidated set of reporting items, given that existing high-quality guidelines and checklists do not individually provide complete coverage. The consolidated set of reporting items is expected to improve the quality and reproducibility of ML modeling studies.
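For the inter-rater agreement reported in the methods, Cohen's κ can be computed directly; the per-criterion ratings below are hypothetical placeholders, not the study's data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pass/fail ratings by two raters on the 9 quality criteria.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1]
print(cohen_kappa_score(rater_a, rater_b))  # chance-corrected agreement
```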
Subjects
Checklist, Machine Learning, Humans, Prognosis, Reproducibility of Results, Consensus
ABSTRACT
Sharing health data for research purposes across international jurisdictions has been a challenge due to privacy concerns. Two privacy-enhancing technologies that can enable such sharing are synthetic data generation (SDG) and federated analysis, but their relative strengths and weaknesses have not previously been evaluated. In this study we compared SDG with federated analysis for enabling such international comparative studies. The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals. The Canadian data was synthesized and sent to the Austrian team for analysis. The utility of the pooled (synthetic Canadian + real Austrian) dataset was evaluated by comparing the regression results from the two approaches. The privacy of the Canadian synthetic data was assessed using a membership disclosure test, which showed an F1 score of 0.001, indicating low privacy risk. The outcome variable of interest was CVH, calculated through a modified CANHEART index. The main and interaction effect parameter estimates of the federated and pooled analyses were consistent and directionally the same. It took approximately one month to set up the synthetic data generation platform and generate the synthetic data, whereas it took over 1.5 years to set up the federated analysis system. Synthetic data generation can therefore be an efficient and effective tool for enabling multi-jurisdictional studies while addressing privacy concerns.
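A minimal sketch of a membership disclosure test of the kind reported above, assuming discretized features and a Hamming-distance matching rule (the specific matching rule and threshold are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def membership_f1(attack_X: np.ndarray, is_member: np.ndarray,
                  synth_X: np.ndarray, threshold: int) -> float:
    """attack_X: adversary's population records; is_member: ground-truth labels
    (1 if the record was in the training data); synth_X: synthetic records;
    threshold: maximum Hamming distance at which the adversary claims membership."""
    claims = np.array([
        int((synth_X != rec).sum(axis=1).min() <= threshold) for rec in attack_X
    ])
    return f1_score(is_member, claims)
```

The reported F1 of 0.001 means the adversary's membership claims were almost never correct.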
Subjects
Cardiovascular System, Humans, Canada, Austria, Disclosure, Privacy
ABSTRACT
A status update on applying generative AI to synthetic data generation.
ABSTRACT
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets in which the records do not correspond to real individuals but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data come from 120,000 individuals in Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that examine the structure and general patterns in the data, as well as by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used the Hellinger distance to quantify the difference in distributions between the real and synthetic datasets: event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1 mean absolute difference 0.0896, SD 0.159; order 2 mean Hellinger distance 0.2195, SD 0.2724), and the joint distributions (0.352); random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064. Together these indicate small differences between the distributions in the real and the synthetic data. By applying a realistic analysis to both datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produce analytic results similar to those from real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially below the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
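For reference, the Hellinger distance used throughout these utility assessments can be computed for two discrete distributions as follows (a standard formula, not code from the paper):

```python
import numpy as np

def hellinger(p, q) -> float:
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))
```

Applied, for example, to the relative frequencies of event types in the real and synthetic datasets, it yields values such as the 0.027 reported above.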
Subjects
Disclosure, Privacy, Humans, Data Collection
ABSTRACT
Aims: The aim of this study was to elucidate whether sex and gender factors influence access to health care and/or are associated with cardiovascular (CV) outcomes of individuals with diabetes mellitus (DM) across different countries. Methods: Data from the Canadian Community Health Survey (8.4% of respondents reporting DM) and the European Health Interview Survey (7.3% of respondents reporting DM) were analyzed. Self-reported sex was recorded, and a composite measure of socio-cultural gender was constructed (range: 0-1; higher scores represent participants who reported more characteristics traditionally ascribed to women). For the purposes of the analyses, the Gender Inequality Index (GII) was used as a country-level measure of institutionalized gender. Results: Canadian females with DM were more likely to undergo HbA1c monitoring than males (OR = 1.26, 95% CI: 1.01-1.58), while conversely, females with DM in the European cohort were less likely to have their blood sugar measured than males (OR = 0.88, 95% CI: 0.79-0.99). A higher gender score in both cohorts was associated with less frequent diabetes monitoring. Additionally, independent of sex, higher gender scores were associated with a higher prevalence of self-reported heart disease, stroke, and hospitalization in all countries, although European countries with medium-high GII conferred a higher risk of all outcomes and higher hospitalization rates than low-GII countries. Conclusion: Regardless of sex, individuals with DM who reported characteristics typically ascribed to women, and those living in countries with greater gender inequity for women, exhibited poorer diabetes care and greater risk of CV outcomes and hospitalizations.
Subjects
Diabetes Mellitus, Stroke, Male, Humans, Female, Canada, Diabetes Mellitus/epidemiology, Surveys and Questionnaires, Health Services Accessibility
ABSTRACT
Synthetic data generation (SDG) is the process of using machine learning methods to train a model that captures the patterns in a real dataset, from which new, synthetic data can then be generated. The synthetic data do not have a one-to-one mapping to the original data or to real patients, and therefore have potential privacy-preserving properties. There is a growing interest in the application of synthetic data across health and life sciences, but to fully realize the benefits, further education, research, and policy innovation are required. This article summarizes the opportunities and challenges of SDG for health data, and provides directions for how this technology can be leveraged to accelerate data access for secondary purposes.
ABSTRACT
Background: One increasingly accepted way to evaluate the privacy of synthetic data is to measure the risk of membership disclosure: the F1 accuracy with which an adversary could correctly determine that a target individual from the same population as the real data was in the dataset used to train the generative model. This risk is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. Objective: To validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. Materials and methods: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. Results: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must equal the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. Conclusions: Our proposed parameterization, together with interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
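A minimal sketch of constructing the attack set under the proposed parameterization, in which the fraction of training records in the attack set equals the sampling fraction of the real data from the population (illustrative, with assumed index-based bookkeeping):

```python
import numpy as np

def build_attack_set(train_idx: np.ndarray, population_idx: np.ndarray,
                     sampling_fraction: float, n_attack: int, seed: int = 0):
    """Return attack-set indices and ground-truth membership labels."""
    rng = np.random.default_rng(seed)
    n_members = int(round(sampling_fraction * n_attack))  # matches the sampling fraction
    members = rng.choice(train_idx, size=n_members, replace=False)
    non_train = np.setdiff1d(population_idx, train_idx)
    non_members = rng.choice(non_train, size=n_attack - n_members, replace=False)
    idx = np.concatenate([members, non_members])
    labels = np.concatenate([np.ones(n_members, int), np.zeros(n_attack - n_members, int)])
    return idx, labels
```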
ABSTRACT
BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population. OBJECTIVES: Develop an accurate risk estimator for the sample-to-population attack. METHODS: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature. RESULTS: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset. CONCLUSIONS: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
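The quantity being estimated can be illustrated with the standard sample-to-population formulation: each sample record's re-identification probability is 1/F_k, where F_k is the size of its quasi-identifier equivalence class in the population. The sketch below computes this directly against a (real or synthetic) population file; it illustrates the target quantity, not the copula estimators themselves, and assumes two or more quasi-identifier columns:

```python
import pandas as pd

def sample_to_population_risk(sample: pd.DataFrame, population: pd.DataFrame,
                              qids: list) -> float:
    """Average re-identification probability for a sample-to-population attack."""
    pop_counts = population.groupby(qids).size()            # equivalence-class sizes F_k
    keys = list(sample[qids].itertuples(index=False, name=None))
    Fk = pop_counts.reindex(keys).fillna(0)
    risks = (1 / Fk.where(Fk > 0)).fillna(0)                # unmatched records contribute 0
    return float(risks.mean())
```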
Subjects
COVID-19, COVID-19/epidemiology, Humans, Information Dissemination, Privacy, Probability, Risk
ABSTRACT
OBJECTIVE: To examine sex and gender roles in COVID-19 test positivity and hospitalisation in sex-stratified predictive models using machine learning. DESIGN: Cross-sectional study. SETTING: UK Biobank prospective cohort. PARTICIPANTS: Participants tested between 16 March 2020 and 18 May 2020 were analysed. MAIN OUTCOME MEASURES: The endpoints of the study were COVID-19 test positivity and hospitalisation. Forty-two variables covering demographics, psychosocial factors and comorbidities were used as candidate determinants of the outcomes. Gradient boosting machines were used to build the prediction models. RESULTS: Of 4510 individuals tested (51.2% female, mean age=68.5±8.9 years), 29.4% tested positive. Males were more likely to test positive than females (31.6% vs 27.3%, p=0.001). In females, living in more deprived areas, lower income, an increased low-density lipoprotein (LDL) to high-density lipoprotein (HDL) ratio, working night shifts and living with a greater number of family members were associated with a higher likelihood of a positive COVID-19 test. In males, greater body mass index and LDL to HDL ratio were the factors associated with a positive test. Older age and adverse cardiometabolic characteristics were the most prominent variables associated with hospitalisation of test-positive patients in both the overall and the sex-stratified models. CONCLUSION: High-risk jobs, crowded living arrangements and living in deprived areas were associated with increased COVID-19 infection in females, while high-risk cardiometabolic characteristics were more influential in males. Gender-related factors have a greater impact on females; hence, they should be considered in identifying priority groups for COVID-19 vaccination campaigns.
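A minimal sketch of the sex-stratified gradient boosting workflow described above; the column names and cross-validation setup are assumptions, and the 42 predictors are not reproduced:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def fit_sex_stratified(df: pd.DataFrame, predictors: list, outcome: str = "test_positive"):
    """Fit one gradient boosting model per sex and report cross-validated AUC."""
    results = {}
    for sex, grp in df.groupby("sex"):  # assumed column encoding female/male
        model = GradientBoostingClassifier(random_state=0)
        auc = cross_val_score(model, grp[predictors], grp[outcome],
                              cv=5, scoring="roc_auc").mean()
        results[sex] = (model.fit(grp[predictors], grp[outcome]), auc)
    return results
```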
Subjects
COVID-19, Cardiovascular Diseases, Aged, Biological Specimen Banks, COVID-19/epidemiology, Cross-Sectional Studies, Female, Hospitalization, Humans, Machine Learning, Male, Middle Aged, Prospective Studies, United Kingdom/epidemiology
ABSTRACT
BACKGROUND: A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. OBJECTIVE: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. METHODS: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance, defined as the differences in the area under the receiver operating characteristic curve and the area under the precision-recall curve between logistic regression prediction models built on synthetic data and those built on real data. RESULTS: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions. CONCLUSIONS: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
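A minimal sketch of a multivariate Hellinger distance under a Gaussian copula representation: each column is mapped to normal scores via ranks, and the closed form for the Hellinger distance between two zero-mean Gaussians is applied to the resulting correlation matrices. This illustrates the idea; the paper's exact construction may differ:

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(X: np.ndarray) -> np.ndarray:
    """Rank-based transform of each column to standard normal scores."""
    ranks = np.apply_along_axis(rankdata, 0, X)
    return norm.ppf(ranks / (X.shape[0] + 1))

def copula_hellinger(real: np.ndarray, synth: np.ndarray) -> float:
    S1 = np.corrcoef(normal_scores(real), rowvar=False)
    S2 = np.corrcoef(normal_scores(synth), rowvar=False)
    # H^2 = 1 - det(S1)^(1/4) det(S2)^(1/4) / det((S1 + S2) / 2)^(1/2)
    num = np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
    den = np.linalg.det((S1 + S2) / 2) ** 0.5
    return float(np.sqrt(max(0.0, 1 - num / den)))
```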
ABSTRACT
This article provides a state-of-the-art summary of location privacy issues and geoprivacy-preserving methods in public health interventions and health research involving disaggregate geographic data about individuals. Synthetic data generation (from real data using machine learning) is discussed in detail as a promising privacy-preserving approach. To fully achieve their goals, privacy-preserving methods should form part of a wider comprehensive socio-technical framework for the appropriate disclosure, use and dissemination of data containing personal identifiable information. Select highlights are also presented from a related December 2021 AAG (American Association of Geographers) webinar that explored ethical and other issues surrounding the use of geospatial data to address public health issues during challenging crises, such as the COVID-19 pandemic.
Subjects
COVID-19, Privacy, Confidentiality, Humans, Pandemics, Public Health, SARS-CoV-2, Social Justice
ABSTRACT
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) over the last decade, our systematic review provides a comprehensive bird's-eye view of algorithms for SMHD anonymization. Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and the relevant algorithms and/or software applied to anonymize SMHD, along with a summary of outcomes. Results: Among 1,804 initial hits, the present study considered 63 records, including research articles, reviews, and books. Seventy-five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm (e.g., l-diversity). No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes. Conclusion: This study reviewed and categorized different anonymization approaches for SMHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches. Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
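As a concrete reference for the most common approach in the review, a k-anonymity check over a set of quasi-identifiers is a one-liner (a standard definition, not code from any reviewed study):

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, qids: list) -> int:
    """Smallest quasi-identifier equivalence class; the data are k-anonymous for this k."""
    return int(df.groupby(qids).size().min())
```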