Results 1 - 11 of 11
1.
Proc Natl Acad Sci U S A ; 120(43): e2220558120, 2023 Oct 24.
Article in English | MEDLINE | ID: mdl-37831744

ABSTRACT

The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. We argue that any proposal for quantifying disclosure risk should be based on prespecified, objective criteria. We illustrate this approach to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. More research is needed, but in the near term, the counterfactual approach appears best-suited for privacy versus utility analysis.


Subjects
Confidentiality , Disclosure , Privacy , Risk Assessment , Censuses
2.
Proc Natl Acad Sci U S A ; 119(31): e2104906119, 2022 Aug 02.
Article in English | MEDLINE | ID: mdl-35878030

ABSTRACT

The federal statistical system is experiencing competing pressures for change. On the one hand, for confidentiality reasons, much socially valuable data currently held by federal agencies is either not made available to researchers at all or only made available under onerous conditions. On the other hand, agencies which release public databases face new challenges in protecting the privacy of the subjects in those databases, which leads them to consider releasing fewer data or masking the data in ways that will reduce their accuracy. In this essay, we argue that the discussion has not given proper consideration to the reduced social benefits of data availability and their usability relative to the value of increased levels of privacy protection. A more balanced benefit-cost framework should be used to assess these trade-offs. We express concerns both with synthetic data methods for disclosure limitation, which will reduce the types of research that can be reliably conducted in unknown ways, and with differential privacy criteria that use what we argue is an inappropriate measure of disclosure risk. We recommend that the measure of disclosure risk used to assess all disclosure protection methods focus on what we believe is the risk that individuals should care about, that more study of the impact of differential privacy criteria and synthetic data methods on data usability for research be conducted before either is put into widespread use, and that more research be conducted on alternative methods of disclosure risk reduction that better balance benefits and costs.


Subjects
Computer Security , Confidentiality , Privacy , Data Collection , Disclosure , Federal Government , Government Agencies
3.
Popul Health Metr ; 21(1): 19, 2023 Oct 31.
Article in English | MEDLINE | ID: mdl-37907904

ABSTRACT

BACKGROUND: To develop public health intervention models using micro-simulations, extensive personal information about inhabitants is needed, such as socio-demographic, economic and health figures. Confidentiality is an essential characteristic of such data, while the data should reflect realistic scenarios. Collection of such data is possible only in secured environments and not directly available for open-source micro-simulation models. The aim of this paper is to illustrate a method of construction of synthetic data by predicting individual features through models based on confidential data on health and socio-economic determinants of the entire Dutch population. METHODS: Administrative records and health registry data were linked to socio-economic characteristics and self-reported lifestyle factors. For the entire Dutch population (n = 16,778,708), all socio-demographic information except lifestyle factors was available. Lifestyle factors were available from the 2012 Dutch Health Monitor (n = 370,835). Regression models were used to sequentially predict individual features. RESULTS: The synthetic population resembles the original confidential population. Features predicted in the first stages of the sequential procedure are very similar to those in the original population, while those predicted in later stages carry the accumulated limitations of data quality and of previously modelled features. CONCLUSIONS: By combining socio-demographic, economic, health and lifestyle related data at individual level on a large scale, our method provides us with a powerful tool to construct a synthetic population of good quality and with no confidentiality issues.


Subjects
Big Data , Life Style , Humans
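
The sequential prediction idea in the abstract above can be illustrated with a minimal two-stage sketch. This is not the authors' procedure: the two variables (age, income), the single-predictor least-squares fit, the normal residual noise, and the function names `fit_line` and `synthesise` are all assumptions chosen for illustration. Stage 1 here simply resamples the first variable rather than modelling it.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, residual standard deviation).
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, statistics.pstdev(residuals)


def synthesise(age, income, n, seed=0):
    """Build n synthetic (age, income) pairs in two sequential stages."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    slope, intercept, sd = fit_line(age, income)
    # Stage 1 (assumption): draw the first feature from its observed values.
    synth_age = [rng.choice(age) for _ in range(n)]
    # Stage 2: predict the next feature from the already-synthesised one,
    # adding noise with the spread of the fitted residuals.
    synth_income = [intercept + slope * a + rng.gauss(0, sd) for a in synth_age]
    return list(zip(synth_age, synth_income))


# Made-up illustrative data, not taken from the paper.
ages = [23, 35, 41, 52, 60, 29, 47]
incomes = [21.0, 33.5, 38.0, 45.2, 44.1, 27.3, 41.9]
for row in synthesise(ages, incomes, n=3):
    print(row)
```

Because each later stage conditions on values generated earlier, errors introduced in stage 1 propagate into stage 2, which is consistent with the abstract's remark that later features carry accumulated limitations.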
4.
BMC Med Inform Decis Mak ; 22(1): 24, 2022 Jan 28.
Article in English | MEDLINE | ID: mdl-35090447

ABSTRACT

BACKGROUND: Data privacy is one of the biggest challenges for any organisation which processes personal data, especially in the area of medical research where data include sensitive information about patients and study participants. Sharing of data is therefore problematic, which is at odds with the principle of open data that is so important to the advancement of society and science. Several statistical methods and computational tools have been developed to help data custodians and analysts overcome this challenge. METHODS: In this paper, we propose a new deterministic approach for anonymising personal data. The method stratifies the underlying data by the categorical variables and re-distributes the continuous variables through a k nearest neighbours based algorithm. RESULTS: We demonstrate the use of the deterministic anonymisation on real data, including data from a sample of Titanic passengers, and data from participants in the 1958 Birth Cohort. CONCLUSIONS: The proposed procedure makes data re-identification difficult while minimising the loss of utility (by preserving the spatial properties of the underlying data); the latter means that informative statistical analysis can still be conducted.


Subjects
Biomedical Research , Privacy , Data Anonymization , Humans
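
To make the stratify-then-redistribute idea in the abstract above concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' algorithm: it handles one categorical and one continuous variable, and it replaces each continuous value with the mean of its k nearest values (itself included) inside the same stratum. The function name `anonymise` and the sample records are hypothetical.

```python
from collections import defaultdict


def anonymise(records, k):
    """Deterministically smooth continuous values within categorical strata.

    records: sequence of (category, value) pairs.
    Returns a list in the same order, where each value is the mean of the
    k values closest to it in its own stratum (fewer if the stratum is small).
    """
    strata = defaultdict(list)
    for idx, (category, value) in enumerate(records):
        strata[category].append((idx, value))

    out = list(records)
    for category, members in strata.items():
        values = [v for _, v in members]
        for idx, value in members:
            # Rank stratum values by distance; ties are broken by the value
            # itself so that the result never depends on input order.
            nearest = sorted(values, key=lambda v: (abs(v - value), v))[:k]
            out[idx] = (category, sum(nearest) / len(nearest))
    return out


# Made-up records: (passenger class, fare)
records = [("a", 1.0), ("a", 2.0), ("a", 10.0), ("b", 5.0)]
print(anonymise(records, k=2))
# -> [('a', 1.5), ('a', 1.5), ('a', 6.0), ('b', 5.0)]
```

No random numbers are involved, so the same input always yields the same output. Note that the singleton stratum "b" is left unchanged, which shows why small strata would need separate handling in any real use.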
5.
Entropy (Basel) ; 24(5), 2022 May 10.
Article in English | MEDLINE | ID: mdl-35626554

ABSTRACT

Preserving confidentiality of individuals in data disclosure is a prime concern for public and private organizations. The main challenge in the data disclosure problem is to release data such that misuse by intruders is avoided while providing useful information to legitimate users for analysis. We propose an information theoretic architecture for the data disclosure problem. The proposed framework consists of developing a maximum entropy (ME) model based on statistical information of the actual data, testing the adequacy of the ME model, producing disclosure data from the ME model and quantifying the discrepancy between the actual and the disclosure data. The architecture can be used both for univariate and multivariate data disclosure. We illustrate the implementation of our approach using financial data.
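
The pipeline described above (fit a maximum entropy model, generate disclosure data from it, measure the discrepancy) can be sketched for the univariate case. The abstract does not say which constraints the authors use, so this sketch assumes only the mean and variance are fixed, in which case the maximum entropy distribution on the real line is the normal distribution. The two-sample Kolmogorov-Smirnov statistic is used here as one possible discrepancy measure; it is a stand-in, not necessarily the paper's choice. The names `me_release` and `ks_distance` are hypothetical.

```python
import random
import statistics
from bisect import bisect_right


def me_release(actual, n_out, seed=0):
    """Draw n_out values from the maximum entropy model of `actual`.

    Assumption: only the mean and variance are constrained, so the model
    is a normal distribution with the sample mean and standard deviation.
    """
    mu = statistics.fmean(actual)
    sigma = statistics.pstdev(actual)
    rng = random.Random(seed)  # seeded so the release is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n_out)]


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical CDFs, evaluated at
    every observed point; 0 means identical samples, 1 means no overlap.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Made-up figures for illustration only.
actual = [102.0, 98.5, 110.2, 95.1, 104.7, 99.9, 101.3, 107.8]
released = me_release(actual, n_out=200)
print(round(ks_distance(actual, released), 3))
```

A small distance suggests the released values preserve the distribution's shape, while the released records themselves are fresh draws rather than copies of the originals.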

6.
Stat Med ; 37(25): 3693-3706, 2018 Nov 10.
Article in English | MEDLINE | ID: mdl-29931695

ABSTRACT

Statistical agencies are releasing statistical data to other agencies for research purposes or to inform public policy. Prior to data release, these agencies have a legal and ethical obligation to protect the confidentiality of individuals in the data. Agencies often release altered versions of the data, but there usually remain risks of disclosure. Many well-studied risk measures are available to assess risk; however, many agencies today continue to use subjective judgement, past experience, and ad hoc rules or checklists to assess disclosure risk. More recently, there has been a recognized demand for quantitative risk measures that provide more objective criteria for data release. This tutorial provides an overview of the statistical disclosure control framework for microdata. We focus on the risk analysis stage within this framework by defining existing disclosure risk measures and how to estimate them with available software.


Subjects
Confidentiality , Disclosure , Risk Assessment , Statistics as Topic , Algorithms , Confidentiality/ethics , Disclosure/ethics , Humans , Statistical Models , Risk Factors , Software , Statistics as Topic/ethics
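
As an illustration of the kind of quantitative risk measure the tutorial above refers to, the following sketch computes three simple sample-based quantities from the frequency of each quasi-identifier combination: the smallest group size (the k in k-anonymity), the share of records that are unique in the sample, and the mean of 1/f over records, where f is the size of a record's group. These are standard textbook quantities, but the selection, the function name `disclosure_risk`, and the toy records are assumptions; the abstract does not list the specific measures the tutorial covers.

```python
from collections import Counter


def disclosure_risk(records, quasi_identifiers):
    """Summarise re-identification risk from quasi-identifier frequencies.

    records: list of dicts; quasi_identifiers: keys an intruder might know.
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    freq = Counter(keys)  # size of each group sharing the same key
    n = len(keys)
    return {
        # Smallest group: the data are k-anonymous for this k.
        "k_anonymity": min(freq.values()),
        # Fraction of records that are unique in the sample.
        "share_unique": sum(1 for key in keys if freq[key] == 1) / n,
        # Average chance of a correct match when guessing within a group.
        "mean_record_risk": sum(1 / freq[key] for key in keys) / n,
    }


# Made-up microdata for illustration only.
records = [
    {"age": 30, "sex": "F", "income": 41},
    {"age": 30, "sex": "F", "income": 39},
    {"age": 41, "sex": "M", "income": 52},
    {"age": 30, "sex": "M", "income": 47},
]
print(disclosure_risk(records, ["age", "sex"]))
# -> {'k_anonymity': 1, 'share_unique': 0.5, 'mean_record_risk': 0.75}
```

These figures use sample frequencies only. A record that is unique in the sample may not be unique in the population, so measures of this kind tend to overstate risk for sampled data unless population frequencies are estimated as well.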
7.
Stat Med ; 32(24): 4139-61, 2013 Oct 30.
Article in English | MEDLINE | ID: mdl-23670983

ABSTRACT

Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are best suited for preliminary data analysis purposes. These methods address the requirement to share data in clinical research without compromising confidentiality.


Subjects
Confidentiality/standards , Disclosure/standards , Health Surveys/methods , Observational Studies as Topic/methods , Surveys and Questionnaires/standards , Colorectal Neoplasms/epidemiology , Female , Health Surveys/standards , Humans , Lung Neoplasms/epidemiology , Male
8.
J Am Med Inform Assoc ; 28(4): 801-811, 2021 Mar 18.
Article in English | MEDLINE | ID: mdl-33367620

ABSTRACT

OBJECTIVE: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data. MATERIALS AND METHODS: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data. RESULTS: Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules. DISCUSSION: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools. CONCLUSION: We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.


Subjects
Bayes Theorem , Data Anonymization , Data Management , Machine Learning , Computer Neural Networks , Confidentiality , Datasets as Topic , Disclosure , Humans , Information Dissemination
9.
J Empir Res Hum Res Ethics ; 13(3): 203-222, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29683056

ABSTRACT

Participatory sensing applications collect personal data of monitored subjects along with their spatial or spatiotemporal stamps. The attributes of a monitored subject can be private, sensitive, or confidential information. Also, the spatial or spatiotemporal attributes are prone to inferential disclosure of private information. Although there is extensive problem-oriented literature on geoinformation disclosure, our work provides a clear guideline with practical relevance, containing the steps that a research campaign should follow to preserve the participants' privacy. We first examine the technical aspects of geoprivacy in the context of participatory sensing data. Then, we propose privacy-preserving steps in four categories, namely, ensuring secure and safe settings, actions prior to the start of a research survey, processing and analysis of collected data, and safe disclosure of datasets and research deliverables.


Subjects
Confidentiality , Data Collection/methods , Guidelines as Topic , Ambulatory Monitoring/methods , Privacy , Remote Sensing Technology , Research Design , Data Analysis , Disclosure , Humans , Mobile Applications , Spatial Analysis , Surveys and Questionnaires
10.
Spat Spatiotemporal Epidemiol ; 27: 37-45, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30409375

ABSTRACT

When agencies release public-use data, they must be cognizant of the potential risk of disclosure associated with making their data publicly available. This issue is particularly pertinent in disease mapping, where small counts pose both inferential challenges and potential disclosure risks. While the small area estimation, disease mapping, and statistical disclosure limitation literatures are individually robust, there have been few intersections between them. Here, we formally propose the use of spatiotemporal data analysis methods to generate synthetic data for public use. Specifically, we analyze ten years of county-level heart disease death counts for multiple age-groups using a Bayesian model that accounts for dependence spatially, temporally, and between age-groups; generating synthetic data from the resulting posterior predictive distribution will preserve these dependencies. After demonstrating the synthetic data's privacy-preserving features, we illustrate their utility by comparing estimates of urban/rural disparities from the synthetic data to those from data with small counts suppressed.


Subjects
Confidentiality/standards , Spatio-Temporal Analysis , Medical Topography , Bayes Theorem , Disclosure , Geographic Mapping , Humans , Statistical Models , Risk , Medical Topography/ethics , Medical Topography/methods , Medical Topography/statistics & numerical data
11.
ACM J Data Inf Qual ; 7(4), 2016 Oct.
Article in English | MEDLINE | ID: mdl-27867450

ABSTRACT

Medical and health data are often collected for studying a specific disease. For such same-disease microdata, a privacy disclosure occurs as long as an individual is known to be in the microdata. Individuals in same-disease microdata are thus subject to higher disclosure risk than those in microdata with different diseases. This important problem has been overlooked in data-privacy research and practice, and no prior study has addressed this problem. In this study, we analyze the disclosure risk for the individuals in same-disease microdata and propose a new metric that is appropriate for measuring disclosure risk in this situation. An efficient algorithm is designed and implemented for anonymizing same-disease data to minimize the disclosure risk while keeping data utility as good as possible. An experimental study was conducted on real patient and population data. Experimental results show that traditional reidentification risk measures underestimate the actual disclosure risk for the individuals in same-disease microdata and demonstrate that the proposed approach is very effective in reducing the actual risk for same-disease data. This study suggests that privacy protection policy and practice for sharing medical and health data should consider not only the individuals' identifying attributes but also the health and disease information contained in the data. It is recommended that data-sharing entities employ a statistical approach, instead of HIPAA's Safe Harbor policy, when sharing same-disease microdata.
