Results 1-20 of 48

1.
Brief Bioinform; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36215114

ABSTRACT

Precision medicine relies on molecular and systems biology methods as well as bidirectional association studies of phenotypes and (high-throughput) genomic data. However, the integrated use of such data often faces obstacles, especially with regard to data protection. An important prerequisite for research data processing is usually informed consent. But collecting consent is not always feasible, in particular when data are to be analyzed retrospectively. For phenotype data, anonymization, i.e., altering the data in such a way that individuals cannot be identified, can provide an alternative. Several re-identification attacks have shown that this is a complex task and that simply removing directly identifying attributes such as names is usually not enough. More formal approaches are needed that use mathematical models to quantify risks and guide their reduction. Due to the complexity of these techniques, it is challenging and not advisable to implement them from scratch. Open software libraries and tools can provide a robust alternative. However, the range of available anonymization tools is also heterogeneous, and obtaining an overview of their strengths and weaknesses is difficult due to the complexity of the problem space. We therefore performed a systematic review of open anonymization tools for structured phenotype data described in the literature between 1990 and 2021. Through a two-step eligibility assessment process, we selected 13 tools for an in-depth analysis. By comparing the supported anonymization techniques and further aspects, such as maturity, we derived recommendations for tools to use for anonymizing phenotype datasets with different properties.
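
As an illustration of the formal risk models such tools implement, the following minimal Python sketch (toy records, tied to no particular reviewed tool) computes prosecutor re-identification risk from the sizes of the quasi-identifier equivalence classes, the quantity that k-anonymity bounds:

```python
from collections import Counter

# Hypothetical phenotype records: (age band, sex, ZIP prefix) are quasi-identifiers.
records = [
    ("30-39", "F", "981"), ("30-39", "F", "981"), ("30-39", "F", "981"),
    ("40-49", "M", "982"), ("40-49", "M", "982"), ("50-59", "F", "983"),
]

# Size of each equivalence class (records sharing the same quasi-identifier values).
class_sizes = Counter(records)

# The prosecutor risk of a record is 1 / (size of its class); the dataset-level
# worst case is driven by the smallest class, and k-anonymity holds for
# k = the minimum class size.
k = min(class_sizes.values())
max_risk = 1 / k
avg_risk = sum(1 / class_sizes[r] for r in records) / len(records)

print(f"k = {k}, worst-case re-identification risk = {max_risk:.2f}, "
      f"average risk = {avg_risk:.2f}")
```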


Subjects
Biomedical Research, Privacy, Retrospective Studies, Data Anonymization, Phenotype
2.
J Med Internet Res; 25: e43060, 2023 Oct 4.
Article in English | MEDLINE | ID: mdl-37792443

ABSTRACT

BACKGROUND: YouTube has become a popular source of health care information, reaching an estimated 81% of adults in 2021; approximately 35% of adults in the United States have used the internet to self-diagnose a condition. Public health researchers are therefore incorporating YouTube data into their research, but guidelines for best practices around research ethics using social media data, such as YouTube, are unclear. OBJECTIVE: This study aims to describe approaches to research ethics for public health research implemented using YouTube data. METHODS: We implemented a systematic review of articles found in PubMed, SocINDEX, Web of Science, and PsycINFO following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. To be eligible for inclusion, studies needed to be published in peer-reviewed journals in English between January 1, 2006, and October 31, 2019, and include analyses of publicly available YouTube data on health or public health topics; studies using primary data collection, such as using YouTube for study recruitment, interventions, or dissemination evaluations, were not included. We extracted data on the presence of user identifying information, institutional review board (IRB) review, and informed consent processes, as well as research topic and methodology. RESULTS: This review includes 119 articles from 88 journals. The most common health and public health topics studied were in the categories of chronic diseases (44/119, 37%), mental health and substance use (26/119, 21.8%), and infectious diseases (20/119, 16.8%). Most articles (82/119, 68.9%) made no mention of ethical considerations, and some stated that the study did not meet the definition of human participant research (16/119, 13.4%). Of those that sought IRB review (15/119, 12.6%), 12 out of 15 (80%) were determined not to meet the definition of human participant research and were therefore exempt from IRB review, and 3 out of 15 (20%) received IRB approval. None of the 3 IRB-approved studies contained identifying information; one had been explicitly told by its ethics committee not to include identifying information. Only 1 study sought informed consent from YouTube users. Of 119 articles, 33 (27.7%) contained identifying information about content creators or video commenters, one of which attempted to anonymize direct quotes by not including user information. CONCLUSIONS: Given the variation in practice, concrete guidelines on research ethics for social media research are needed, especially around anonymizing and seeking consent when using identifying information. TRIAL REGISTRATION: PROSPERO CRD42020148170; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=148170.


Subjects
Research Ethics, Social Media, Adult, Humans, Data Collection, Research Ethics Committees, Informed Consent
3.
Entropy (Basel); 25(12), 2023 Dec 1.
Article in English | MEDLINE | ID: mdl-38136493

ABSTRACT

Data anonymization is a technique that safeguards individuals' privacy by modifying attribute values in published data. However, increased modifications enhance privacy but diminish the utility of published data, necessitating a balance between privacy and utility levels. k-Anonymity is a crucial anonymization technique that generates k-anonymous clusters, where the probability of disclosing a record is 1/k. However, k-anonymity fails to protect against attribute disclosure when the diversity of sensitive values within the anonymous cluster is insufficient. Several techniques have been proposed to address this issue, among which t-closeness is considered one of the most robust privacy techniques. In this paper, we propose a novel approach employing a greedy and information-theoretic clustering-based algorithm to achieve strict privacy protection. The proposed anonymization algorithm commences by clustering the data based on both the similarity of quasi-identifier values and the diversity of sensitive attribute values. In the subsequent adjustment phase, the algorithm splits and merges the clusters to ensure that they each possess at least k members and adhere to the t-closeness requirements. Finally, the algorithm replaces the quasi-identifier values of the records in each cluster with the values of the cluster center to attain k-anonymity and t-closeness. Experimental results on three microdata sets from Facebook, Twitter, and Google+ demonstrate the proposed algorithm's ability to preserve the utility of released data by minimizing the modifications of attribute values while satisfying the k-anonymity and t-closeness constraints.
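
The final steps described above, verifying cluster sizes and sensitive-value balance and then publishing cluster centers, can be sketched in a few lines. This is a toy Python illustration under simplifying assumptions (the clusters are given; closeness is measured with total variation distance, which equals the Earth Mover's Distance for categorical values under a uniform ground distance), not the paper's algorithm:

```python
from collections import Counter

# Toy records: (age, ZIP) are quasi-identifiers; the last field is sensitive.
records = [(34, 98101, "flu"), (36, 98103, "cancer"), (35, 98102, "flu"),
           (52, 98201, "diabetes"), (55, 98203, "flu"), (51, 98202, "cancer")]
clusters = [records[:3], records[3:]]     # assume the clustering phase is done
overall = Counter(r[2] for r in records)  # global sensitive-value distribution
n = len(records)

def t_distance(cluster):
    """Total variation distance between the cluster's and the global sensitive
    distributions (equals EMD under a uniform ground distance on categories)."""
    local = Counter(r[2] for r in cluster)
    values = set(overall) | set(local)
    return 0.5 * sum(abs(local[v] / len(cluster) - overall[v] / n) for v in values)

k, t = 3, 0.35
for cluster in clusters:
    assert len(cluster) >= k, "cluster violates k-anonymity: split/merge needed"
    assert t_distance(cluster) <= t, "cluster violates t-closeness"
    # Publish the cluster with quasi-identifiers replaced by the cluster center.
    center = (round(sum(r[0] for r in cluster) / len(cluster)),
              round(sum(r[1] for r in cluster) / len(cluster)))
    for r in cluster:
        print(center + (r[2],))
```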

4.
Hum Brain Mapp; 42(11): 3643-3655, 2021 Aug 1.
Article in English | MEDLINE | ID: mdl-33973694

ABSTRACT

Surface rendering of MRI brain scans may lead to identification of the participant through facial characteristics. In this study, we evaluate three methods that overwrite voxels containing privacy-sensitive information: Face Masking, FreeSurfer defacing, and FSL defacing. We included structural T1-weighted MRI scans of children, young adults, and older adults. For the young adults, test-retest data were included with a 1-week interval. The effects of the de-identification methods were quantified using different statistics to capture random variation and systematic noise in measures obtained through the FreeSurfer processing pipeline. Face Masking and FSL defacing impacted brain voxels in some scans, especially in younger participants. FreeSurfer defacing left brain tissue intact in all cases. FSL defacing and FreeSurfer defacing preserved identifiable characteristics around the eyes or mouth in some scans. For all de-identification methods, regional brain measures of subcortical volume, cortical volume, cortical surface area, and cortical thickness were on average highly replicable when derived from original versus de-identified scans, with average regional correlations >.90 for children, young adults, and older adults. Small systematic biases were found that incidentally resulted in significantly different brain measures after de-identification, depending on the studied subsample, de-identification method, and brain metric. In young adults, test-retest intraclass correlation coefficients (ICCs) were comparable for original scans and de-identified scans, with average regional ICCs >.90 for (sub)cortical volume and cortical surface area and ICCs >.80 for cortical thickness. We conclude that apparent visual differences between de-identification methods minimally impact reliability of brain measures, although small systematic biases can occur.
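
The kind of replicability check reported here, per-region correlations plus a look at systematic bias, reduces to a few lines of analysis. Below is a hedged Python sketch on synthetic volume tables (the shapes, names, and noise levels are invented, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regional volumes (participants x regions) from original scans,
# and the same measures recomputed after defacing (small noise plus a tiny bias).
original = rng.normal(5000, 400, size=(20, 4))
defaced = original + rng.normal(0, 30, size=original.shape) + 10

for region in range(original.shape[1]):
    r = np.corrcoef(original[:, region], defaced[:, region])[0, 1]
    # The mean paired difference captures the kind of small systematic bias
    # the study reports alongside the high correlations.
    bias = (defaced[:, region] - original[:, region]).mean()
    print(f"region {region}: r = {r:.3f}, mean bias = {bias:.1f} mm^3")
```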


Subjects
Brain/diagnostic imaging, Data Anonymization, Computer-Assisted Image Processing, Magnetic Resonance Imaging, Neuroimaging, Adult, Age Factors, Aged, Aged 80 and over, Cerebral Cortex, Child, Female, Humans, Male, Middle Aged, Young Adult
5.
J Korean Med Sci; 36(44): e299, 2021 Nov 15.
Article in English | MEDLINE | ID: mdl-34783216

ABSTRACT

Personal medical information is an essential resource for research; however, there are laws that regulate its use, and it typically has to be pseudonymized or anonymized. When data are anonymized, the quantity and quality of extractable information decrease significantly. From the perspective of a clinical researcher, a method of achieving pseudonymized data without degrading data quality while also preventing data loss is proposed herein. As the level of pseudonymization varies according to the research purpose, the pseudonymization method applied should be carefully chosen. Therefore, the active participation of clinicians is crucial to transform the data according to the research purpose. This can contribute to data security by simply transforming the data through secondary data processing. Case studies demonstrated that, compared with the initial baseline data, there was a clinically significant difference in the number of datapoints added with the participation of a clinician (from 267,979 to 280,127 points, P < 0.001). Thus, depending on the degree of clinician participation, data anonymization need not affect data quality and quantity, and proper data quality management, along with data security, is emphasized. Although the pseudonymization level and clinical use of data have a trade-off relationship, it is possible to create pseudonymized data while maintaining the data quality required for a given research purpose. Therefore, rather than relying solely on security guidelines, the active participation of clinicians is important.


Subjects
Data Accuracy, Data Anonymization, Biomedical Research, Cardiovascular Diseases/pathology, Data Anonymization/legislation & jurisprudence, Humans
6.
J Biomed Inform; 107: 103436, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32428572

ABSTRACT

The free-form portions of clinical notes are a significant source of information for research, but before they can be used, they must be de-identified to protect patients' privacy. De-identification efforts have focused on known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note can contain residual "Demographic Traits" (DTs), unique enough to re-identify the patient when combined with other such facts. Here we examine whether any residual risks remain after removing these identifiers. After manually annotating over 140,000 words' worth of medical notes, we found no remaining directly identifying information and a low prevalence of demographic traits, such as marital status or housing type. We developed an annotation guide to the discovered DTs and used it to label MIMIC-III and i2b2-2006 clinical notes as test sets. We then designed a "bootstrapped" active learning iterative process for identifying DTs: we tentatively labeled as positive all sentences in the DT-rich note sections, used these to train a binary classifier, manually corrected acute errors, and retrained the classifier. This train-and-correct process may be iterated. Our active learning process significantly improved the classifier's accuracy. Moreover, our BERT-based model outperformed non-neural models when trained on both tentatively labeled data and manually relabeled examples. To facilitate future research and benchmarking, we also produced and made publicly available our human-annotated DT-tagged datasets. We conclude that directly identifying information is virtually non-existent in the multiple medical note types we investigated. Demographic traits are present in medical notes but can be detected with high accuracy using a cost-effective human-in-the-loop active learning process, and redacted if desired.
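
The train-and-correct loop is straightforward to prototype. In the Python sketch below (scikit-learn; the sentences, bootstrap labels, and single-query review step are invented stand-ins for the study's annotation workflow), noisy positive labels from DT-rich sections train a classifier, and the least-certain prediction is surfaced for manual correction:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bootstrap labels: sentences from DT-rich note sections start as positive (1),
# all others as negative (0). Toy sentences; the study used MIMIC-III / i2b2 notes.
sentences = ["patient lives alone in an apartment", "married with two children",
             "takes 5 mg lisinopril daily", "denies chest pain",
             "works as a retired teacher", "blood pressure 120/80"]
labels = np.array([1, 1, 0, 0, 1, 0])  # tentative, possibly noisy

X = TfidfVectorizer().fit_transform(sentences)

for iteration in range(2):  # train-and-correct cycles
    clf = LogisticRegression().fit(X, labels)
    probs = clf.predict_proba(X)[:, 1]
    # Surface the prediction the model is least sure about for manual review.
    uncertain = int(np.argmin(np.abs(probs - 0.5)))
    print(f"iter {iteration}: review: {sentences[uncertain]!r} "
          f"(p={probs[uncertain]:.2f})")
    # A human annotator would now correct labels[uncertain] before retraining;
    # here the label is left unchanged to keep the sketch self-contained.
```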


Subjects
Deep Learning, Confidentiality, Demography, Humans, Phenotype, Problem-Based Learning
7.
J Med Internet Res; 22(7): e18055, 2020 Jul 15.
Article in English | MEDLINE | ID: mdl-32673230

ABSTRACT

BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE: This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS: We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS: We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to link associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS: Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
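
The name-to-code distance signal described in the results can be demonstrated with toy vectors. In this Python/NumPy sketch, the vectors are constructed (not trained) so the co-occurrence effect is visible: the name embedding sits measurably closer to the billed code than to an unrelated one.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
dim = 50

# Toy embeddings: in a real leak, co-occurrence in the notes pulls a patient's
# name vector toward the codes billed for that patient.
billed_code = rng.normal(size=dim)  # e.g., the vector for a billed ICD code
other_code = rng.normal(size=dim)   # a code never billed for the patient
patient_name = 0.6 * billed_code + 0.4 * rng.normal(size=dim)

print("similarity(name, billed code):   ", round(cosine(patient_name, billed_code), 3))
print("similarity(name, unrelated code):", round(cosine(patient_name, other_code), 3))
# A consistent gap across many patients is the signal such an attack exploits.
```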


Subjects
Confidentiality/trends, Natural Language Processing, Algorithms, Humans
8.
BMC Med Inform Decis Mak; 20(1): 29, 2020 Feb 11.
Article in English | MEDLINE | ID: mdl-32046701

ABSTRACT

BACKGROUND: Modern data driven medical research promises to provide new insights into the development and course of disease and to enable novel methods of clinical decision support. To realize this, machine learning models can be trained to make predictions from clinical, paraclinical and biomolecular data. In this process, privacy protection and regulatory requirements need careful consideration, as the resulting models may leak sensitive personal information. To counter this threat, a wide range of methods for integrating machine learning with formal methods of privacy protection have been proposed. However, there is a significant lack of practical tools to create and evaluate such privacy-preserving models. In this software article, we report on our ongoing efforts to bridge this gap. RESULTS: We have extended the well-known ARX anonymization tool for biomedical data with machine learning techniques to support the creation of privacy-preserving prediction models. Our methods are particularly well suited for applications in biomedicine, as they preserve the truthfulness of data (e.g. no noise is added) and they are intuitive and relatively easy to explain to non-experts. Moreover, our implementation is highly versatile, as it supports binomial and multinomial target variables, different types of prediction models and a wide range of privacy protection techniques. All methods have been integrated into a sound framework that supports the creation, evaluation and refinement of models through intuitive graphical user interfaces. To demonstrate the broad applicability of our solution, we present three case studies in which we created and evaluated different types of privacy-preserving prediction models for breast cancer diagnosis, diagnosis of acute inflammation of the urinary system and prediction of the contraceptive method used by women. In this process, we also used a wide range of different privacy models (k-anonymity, differential privacy and a game-theoretic approach) as well as different data transformation techniques. CONCLUSIONS: With the tool presented in this article, accurate prediction models can be created that preserve the privacy of individuals represented in the training set in a variety of threat scenarios. Our implementation is available as open source software.
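
One way to picture the "truthful" transformations the tool favors (generalization rather than noise addition) is to coarsen a quasi-identifier before training and compare accuracy. The Python sketch below illustrates only that trade-off; it does not reproduce ARX's algorithms or privacy models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Toy clinical data: age and a lab value predict a binary diagnosis.
age = rng.integers(20, 80, size=200)
lab = rng.normal(1.0, 0.3, size=200)
y = ((age > 50) & (lab > 1.0)).astype(int)

# Truthful generalization: coarsen age to decade bins instead of adding noise,
# mirroring the kind of transformation k-anonymity-style tools apply.
age_generalized = (age // 10) * 10

for name, feature in [("original", age), ("generalized", age_generalized)]:
    X = np.column_stack([feature, lab])
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {acc:.2f}")
# A small accuracy drop for the coarsened data illustrates the trade-off.
```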


Subjects
Confidentiality, Data Anonymization, Clinical Decision Support Systems, Statistical Models, Software, Biomedical Research, Humans, Machine Learning, ROC Curve, Reproducibility of Results
9.
BMC Med Inform Decis Mak; 20(1): 155, 2020 Jul 8.
Article in English | MEDLINE | ID: mdl-32641043

ABSTRACT

BACKGROUND: Various methods based on k-anonymity have been proposed for publishing medical data while preserving privacy. However, the k-anonymity property assumes that adversaries possess fixed background knowledge. Although differential privacy overcomes this limitation, it is specialized for aggregated results. Thus, it is difficult to obtain high-quality microdata. To address this issue, we propose a differentially private medical microdata release method featuring high utility. METHODS: We propose a method of anonymizing medical data under differential privacy. To improve data utility, especially by preserving informative attribute values, the proposed method adopts three data perturbation approaches: (1) generalization, (2) suppression, and (3) insertion. The proposed method produces an anonymized dataset that is nearly optimal with regard to utility, while preserving privacy. RESULTS: The proposed method achieves lower information loss than existing methods. In a real-world case study, we show that the results of data analyses using the original dataset and those obtained using a dataset anonymized via the proposed method are highly similar. CONCLUSIONS: We propose a novel differentially private anonymization method that preserves informative values for the release of medical data. Through experiments, we show that the utility of medical data that has been anonymized via the proposed method is significantly better than that of existing methods.
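
A common baseline for differentially private microdata combines the same three ingredients: generalize values into bins, perturb the bin counts with the Laplace mechanism, then suppress or insert records to match the noisy counts. The Python sketch below shows that baseline (an assumed stand-in, not the paper's optimized algorithm); histogram queries over disjoint bins have sensitivity 1, so the noise scale is 1/ε:

```python
import numpy as np

rng = np.random.default_rng(3)
epsilon = 1.0

# Toy microdata: patient ages, generalized into decade bins.
ages = rng.integers(20, 70, size=100)
bins = (ages // 10) * 10
bin_values, true_counts = np.unique(bins, return_counts=True)

# Laplace mechanism: disjoint histogram bins have sensitivity 1 -> scale 1/eps.
noise = rng.laplace(scale=1 / epsilon, size=len(true_counts))
noisy_counts = np.maximum(0, np.round(true_counts + noise)).astype(int)

# Rebuild microdata to match the noisy histogram: records are suppressed from
# bins whose noisy count shrank and synthetic ones inserted where it grew.
released = np.repeat(bin_values, noisy_counts)
print("true counts: ", dict(zip(bin_values.tolist(), true_counts.tolist())))
print("noisy counts:", dict(zip(bin_values.tolist(), noisy_counts.tolist())))
print("released records:", len(released))
```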


Subjects
Data Anonymization, Text Messaging, Adolescent, Adult, Aged, Aged 80 and over, Child, Preschool Child, Female, Humans, Interior Design and Furnishings, Knowledge, Male, Middle Aged, Privacy, Young Adult
10.
J Med Internet Res; 21(2): e11985, 2019 Feb 21.
Article in English | MEDLINE | ID: mdl-30789346

ABSTRACT

With the expansion and popularity of research on websites such as Facebook and Twitter, there has been increasing concern about investigator conduct and social media ethics. The availability of large data sets has attracted researchers who are not traditionally associated with health data and its associated ethical considerations, such as computer and data scientists. Reliance on oversight by ethics review boards is inadequate and, due to the public availability of social media data, there is often confusion between public and private spaces. In addition, social media participants and researchers may pay little attention to traditional terms of use. In this paper, we review four cases involving ethical and terms-of-use violations by researchers seeking to conduct social media studies in an online patient research network. These violations involved unauthorized scraping of social media data, entry of false information, misrepresentation of researcher identities to participants on forums, lack of ethical approval and informed consent, use of member quotations, and presentation of findings at conferences and in journals without verifying accuracy or acknowledging potential biases and limitations of the data. The correction of these ethical lapses often involves much effort in detecting and responding to violators, addressing these lapses with members of an online community, and correcting inaccuracies in the literature (including retraction of publications and conference presentations). Despite these corrective actions, we do not regard these episodes solely as violations. Instead, they represent broader ethical issues that may arise from potential sources of confusion, misinformation, inadequacies in applying traditional informed consent procedures to social media research, and differences in ethics training and scientific methodology across research disciplines. Social media research stakeholders need to assure participants that their studies will not compromise anonymity or lead to harmful outcomes while preserving the societal value of their health-related studies. Based on our experience and published recommendations by social media researchers, we offer potential directions for future prevention-oriented measures that can be applied by data producers, computer/data scientists, institutional review boards, research ethics committees, and publishers.


Subjects
Electronic Data Processing/methods, Research Ethics Committees/standards, Social Media/ethics, Humans, Internet
11.
J Med Internet Res; 21(4): e12300, 2019 Apr 12.
Article in English | MEDLINE | ID: mdl-30977738

ABSTRACT

BACKGROUND: Clinical and social trials create evidence that enables medical progress. However, the gathering of personal and patient data requires high security and privacy standards. Direct linking of personal information and medical data is commonly hidden through pseudonymization. While this makes unauthorized access to personal medical data more difficult, a centralized pseudonymization list can still pose a security risk. In addition, medical data linked via pseudonyms can still be used for data-driven reidentification. OBJECTIVE: Our objective was to propose a novel approach to pseudonymization based on public-private key cryptography that allows (1) decentralized patient-driven creation and maintenance of pseudonyms, (2) 1-time pseudonymization of each data record, and (3) grouping of patient data records even without knowing the pseudonymization key. METHODS: Based on public-private key cryptography, we set up a signing mechanism for patient data records and detailed the workflows for (1) user registration, (2) user log-in, (3) record storing, and (4) record grouping. We evaluated the proposed mechanism for performance, examined the potential risks based on cryptographic collision, and carried out a threat analysis. RESULTS: The performance analysis showed that all workflows could be performed with an average runtime of 0.057 to 42.320 ms (user registration), 0.083 to 0.606 ms (record creation), and 0.005 to 0.198 ms (record grouping) depending on the chosen cryptographic tools. We expected no realistic risk of cryptographic collision in the proposed system, and the threat analysis revealed that 3 distinct server systems of the proposed setup had to be compromised to allow access to combined medical data and private data. However, this would still allow only for data-driven reidentification. For a full reidentification, all 3 trial servers and all study participants would have to be compromised. In addition, the approach supports consent management, automatically anonymizes the data after trial closure, and provides basic mechanisms against data forging. CONCLUSIONS: The proposed approach has a high security and privacy level in comparison with traditional centralized pseudonymization approaches and does not require a trusted third party. The only drawback in comparison with central pseudonymization is the directed feedback of accidental findings to individual participants, as this is not possible with quasi-anonymous storage of patient data.
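
The core primitive, a one-time pseudonym per record that a key holder can later group without the private pseudonymization key, can be sketched with off-the-shelf Ed25519 signatures. This simplified Python sketch (using the `cryptography` package) captures only the signing idea, not the paper's multi-server architecture, log-in, or consent workflows:

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Patient side: the private key never leaves the patient (decentralized control).
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def pseudonymize(record_index: int) -> tuple[str, bytes]:
    """One-time pseudonym per record: hash of a signature over the record index."""
    message = f"record-{record_index}".encode()
    signature = private_key.sign(message)
    return hashlib.sha256(signature).hexdigest()[:16], signature

records = [pseudonymize(i) for i in range(3)]
print("pseudonyms:", [p for p, _ in records])  # distinct for every record

# Researcher side: whoever holds only the public key can group the records by
# verifying the signatures, without ever holding the private signing key.
for i, (pseudonym, signature) in enumerate(records):
    try:
        public_key.verify(signature, f"record-{i}".encode())
        print(f"{pseudonym}: groups with this patient")
    except InvalidSignature:
        print(f"{pseudonym}: not this patient's record")
```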


Subjects
Computer Security/standards, Confidentiality/standards, Computerized Medical Records Systems/standards, Feasibility Studies, Humans, Theoretical Models
12.
J Med Internet Res; 21(8): e14126, 2019 Aug 6.
Article in English | MEDLINE | ID: mdl-31389335

ABSTRACT

BACKGROUND: There has been significant effort in attempting to use health care data. However, laws that protect patients' privacy have restricted data use because health care data contain sensitive information. Thus, discussions on privacy laws now focus on the active use of health care data beyond protection. However, the current literature does not clarify the obstacles that make data usage and deidentification processes difficult or elaborate on users' needs for data linking from practical perspectives. OBJECTIVE: The objective of this study is to investigate (1) the current status of data use in each medical area, (2) institutional efforts and difficulties in deidentification processes, and (3) users' data linking needs. METHODS: We conducted a cross-sectional online survey. To recruit people who have used health care data, we publicized the promotion campaign and sent official documents to an academic society encouraging participation in the online survey. RESULTS: In total, 128 participants responded to the online survey; 10 participants were excluded for either inconsistent responses or lack of demand for health care data. Finally, 118 participants' responses were analyzed. The majority of participants worked in general hospitals or universities (62/118, 52.5% and 51/118, 43.2%, respectively; multiple-choice answers). More than half of the participants responded that they have a need for clinical data (82/118, 69.5%) and public data (76/118, 64.4%). Furthermore, 85.6% (101/118) of respondents applied deidentification measures when using data, and they considered a rigid social culture an obstacle to deidentification (28/101, 27.7%). In addition, they required data linking (98/118, 83.1%), and they cited deregulation and data standardization as prerequisites for access to health care data linking (33/98, 33.7% and 38/98, 38.8%, respectively). There were no significant differences in the proportion of reported data needs and linking between groups that used health care data for public purposes and those that used it for commercial purposes. CONCLUSIONS: This study provides a cross-sectional view, from a practical, user-oriented perspective, of the kinds of data users want to utilize, the efforts and difficulties involved in deidentification processes, and the needs for data linking. Most users want to use clinical and public data, and most participants conduct deidentification processes and express a desire to conduct data linking. Our study confirmed that they noted regulation as a primary obstacle whether their purpose was commercial or public. A legal system based on both data utilization and data protection needs is required.


Subjects
Access to Information, Communication Barriers, Computer Security, Factual Databases, Adult, Cross-Sectional Studies, Female, Humans, Internet, Male, Middle Aged, Republic of Korea, Surveys and Questionnaires, Young Adult
13.
Entropy (Basel); 20(5), 2018 May 17.
Article in English | MEDLINE | ID: mdl-33265463

ABSTRACT

The topic of big data has attracted increasing interest in recent years. The emergence of big data creates new difficulties for the protection models used to ensure data privacy, which is a necessity for sharing and processing data. Protecting individuals' sensitive information while maintaining the usability of the published data set is the most important challenge in privacy preservation. In this regard, data anonymization methods are used to protect data against identity disclosure and linking attacks. In this study, a novel data anonymization algorithm based on chaos and perturbation is proposed for privacy and utility preservation in big data. The performance of the proposed algorithm is evaluated in terms of Kullback-Leibler divergence, probabilistic anonymity, classification accuracy, F-measure, and execution time. The experimental results show that the proposed algorithm is efficient and performs better in terms of Kullback-Leibler divergence, classification accuracy, and F-measure than most existing algorithms applied to the same data set. Because it applies chaos to perturb the data, the proposed algorithm is a promising tool for privacy-preserving data mining and data publishing.
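
Chaos-based perturbation typically replaces random noise with a deterministic chaotic sequence such as the logistic map. The following Python sketch is a minimal illustration under invented parameters (seed, map constant r = 3.99, noise scale); the paper's actual scheme is more elaborate:

```python
import numpy as np

def logistic_map_sequence(n: int, x0: float = 0.4, r: float = 3.99) -> np.ndarray:
    """Chaotic sequence in (0, 1) from the logistic map x' = r * x * (1 - x)."""
    seq = np.empty(n)
    x = x0
    for i in range(n):
        x = r * x * (1 - x)
        seq[i] = x
    return seq

rng = np.random.default_rng(4)
salaries = rng.normal(50_000, 8_000, size=10)  # a toy sensitive attribute

# Perturb each value with roughly zero-centered chaotic noise scaled to 5% of
# the attribute's standard deviation.
noise = (logistic_map_sequence(len(salaries)) - 0.5) * 0.05 * salaries.std()
anonymized = salaries + noise

print("mean shift:", round(float(anonymized.mean() - salaries.mean()), 2))
print("std shift: ", round(float(anonymized.std() - salaries.std()), 2))
```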

14.
BMC Med Inform Decis Mak; 17(1): 104, 2017 Jul 11.
Article in English | MEDLINE | ID: mdl-28693480

ABSTRACT

BACKGROUND: Publishing raw electronic health records (EHRs) may be considered a breach of the privacy of individuals because they usually contain sensitive information. A common practice in privacy-preserving data publishing is to anonymize the data before publishing so as to satisfy privacy models such as k-anonymity. Among the various anonymization techniques, generalization is the most commonly used in medical/health data processing. Generalization inevitably causes information loss, and thus various methods have been proposed to reduce it. However, existing generalization-based data anonymization methods cannot avoid excessive information loss and fail to preserve data utility. METHODS: We propose a utility-preserving anonymization for privacy-preserving data publishing (PPDP). To preserve data utility, the proposed method comprises three parts: (1) a utility-preserving model, (2) counterfeit record insertion, and (3) a catalog of the counterfeit records. We also propose an anonymization algorithm using the proposed method, which applies a full-domain generalization algorithm. We evaluated our method against an existing method on two aspects: information loss, measured through various quality metrics, and the error rate of analysis results. RESULTS: Across all types of quality metrics, our proposed method shows lower information loss than the existing method. In a real-world EHR analysis, results obtained from data anonymized by the proposed method differed only marginally from those obtained from the original data. CONCLUSIONS: We propose a new utility-preserving anonymization method and an anonymization algorithm using the proposed method. Through experiments on various datasets, we show that the utility of EHRs anonymized by the proposed method is significantly better than that of those anonymized by previous approaches.
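
The counterfeit-plus-catalog idea can be pictured in a few lines: pad undersized equivalence classes with fabricated records instead of generalizing further, and publish how many counterfeits each class contains so analysts can correct aggregate counts. A toy Python sketch (records and k invented; this is not the paper's full-domain generalization algorithm):

```python
import random
from collections import defaultdict

random.seed(0)
k = 3
# Toy records already generalized on the quasi-identifier (an age band).
records = [("30-39", "flu"), ("30-39", "asthma"), ("30-39", "flu"),
           ("40-49", "cancer"), ("40-49", "flu")]
sensitive_pool = [s for _, s in records]

groups = defaultdict(list)
for qi, s in records:
    groups[qi].append((qi, s))

catalog = {}  # published with the data so analysts can correct aggregate counts
for qi, members in groups.items():
    deficit = k - len(members)
    if deficit > 0:
        # Pad undersized groups with counterfeit records drawn from the overall
        # sensitive-value distribution instead of generalizing further.
        members.extend((qi, random.choice(sensitive_pool)) for _ in range(deficit))
        catalog[qi] = deficit  # how many counterfeits this group contains

released = [rec for members in groups.values() for rec in members]
print("released:", released)
print("counterfeit catalog:", catalog)  # analysts subtract these from counts
```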


Subjects
Data Anonymization, Electronic Health Records, Medical Informatics, Privacy, Humans
15.
Knowl Based Syst; 115: 15-26, 2017 Jan 1.
Article in English | MEDLINE | ID: mdl-28603388

ABSTRACT

Preserving privacy and utility during data publishing and data mining is essential for individuals, data providers and researchers. However, studies in this area typically assume that one individual has only one record in a dataset, which is unrealistic in many applications. Having multiple records for an individual leads to new privacy leakages. We call such a dataset a 1:M dataset. In this paper, we propose a novel privacy model called (k, l)-diversity that addresses disclosure risks in 1:M data publishing. Based on this model, we develop an efficient algorithm named 1:M-Generalization to preserve privacy and data utility, and compare it with alternative approaches. Extensive experiments on real-world data show that our approach outperforms the state-of-the-art technique, in terms of data utility and computational cost.
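
A checker for this privacy condition, as we read it from the abstract, must count distinct individuals per group rather than records (the key 1:M correction) and additionally require l distinct sensitive values per group. The Python sketch below encodes that reading on toy data; the paper's formal definition may differ in detail:

```python
from collections import defaultdict

# Toy 1:M data: one individual may contribute several records.
# Each row: (individual id, quasi-identifier group, sensitive value)
rows = [(1, "g1", "flu"), (1, "g1", "asthma"), (2, "g1", "cancer"),
        (3, "g1", "flu"), (4, "g2", "flu"), (5, "g2", "hiv"), (6, "g2", "cancer")]

def satisfies_kl_diversity(rows, k: int, l: int) -> bool:
    individuals = defaultdict(set)  # group -> distinct individuals
    sensitive = defaultdict(set)    # group -> distinct sensitive values
    for pid, group, value in rows:
        individuals[group].add(pid)
        sensitive[group].add(value)
    # Count individuals per group, not records (the 1:M correction), and
    # require at least l distinct sensitive values per group.
    return all(len(individuals[g]) >= k and len(sensitive[g]) >= l
               for g in individuals)

print(satisfies_kl_diversity(rows, k=3, l=2))  # True for this toy partition
```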

16.
BMC Med Inform Decis Mak; 16 Suppl 1: 58, 2016 Jul 18.
Article in English | MEDLINE | ID: mdl-27454754

ABSTRACT

BACKGROUND: To facilitate long-term safety surveillance of marketed drugs, many spontaneous reporting systems (SRSs) for ADR events have been established worldwide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, this raises the issue of privacy-preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting the privacy of SRS data, and none of the existing anonymization methods is well suited to SRS datasets, which contain characteristics such as rare events, multiple records per individual, and multi-valued sensitive attributes. METHODS: We propose a new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADR reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ*, for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy-based clustering strategy to group the records into clusters, conforming to an innovative anonymization metric that aims to minimize the privacy risk as well as maintain the data utility for ADR detection. An empirical study was conducted using the FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ*: uniform, level-wise, and frequency-based. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals. RESULTS: Under all three threshold settings for sensitive values, our method successfully prevents the disclosure of sensitive values (nearly all observed DRs are zero) without sacrificing too much data utility. With non-uniform threshold settings, level-wise or frequency-based, our MS(k, θ*)-bounding exhibits the best data utility and the least privacy risk among all the models. The experiments conducted on selected ADR signals from MedWatch show that only very small differences in signal strength (PRR or ROR) were observed. The results show that our method can effectively prevent the disclosure of patients' sensitive information without sacrificing data utility for ADR signal detection. CONCLUSIONS: We propose a new privacy model for protecting SRS data that possess characteristics overlooked by contemporary models, and an anonymization algorithm to sanitize SRS data in accordance with the proposed model. Empirical evaluation on the real SRS dataset, i.e., FAERS, shows that our method can effectively solve the privacy problem in SRS data without influencing the ADR signal strength.
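
Signal strength here is measured with standard disproportionality statistics. For reference, this Python sketch computes PRR and ROR from a 2x2 contingency table (the counts are invented); re-running it on a sanitized table shows how much signal survives anonymization:

```python
# Toy 2x2 counts from a spontaneous reporting system:
a = 40      # reports: drug of interest with the suspected reaction
b = 960     # drug of interest, other reactions
c = 200     # other drugs, suspected reaction
d = 48_800  # other drugs, other reactions

prr = (a / (a + b)) / (c / (c + d))  # proportional reporting ratio
ror = (a / b) / (c / d)              # reporting odds ratio
print(f"PRR = {prr:.2f}, ROR = {ror:.2f}")

# Recomputing PRR/ROR on the sanitized table quantifies how much of the ADR
# signal survives anonymization; the study reports only small differences.
```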


Subjects
Adverse Drug Reaction Reporting Systems/standards, Data Anonymization, Theoretical Models, Privacy, Humans
17.
JMIR Bioinform Biotechnol; 5: e54332, 2024 May 27.
Article in English | MEDLINE | ID: mdl-38935957

ABSTRACT

BACKGROUND: Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. OBJECTIVE: This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. METHODS: We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. RESULTS: From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. CONCLUSIONS: On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.

18.
Insights Imaging; 15(1): 130, 2024 May 31.
Article in English | MEDLINE | ID: mdl-38816658

ABSTRACT

Artificial intelligence (AI) is revolutionizing the field of medical imaging, holding the potential to shift medicine from a reactive "sick-care" approach to a proactive focus on healthcare and prevention. The successful development of AI in this domain relies on access to large, comprehensive, and standardized real-world datasets that accurately represent diverse populations and diseases. However, images and data are sensitive, and as such, the data need to be modified before any use to protect the privacy of the patients. This paper explores the approaches of five EU projects working on the creation of ethically compliant and GDPR-regulated European medical imaging platforms focused on cancer-related data. It presents the individual approaches to the de-identification of imaging data and describes the problems and the solutions adopted in each case. Further, lessons learned are provided, enabling future projects to optimally handle the problem of data de-identification. CRITICAL RELEVANCE STATEMENT: This paper presents key approaches from five flagship EU projects for the de-identification of imaging and clinical data, offering valuable insights and guidelines in the domain. KEY POINTS: AI models for health imaging require access to large amounts of data. Access to large imaging datasets requires an appropriate de-identification process. This paper provides de-identification guidelines from the AI for health imaging (AI4HI) projects.

19.
Can J Public Health; 115(4): 680-687, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38806937

ABSTRACT

SETTING: The potential for exposure to indoor radon varies dramatically across British Columbia (BC) due to varied geology. Individuals may struggle to understand their exposure risk and agencies may struggle to understand the value of population-level programs and policies to mitigate risk. INTERVENTION: The BC Centre for Disease Control (BCCDC) established the BC Radon Data Repository (BCRDR) to facilitate radon research, public awareness, and action in the province. The BCRDR aggregates indoor radon measurements collected by government agencies, industry professionals and organizations, and research and advocacy groups. Participation was formalized with a data sharing agreement, which outlines how the BCCDC anonymizes and manages the shared data integrated into the BCRDR. OUTCOMES: The BCRDR currently holds 38,733 measurements from 18 data contributors. The repository continues to grow with new measurements from existing contributors and the addition of new contributors. A prominent use of the BCRDR was to create the online, interactive BC Radon Map, which includes regional concentration summaries, risk interpretation messaging, and health promotion information. Anonymized BCRDR data are also available for external release upon request. IMPLICATIONS: The BCCDC leverages existing radon measurement programs to create a large and integrated database with wide geographic coverage. The development and application of the BCRDR informs public health research and action beyond the BCCDC, and the repository can serve as a model for other regional or national initiatives.




Subjects
Public Health, Radon, Indoor Air Pollution/prevention & control, British Columbia/epidemiology, Factual Databases, Health Communication/methods, Information Dissemination, Information Sources
20.
Dement Neurocogn Disord; 23(3): 127-135, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39113754

ABSTRACT

Background and Purpose: To ensure data privacy, the development of defacing processes, which anonymize brain images by obscuring facial features, is crucial. However, the impact of these defacing methods on brain imaging analysis poses significant concern. This study aimed to evaluate the reliability of three different defacing methods in automated brain volumetry. Methods: Magnetic resonance imaging with three-dimensional T1 sequences was performed on ten patients diagnosed with subjective cognitive decline. Defacing was executed using mri_deface, BioImage Suite Web-based defacing, and Defacer. Brain volumes were measured employing the QBraVo program and FreeSurfer, assessing the intraclass correlation coefficient (ICC) and the mean differences in brain volume measurements between the original and defaced images. Results: The mean age of the patients was 71.10±6.17 years, and 4 (40.0%) were male. The total intracranial volume, total brain volume, and ventricle volume exhibited high ICCs across the three defacing methods and 2 volumetry analyses. All regional brain volumes showed high ICCs with all three defacing methods. Despite variations among some brain regions, no significant mean differences in regional brain volume were observed between the original and defaced images across all regions. Conclusions: The three defacing algorithms evaluated did not significantly affect the results of image analysis for the entire brain or specific cerebral regions. These findings suggest that these algorithms can serve as robust methods for defacing in neuroimaging analysis, thereby supporting data anonymization without compromising the integrity of brain volume measurements.
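
The agreement statistic used throughout, the ICC, is simple to compute directly. Below is a hedged Python sketch of ICC(2,1) (two-way random effects, absolute agreement, single measurement, per Shrout and Fleiss), applied to synthetic original-versus-defaced volumes; the numbers are invented, not the study's:

```python
import numpy as np

def icc_2_1(Y: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement."""
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()  # between methods
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(5)
# Hypothetical total brain volumes (cm^3) for 10 subjects, measured on the
# original scans and again after defacing (small noise plus a tiny bias).
original = rng.normal(1200, 80, size=10)
defaced = original + rng.normal(0, 5, size=10) + 1.0
print(f"ICC(2,1) = {icc_2_1(np.column_stack([original, defaced])):.3f}")
```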
