ABSTRACT
Precision medicine relies on molecular and systems biology methods as well as bidirectional association studies of phenotypes and (high-throughput) genomic data. However, the integrated use of such data often faces obstacles, especially with regard to data protection. An important prerequisite for processing research data is usually informed consent, but collecting consent is not always feasible, in particular when data are to be analyzed retrospectively. For phenotype data, anonymization, i.e. altering data in such a way that individuals cannot be identified, can provide an alternative. Several re-identification attacks have shown that this is a complex task and that simply removing directly identifying attributes such as names is usually not enough. More formal approaches are needed that use mathematical models to quantify risks and guide their reduction. Due to the complexity of these techniques, implementing them from scratch is challenging and not advisable; open software libraries and tools can provide a robust alternative. However, the range of available anonymization tools is also heterogeneous, and obtaining an overview of their strengths and weaknesses is difficult due to the complexity of the problem space. We therefore performed a systematic review of open anonymization tools for structured phenotype data described in the literature between 1990 and 2021. Through a two-step eligibility assessment process, we selected 13 tools for an in-depth analysis. By comparing the supported anonymization techniques and further aspects, such as maturity, we derive recommendations on which tools to use for anonymizing phenotype datasets with different properties.
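As a minimal illustration of such formal risk models (a generic sketch, not taken from any of the reviewed tools): under the common prosecutor attacker model, a record's re-identification risk is the reciprocal of the size of its equivalence class, i.e. the group of records sharing the same quasi-identifier values, and a dataset is k-anonymous when every class contains at least k records. The attribute names below are hypothetical.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Group records by their combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case (prosecutor) risk: 1 / size of the smallest class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return 1.0 / min(classes.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity holds iff every equivalence class has >= k members."""
    return min(equivalence_classes(records, quasi_identifiers).values()) >= k

# Hypothetical, already generalized records: the third one is unique on
# (age, zip) and therefore carries a worst-case risk of 1.
records = [
    {"age": "30-40", "zip": "811**", "diagnosis": "I10"},
    {"age": "30-40", "zip": "811**", "diagnosis": "E11"},
    {"age": "40-50", "zip": "803**", "diagnosis": "I10"},
]
```

Anonymization tools automate exactly this kind of risk quantification and then search for data transformations that push the maximum risk below a chosen threshold.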
Subject(s)
Biomedical Research, Privacy, Retrospective Studies, Data Anonymization, Phenotype
ABSTRACT
BACKGROUND: Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that they can no longer reasonably be related to a person. Yet, such alterations have the potential to influence the data set's statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice. OBJECTIVE: The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study. METHODS: The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case-specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results. RESULTS: Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics.
For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap of over 50%. The use case-specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy. CONCLUSIONS: Our results illustrate the challenges anonymization faces when it must support multiple likely and possibly competing uses, whereas use case-specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the costs associated with anonymized data and when attempting to maintain sufficiently high levels of privacy. TRIAL REGISTRATION: German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1093/ndt/gfr456.
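To illustrate the reproducibility metric, one plausible definition of CI overlap (a sketch; the study's exact formula may differ) is the length of the intersection of the anonymized and original intervals, averaged relative to each interval's own length:

```python
def ci_overlap(original, anonymized):
    """Fractional overlap of two confidence intervals given as (low, high).

    Returns 1.0 for identical intervals and 0.0 for disjoint ones.
    """
    lo = max(original[0], anonymized[0])
    hi = min(original[1], anonymized[1])
    overlap = max(0.0, hi - lo)
    return 0.5 * (overlap / (original[1] - original[0])
                  + overlap / (anonymized[1] - anonymized[0]))
```

Under this definition, a shifted or widened anonymized CI is penalized symmetrically, which matches the intuition that reproduced estimates should neither drift nor lose precision.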
Subject(s)
Data Anonymization, Humans, Renal Insufficiency, Chronic/therapy, Information Dissemination/methods, Algorithms, Germany, Confidentiality, Privacy
ABSTRACT
The increasing digitization of the healthcare system is leading to a growing volume of health data. Leveraging these data beyond their initial collection purpose for secondary use can provide valuable insights into diagnostics, treatment processes, and the quality of care. The Health Data Lab (HDL) will provide the infrastructure for this purpose. Both the protection of patient privacy and optimal analytical capabilities are of central importance in this context, and artificial intelligence (AI) provides two opportunities. First, it enables the analysis of large volumes of data with flexible models, allowing hidden correlations and patterns to be discovered. Second, synthetic (that is, artificial) data generated by AI can protect privacy. This paper describes the KI-FDZ project, which aims to investigate innovative technologies that can support the secure provision of health data for secondary research purposes. A multi-layered approach is investigated in which data-level measures can be combined in different ways with processing in secure environments. To this end, anonymization and synthetization methods, among others, are evaluated based on two concrete application examples. Moreover, it is examined how the creation of machine learning pipelines and the execution of AI algorithms can be supported in secure processing environments. Preliminary results indicate that this approach can achieve a high level of protection while maintaining data validity. The approach investigated in the project can be an important building block for the secure secondary use of health data.
Subject(s)
Algorithms, Artificial Intelligence, Humans, Germany, Delivery of Health Care
ABSTRACT
Healthcare data are an important resource in applied medical research and are available across multiple centers. However, it remains a challenge to enable standardized data exchange processes between federal states, each with its own laws and regulations. The Medical Informatics Initiative (MII) was founded in 2016 to implement processes that enable cross-clinic access to healthcare data in Germany. Several working groups (WGs) have been set up to coordinate standardized data structures (WG Interoperability), patient information and declarations of consent (WG Consent), and regulations on data exchange (WG Data Sharing). Here we present the most important results of the Data Sharing working group, which include agreed terms of use, legal regulations, and data access processes. They are already being implemented by the established Data Integration Centers (DIZ) and Use and Access Committees (UACs). We describe the services that are necessary to provide researchers with standardized data access; they are implemented, among others, with the Research Data Portal for Health. Since the pilot phase, 385 active researchers have used these processes, which, as of April 2024, has resulted in 19 registered projects and 31 submitted research applications.
Subject(s)
Electronic Health Records, Information Dissemination, Humans, Biomedical Research, Electronic Health Records/statistics & numerical data, Germany, Health Services Research, Medical Informatics, Medical Record Linkage/methods, Organizational Models
ABSTRACT
BACKGROUND: The digitalization of the healthcare sector promises a secondary use of patient data in the sense of a learning healthcare system. For this, the Medical Informatics Initiative's (MII) Consent Working Group has created an ethical and legal basis with standardized consent documents. This paper describes the systematically monitored introduction of these documents at the MII sites. METHODS: The monitoring of the introduction included regular online surveys, an in-depth analysis of the introduction processes at selected sites, and an assessment of the documents in use. In addition, inquiries and feedback from a large number of stakeholders were evaluated. RESULTS: The online surveys showed that 27 of the 32 sites have gradually put the consent documents into productive use, with a current total of 173,289 consents. The analysis of the implementation procedures revealed heterogeneous organizational conditions at the sites. The requirements of various stakeholders were met by developing and providing supplementary versions of the consent documents and additional information materials. DISCUSSION: The introduction of the MII consent documents at the university hospitals creates a uniform legal basis for the secondary use of patient data. However, comprehensive implementation within the sites remains challenging. Therefore, minimum requirements for patient information and supplementary recommendations for best practice must be developed. The further development of the national legal framework for research will not render the participation and transparency mechanisms developed here obsolete.
Subject(s)
Informed Consent, Germany, Informed Consent/legislation & jurisprudence, Informed Consent/standards, Humans, Electronic Health Records/legislation & jurisprudence, Electronic Health Records/standards, Consent Forms/standards, Consent Forms/legislation & jurisprudence, National Health Programs/legislation & jurisprudence
ABSTRACT
Healthcare-associated infections (HCAIs) represent an enormous burden for patients, healthcare workers, relatives, and society worldwide, including in Germany. The central tasks of infection prevention are recording and evaluating infections with the aim of identifying prevention potential and risk factors, taking appropriate measures, and finally evaluating them. From an infection prevention perspective, it would be of great value if (i) the recording of infection cases were automated and (ii) it were possible to identify in advance particularly vulnerable patients and patient groups who would benefit from specific and/or additional interventions. To achieve this risk-adapted, individualized infection prevention, the RISK PRINCIPE research project develops algorithms and computer-based applications based on standardised, large datasets and incorporates expertise in the field of infection prevention. The project has two objectives: a) to develop and validate a semi-automated surveillance system for hospital-acquired bloodstream infections, prototypically for HCAIs, and b) to use comprehensive patient data from different sources to create an individual or group-specific infection risk profile. RISK PRINCIPE is based on bringing together the expertise of medical informatics and infection medicine with a focus on hygiene and draws on information and experience from two consortia (HiGHmed and SMITH) of the German Medical Informatics Initiative (MII), which have been working on use cases in infection medicine for more than five years.
Subject(s)
Cross Infection, Humans, Algorithms, Cross Infection/prevention & control, Cross Infection/epidemiology, Germany/epidemiology, Infection Control/methods, Infection Control/standards, Population Surveillance/methods, Risk Assessment/methods, Risk Factors
ABSTRACT
The Interoperability Working Group of the Medical Informatics Initiative (MII) is the platform for the coordination of overarching procedures, data structures, and interfaces between the data integration centers (DIC) of the university hospitals and national and international interoperability committees. The goal is the joint content-related and technical design of a distributed infrastructure for the secondary use of healthcare data that can be used via the Research Data Portal for Health. Important general conditions are data privacy and IT security for the use of health data in biomedical research. To this end, suitable methods are used in dedicated task forces to enable procedural, syntactic, and semantic interoperability for data use projects. The MII core dataset was developed as several modules with corresponding information models and implemented using the HL7® FHIR® standard to enable content-related and technical specifications for the interoperable provision of healthcare data through the DIC. International terminologies and consented metadata are used to describe these data in more detail. The overall architecture, including overarching interfaces, implements the methodological and legal requirements for a distributed data use infrastructure, for example, by providing pseudonymized data or by federated analyses. With these results of the Interoperability Working Group, the MII is presenting a future-oriented solution for the exchange and use of healthcare data, the applicability of which goes beyond the purpose of research and can play an essential role in the digital transformation of the healthcare system.
Subject(s)
Health Information Interoperability, Humans, Datasets as Topic, Electronic Health Records, Germany, Health Information Interoperability/standards, Medical Informatics, Medical Record Linkage/methods, Systems Integration
ABSTRACT
PURPOSE: Patients suffering from chronic kidney disease (CKD) are in general at high risk for severe coronavirus disease (COVID-19), but the role of dialysis dependency (CKD5D) is poorly understood. We aimed to describe CKD5D patients in the different intervals of the pandemic and to evaluate pre-existing dialysis dependency as a potential risk factor for mortality. METHODS: In this multicentre cohort study, data from German study sites of the Lean European Open Survey on SARS-CoV-2-infected patients (LEOSS) were used. We multiply imputed missing data, performed subsequent analyses in each of the imputed data sets and pooled the results. Cases (CKD5D) and controls (CKD not requiring dialysis) were matched 1:1 by propensity-scoring. Effects on fatal outcome were calculated by multivariable logistic regression. RESULTS: The cohort consisted of 207 patients suffering from CKD5D and 964 potential controls. Multivariable regression of the whole cohort identified age (> 85 years adjusted odds ratio (aOR) 7.34, 95% CI 2.45-21.99), chronic heart failure (aOR 1.67, 95% CI 1.25-2.23), coronary artery disease (aOR 1.41, 95% CI 1.05-1.89) and active oncological disease (aOR 1.73, 95% CI 1.07-2.80) as risk factors for fatal outcome. Dialysis dependency was not associated with a fatal outcome, neither in this analysis (aOR 1.08, 95% CI 0.75-1.54) nor in the conditional multivariable regression after matching (aOR 1.34, 95% CI 0.70-2.59). CONCLUSIONS: In the present multicentre German cohort, dialysis dependency is not linked to fatal outcome in SARS-CoV-2-infected CKD patients. However, the mortality rate of 26% demonstrates that CKD patients are an extremely vulnerable population, irrespective of pre-existing dialysis dependency.
Subject(s)
COVID-19, Renal Insufficiency, Chronic, Humans, Aged, 80 and over, COVID-19/epidemiology, SARS-CoV-2, Cohort Studies, Renal Dialysis, Pandemics, Renal Insufficiency, Chronic/complications, Renal Insufficiency, Chronic/epidemiology, Renal Insufficiency, Chronic/therapy, Disease Progression
ABSTRACT
Effective and efficient privacy risk management (PRM) is a necessary condition to support digitalization in health care and secondary use of patient data in research. To reduce privacy risks, current PRM frameworks are rooted in an approach trying to reduce undesired technical/organizational outcomes such as broken encryption or unintentional data disclosure. Comparing this with risk management in preventive or therapeutic medicine, a key difference becomes apparent: in health-related risk management, medicine focuses on person-specific health outcomes, whereas PRM mostly targets more indirect, technical/organizational outcomes. In this paper, we illustrate and discuss how a PRM approach based on evidence of person-specific privacy outcomes might look using three consecutive steps: i) a specification of undesired person-specific privacy outcomes, ii) empirical assessments of their frequency and severity, and iii) empirical studies on how effectively the available PRM interventions reduce their frequency or severity. After an introduction of these three steps, we cover their status quo and outline open questions and PRM-specific challenges in need of further conceptual clarification and feasibility studies. Specific challenges of an outcome-oriented approach to PRM include the potential delays between concrete threats manifesting and the resulting person/group-specific privacy outcomes. Moreover, new ways of exploiting privacy-sensitive information to harm individuals could be developed in the future. The challenges described are of a technical, legal, ethical, financial and resource-oriented nature. In health research, however, there is explicit discussion about how to overcome such challenges to make important outcome-based assessments as feasible as possible. This paper concludes that it might be time to have this discussion in the PRM field as well.
Subject(s)
Confidentiality, Privacy, Humans
ABSTRACT
BACKGROUND: Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research. OBJECTIVE: The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption. METHODS: Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures. RESULTS: We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes.
We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV. CONCLUSIONS: The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
Subject(s)
Biomedical Research, Humans, Metadata, PubMed, Reproducibility of Results, Software
ABSTRACT
BACKGROUND: Modern biomedical research is data-driven and relies heavily on the re-use and sharing of data. Biomedical data, however, is subject to strict data protection requirements. Due to the complexity of the data required and the scale of data use, obtaining informed consent is often infeasible. Other methods, such as anonymization or federation, in turn have their own limitations. Secure multi-party computation (SMPC) is a cryptographic technology for distributed calculations, which brings formally provable security and privacy guarantees and can be used to implement a wide range of analytical approaches. As a relatively new technology, SMPC is still rarely used in real-world biomedical data sharing activities due to several barriers, including its technical complexity and lack of usability. RESULTS: To overcome these barriers, we have developed the tool EasySMPC, which is implemented in Java as a cross-platform, stand-alone desktop application provided as open-source software. The tool makes use of the SMPC method Arithmetic Secret Sharing, which allows pre-defined sets of variables to be securely summed up among different parties in two rounds of communication (input sharing and output reconstruction), and integrates this method into a graphical user interface. No additional software services need to be set up or configured, as EasySMPC uses the most widespread digital communication channel available: e-mails. No cryptographic keys need to be exchanged between the parties, and e-mails are exchanged automatically by the software. To demonstrate the practicability of our solution, we evaluated its performance in a wide range of data sharing scenarios. The results of our evaluation show that our approach is scalable (summing up 10,000 variables between 20 parties takes less than 300 s) and that the number of participants is the essential factor.
CONCLUSIONS: We have developed an easy-to-use "no-code solution" for performing secure joint calculations on biomedical data using SMPC protocols, which is suitable for use by scientists without IT expertise and which has no special infrastructure requirements. We believe that innovative approaches to data sharing with SMPC are needed to foster the translation of complex protocols into practice.
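The underlying idea of Arithmetic Secret Sharing can be sketched in a few lines: each party splits its input into random additive shares modulo a large number, so that no single share reveals anything about the input, while combining all partial sums reconstructs exactly the joint total. This is an illustrative sketch under assumed parameters (e.g. the modulus), not EasySMPC's actual implementation.

```python
import secrets

MODULUS = 2**61 - 1  # assumed choice; arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split a value into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recover the shared total by summing all shares mod MODULUS."""
    return sum(shares) % MODULUS

# Round 1 (input sharing): each party distributes shares of its input.
inputs = [12, 7, 30]
all_shares = [share(v, n_parties=3) for v in inputs]
# Each party locally sums the shares it received (one column each).
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]
# Round 2 (output reconstruction): only the total becomes known.
total = reconstruct(partial_sums)
```

No individual input can be recovered from a single party's view, yet the reconstructed total equals the plain sum of all inputs.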
Subject(s)
Biomedical Research, Computer Security, Humans, Information Dissemination, Software
ABSTRACT
PURPOSE: The ongoing pandemic caused by the novel severe acute respiratory coronavirus 2 (SARS-CoV-2) has stressed health systems worldwide. Patients with chronic kidney disease (CKD) seem to be more prone to a severe course of coronavirus disease (COVID-19) due to comorbidities and an altered immune system. The study's aim was to identify factors predicting mortality among SARS-CoV-2-infected patients with CKD. METHODS: We analyzed 2817 SARS-CoV-2-infected patients enrolled in the Lean European Open Survey on SARS-CoV-2-infected patients and identified 426 patients with pre-existing CKD. Group comparisons were performed via Chi-squared test. Using univariate and multivariable logistic regression, predictive factors for mortality were identified. RESULTS: Comparative analyses to patients without CKD revealed a higher mortality (140/426, 32.9% versus 354/2391, 14.8%). Higher age could be confirmed as a demographic predictor for mortality in CKD patients (> 85 years compared to 15-65 years, adjusted odds ratio (aOR) 6.49, 95% CI 1.27-33.20, p = 0.025). We further identified markedly elevated lactate dehydrogenase (> 2 × upper limit of normal, aOR 23.21, 95% CI 3.66-147.11, p < 0.001), thrombocytopenia (< 120,000/µl, aOR 11.66, 95% CI 2.49-54.70, p = 0.002), anemia (Hb < 10 g/dl, aOR 3.21, 95% CI 1.17-8.82, p = 0.024), and C-reactive protein (≥ 30 mg/l, aOR 3.44, 95% CI 1.13-10.45, p = 0.029) as predictors, while renal replacement therapy was not related to mortality (aOR 1.15, 95% CI 0.68-1.93, p = 0.611). CONCLUSION: The identified predictors include routinely measured and universally available parameters. Their assessment might facilitate risk stratification in this highly vulnerable cohort as early as at initial medical evaluation for SARS-CoV-2.
Subject(s)
COVID-19/complications, COVID-19/mortality, Renal Insufficiency, Chronic/complications, SARS-CoV-2, Adolescent, Adult, Aged, 80 and over, Cohort Studies, Comorbidity, Humans, Logistic Models, Middle Aged, Renal Insufficiency, Chronic/immunology, Risk Factors, Young Adult
ABSTRACT
BACKGROUND: Data sharing is considered a crucial part of modern medical research. Unfortunately, despite its advantages, it often faces obstacles, especially data privacy challenges. As a result, various approaches and infrastructures have been developed that aim to ensure that patients and research participants remain anonymous when data is shared. However, privacy protection typically comes at a cost, e.g. restrictions regarding the types of analyses that can be performed on shared data. What is lacking is a systematization making the trade-offs taken by different approaches transparent. The aim of the work described in this paper was to develop a systematization for the degree of privacy protection provided and the trade-offs taken by different data sharing methods. Based on this contribution, we categorized popular data sharing approaches and identified research gaps by analyzing combinations of promising properties and features that are not yet supported by existing approaches. METHODS: The systematization consists of different axes. Three axes relate to privacy protection aspects and were adopted from the popular Five Safes Framework: (1) safe data, addressing privacy at the input level, (2) safe settings, addressing privacy during shared processing, and (3) safe outputs, addressing privacy protection of analysis results. Three additional axes address the usefulness of approaches: (4) support for de-duplication, to enable the reconciliation of data belonging to the same individuals, (5) flexibility, to be able to adapt to different data analysis requirements, and (6) scalability, to maintain performance with increasing complexity of shared data or common analysis processes. 
RESULTS: Using the systematization, we identified three different categories of approaches: distributed data analyses, which exchange anonymous aggregated data, secure multi-party computation protocols, which exchange encrypted data, and data enclaves, which store pooled individual-level data in secure environments that can be accessed for analysis purposes. We identified important research gaps, including a lack of approaches enabling the de-duplication of horizontally distributed data or providing a high degree of flexibility. CONCLUSIONS: There are fundamental differences between different data sharing approaches and several gaps in their functionality that may be interesting to investigate in future work. Our systematization can make the properties of privacy-preserving data sharing infrastructures more transparent and support decision makers and regulatory authorities with a better understanding of the trade-offs taken.
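The first category, distributed data analyses, can be illustrated with a deliberately simplified sketch: each site releases only aggregates (here a count and a sum), and a coordinator combines them into a global statistic without ever seeing individual-level records. Real infrastructures add further safeguards, e.g. minimum group sizes for released aggregates; the names below are hypothetical.

```python
def local_aggregate(values):
    """Computed at each site; only these aggregates leave the site."""
    return {"n": len(values), "sum": sum(values)}

def pooled_mean(aggregates):
    """Combined by the coordinator from the sites' aggregates."""
    total_n = sum(a["n"] for a in aggregates)
    total_sum = sum(a["sum"] for a in aggregates)
    return total_sum / total_n

# Hypothetical per-site measurements
site_a = [4.1, 5.0, 6.3]
site_b = [5.5, 4.9]
global_mean = pooled_mean([local_aggregate(site_a), local_aggregate(site_b)])
```

The pooled mean equals the mean over the union of both sites' records, illustrating why many linear statistics lose no utility under this sharing model.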
Subject(s)
Biomedical Research, Privacy, Computer Security, Humans, Information Dissemination
ABSTRACT
OBJECTIVE: The Coronavirus Disease 2019 (COVID-19) pandemic has brought opportunities and challenges, especially for health services research based on routine data. In this article, we demonstrate this by presenting lessons learned from establishing the currently largest registry in Germany providing a detailed clinical dataset on Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)-infected patients: the Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS). METHODS: LEOSS is based on a collaborative and integrative research approach with anonymous recruitment, collection of routine data, and the early provision of data in an open science context. The only inclusion requirement was a SARS-CoV-2 infection confirmed by virological diagnosis. Crucial strategies for successfully realizing the project included the dynamic reallocation of available staff and technical resources, early and direct involvement of data protection experts and the ethics committee, and the decision for an iterative and dynamic process of improvement and further development. RESULTS: Thanks to the commitment of numerous institutions, a transsectoral and transnational network of currently 133 actively recruiting sites with 7,227 documented cases could be established (as of March 18, 2021). Tools for data exploration on the project website, as well as the partially automated provision of datasets according to use cases with varying requirements, enabled us to utilize the collected data within a short period of time. Data use and access processes were carried out for 97 proposals assigned to 27 different research areas. So far, nine articles have been published in peer-reviewed international journals. CONCLUSION: As a collaborative effort of the whole network, LEOSS developed into a large collection of clinical data on COVID-19 in Germany.
Even though other international projects were able to analyse much larger data sets for specific research questions through direct access to source systems, the uniformly maintained and technically verified documentation standard, with many discipline-specific details, resulted in a large and valuable data set with unique characteristics. The lessons learned while establishing LEOSS during the current pandemic have already yielded important implications for the design of future registries and for pandemic preparedness and response.
Subject(s)
COVID-19, Pandemics, Germany/epidemiology, Health Services Research, Humans, Pandemics/prevention & control, Registries, SARS-CoV-2
ABSTRACT
BACKGROUND: Modern data-driven medical research promises to provide new insights into the development and course of disease and to enable novel methods of clinical decision support. To realize this, machine learning models can be trained to make predictions from clinical, paraclinical and biomolecular data. In this process, privacy protection and regulatory requirements need careful consideration, as the resulting models may leak sensitive personal information. To counter this threat, a wide range of methods for integrating machine learning with formal methods of privacy protection has been proposed. However, there is a significant lack of practical tools to create and evaluate such privacy-preserving models. In this software article, we report on our ongoing efforts to bridge this gap. RESULTS: We have extended the well-known ARX anonymization tool for biomedical data with machine learning techniques to support the creation of privacy-preserving prediction models. Our methods are particularly well suited for applications in biomedicine, as they preserve the truthfulness of data (e.g. no noise is added) and they are intuitive and relatively easy to explain to non-experts. Moreover, our implementation is highly versatile, as it supports binomial and multinomial target variables, different types of prediction models and a wide range of privacy protection techniques. All methods have been integrated into a sound framework that supports the creation, evaluation and refinement of models through intuitive graphical user interfaces. To demonstrate the broad applicability of our solution, we present three case studies in which we created and evaluated different types of privacy-preserving prediction models for breast cancer diagnosis, diagnosis of acute inflammation of the urinary system and prediction of the contraceptive method used by women.
In this process, we also used a range of privacy models (k-anonymity, differential privacy and a game-theoretic approach) as well as different data transformation techniques. CONCLUSIONS: With the tool presented in this article, accurate prediction models can be created that preserve the privacy of the individuals represented in the training set in a variety of threat scenarios. Our implementation is available as open source software.
Subject(s)
Confidentiality, Data Anonymization, Clinical Decision Support Systems, Statistical Models, Software, Biomedical Research, Humans, Machine Learning, ROC Curve, Reproducibility of Results
ABSTRACT
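One of the privacy models mentioned in the abstract above is differential privacy. As a hedged illustration of the general idea (this is not the implementation integrated into ARX, and all names and the epsilon value are assumptions made for the sketch), a counting query can be protected with the classic Laplace mechanism:

```python
import math
import random

def laplace_count(true_count, epsilon, rng=random):
    """Release the result of a counting query under epsilon-differential
    privacy: a count has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices (inverse-CDF sampling of the Laplace distribution)."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return true_count + noise

# Smaller epsilon means stronger privacy and a noisier answer.
noisy = laplace_count(100, epsilon=1.0)
```

Unlike the generalization-based transformations discussed elsewhere in these abstracts, this mechanism adds noise and therefore does not preserve the truthfulness of data.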
BACKGROUND: The aim of the German Medical Informatics Initiative is to establish a national infrastructure for integrating and sharing health data. To this end, Data Integration Centers are set up at university medical centers, which address data harmonization, information security and data protection. To capture patient consent, a common informed consent template has been developed. It consists of different modules addressing permissions for using data and biosamples. On the technical level, a common digital representation of the information from signed consent templates is needed. As the partners in the initiative are free to adopt different solutions for managing consent information (e.g. IHE BPPC or HL7 FHIR Consent Resources), we had to develop an interoperability layer. METHODS: First, we compiled an overview of the data items required to reflect the information from the MII consent template as well as patient preferences and derived permissions. Next, we created entity-relationship diagrams to formally describe the conceptual data model underlying the relevant items. We then compared this data model to conceptual models describing representations of consent information using different interoperability standards. We used the result of this comparison to derive an interoperable representation that can be mapped to common standards. RESULTS: The digital representation needs to capture the following information: (1) the version of the consent, (2) the consent status for each module, and (3) the period of validity of the status. We found that there is no generally accepted solution to represent status information in a manner interoperable with all relevant standards. Hence, we developed a pragmatic solution, comprising codes which describe combinations of modules with a basic set of status labels. We propose to maintain these codes in a public registry called ART-DECOR.
We present concrete technical implementations of our approach using HL7 FHIR and IHE BPPC, which are also compatible with the open-source consent management software gICS. CONCLUSIONS: The proposed digital representation is (1) generic enough to capture relevant information from a wide range of consent documents and data use regulations and (2) interoperable with common technical standards. We plan to extend our model to include more fine-grained status codes and rules for automated access control.
Subject(s)
Computer Security, Informed Consent, Medical Informatics, Germany, Humans, Software
ABSTRACT
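The three information items identified in the abstract above (consent version, per-module status, validity period) can be captured in a small data model. The following sketch is illustrative only: the module names and status labels are assumptions, not the codes actually registered in ART-DECOR:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ModuleConsent:
    """Captures items (2) and (3): the status of one consent module and
    the period during which that status is valid."""
    module: str        # hypothetical module name, e.g. "use_of_biosamples"
    status: str        # basic status label: "permit" or "deny"
    valid_from: date
    valid_until: date

def is_permitted(entries, module, on):
    """Permit use of a module only if an entry explicitly permits it on the
    given date; anything else (no entry, expired, denied) defaults to deny."""
    for e in entries:
        if e.module == module and e.valid_from <= on <= e.valid_until:
            return e.status == "permit"
    return False
```

Defaulting to deny when no valid entry exists mirrors the conservative stance usually required for access decisions based on consent.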
BACKGROUND: The collection of data and biospecimens which characterize patients and probands in depth is a core element of modern biomedical research. Relevant data must be considered highly sensitive and needs to be protected from unauthorized use and re-identification. In this context, laws, regulations, guidelines and best practices often recommend or mandate pseudonymization, which means that directly identifying data of subjects (e.g. names and addresses) is stored separately from the data which is primarily needed for scientific analyses. DISCUSSION: When (authorized) re-identification of subjects is not an exceptional but a common procedure, e.g. due to longitudinal data collection, implementing pseudonymization can significantly increase the complexity of software solutions. For example, data stored in distributed databases need to be dynamically combined with each other, which requires additional interfaces for communication between the various subsystems. This increased complexity may lead to new attack vectors for intruders. Obviously, this is in contrast to the objective of improving data protection. What is lacking is a standardized process for evaluating and reporting risks, threats and countermeasures, which can be used to test whether integrating pseudonymization methods into data collection systems actually improves upon the degree of protection provided by system designs that simply follow common IT security best practices and implement fine-grained role-based access control models. To demonstrate that the methods used to describe systems employing pseudonymized data management are currently heterogeneous and ad hoc, we examined the extent to which twelve recent studies address each of the six basic security properties defined by the International Organization for Standardization (ISO) standard 27000. We show inconsistencies across the studies, with most of them failing to mention one or more security properties.
CONCLUSION: We discuss the degree of privacy protection provided by implementing pseudonymization into research data collection processes. We conclude that (1) more research is needed on the interplay of pseudonymity, information security and data protection, (2) problem-specific guidelines for evaluating and reporting risks, threats and countermeasures should be developed and that (3) future work on pseudonymized research data collection should include the results of such structured and integrated analyses.
Subject(s)
Anonyms and Pseudonyms, Biomedical Research, Confidentiality, Computer Communication Networks, Computer Security/standards, Humans
ABSTRACT
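The separation of identifying data from research data described in the abstract above can be sketched as a minimal pseudonymization service. Class and method names are hypothetical; in a real deployment the registry would run as a separately operated, access-controlled system rather than in the same process as the research database:

```python
import secrets

class PseudonymRegistry:
    """Keeps directly identifying data (e.g. names and addresses) apart from
    research data: analyses only ever see the pseudonym, while authorized
    re-identification resolves it back through this separate registry."""

    def __init__(self):
        self._by_identity = {}   # (name, address) -> pseudonym
        self._by_pseudonym = {}  # pseudonym -> (name, address)

    def pseudonymize(self, name, address):
        key = (name, address)
        if key not in self._by_identity:
            pid = secrets.token_hex(8)  # random, so the pseudonym itself reveals nothing
            self._by_identity[key] = pid
            self._by_pseudonym[pid] = key
        return self._by_identity[key]

    def reidentify(self, pseudonym):
        """Authorized re-identification, e.g. for longitudinal follow-up contacts."""
        return self._by_pseudonym[pseudonym]
```

Returning the same pseudonym for repeated visits of the same person is what enables longitudinal data collection; it is also exactly the interface that introduces the additional communication paths, and thus attack surface, discussed in the abstract.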
BACKGROUND: Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, where datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically, de-identification methods search a solution space of possible data transformations to find a good solution to a given de-identification problem. In this process, parts of the search space must be excluded to maintain scalability. OBJECTIVES: The set of transformations which are solution candidates is typically narrowed down by storing the results obtained during the search process and then using them to predict properties of the output of other transformations in terms of privacy (first objective) and data quality (second objective). However, due to the exponential growth of the size of the search space, previous implementations of this method are not well-suited when datasets contain many attributes which need to be protected. As this is often the case with biomedical research data, e.g. as a result of longitudinal collection, we have developed a novel method. METHODS: Our approach combines the mathematical concept of antichains with a data structure inspired by prefix trees to represent properties of a large number of data transformations while requiring only a minimal amount of information to be stored. To analyze the improvements which can be achieved by adopting our method, we have integrated it into an existing algorithm and we have also implemented a simple best-first branch and bound search (BFS) algorithm as a first step towards methods which fully exploit our approach. We have evaluated these implementations with several real-world datasets and the k-anonymity privacy model. 
RESULTS: When integrated into existing de-identification algorithms for low-dimensional data, our approach reduced memory requirements by up to one order of magnitude and execution times by up to 25%. This allowed us to increase the size of the solution spaces which could be processed by almost a factor of 10. When using the simple BFS method, we were able to further increase the size of the solution space by a factor of three. When used as a heuristic strategy for high-dimensional data, the BFS approach outperformed a state-of-the-art algorithm by up to 12% in terms of the quality of output data. CONCLUSIONS: This work shows that implementing methods of data de-identification for real-world applications is a challenging task. Our approach solves a problem often faced by data custodians: a lack of scalability of de-identification software when used with datasets having realistic schemas and volumes. The method described in this article has been implemented in ARX, an open-source de-identification software for biomedical data.
Subject(s)
Algorithms, Confidentiality, Medical Informatics/methods, Statistical Models, Humans
ABSTRACT
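The prediction step described in the abstract above exploits the monotonicity of privacy models such as k-anonymity in the generalization lattice: if one transformation yields k-anonymous output, every transformation that generalizes at least as much is guaranteed to do so as well. The sketch below illustrates this predictive tagging idea in general terms (it does not reproduce the antichain/prefix-tree data structure itself, and all names are assumptions):

```python
from itertools import product

def dominates(a, b):
    """Node a dominates node b if it generalizes every attribute
    at least as much (component-wise comparison of levels)."""
    return all(x >= y for x, y in zip(a, b))

def predict_anonymous(lattice, known_anonymous):
    """Tag every node that dominates a node already known to satisfy
    k-anonymity, so those nodes never need to be checked explicitly."""
    predicted = set()
    for node in known_anonymous:
        predicted.update(n for n in lattice if dominates(n, node))
    return predicted

# Lattice for two quasi-identifiers with generalization levels 0..2 each.
lattice = list(product(range(3), repeat=2))
```

Storing such predictions naively is what grows exponentially with the number of protected attributes; the antichain-based representation in the abstract exists precisely to avoid materializing sets like the one returned here.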
OBJECTIVE: With the ARX data anonymization tool, structured biomedical data can be de-identified using syntactic privacy models, such as k-anonymity. Data is transformed with two methods: (a) generalization of attribute values, followed by (b) suppression of data records. The former method results in data that is well suited for analyses by epidemiologists, while the latter significantly reduces loss of information. Our tool uses an optimal anonymization algorithm that maximizes output utility according to a given measure. To achieve scalability, existing optimal anonymization algorithms exclude parts of the search space by predicting the outcome of data transformations regarding privacy and utility without explicitly applying them to the input dataset. These optimizations cannot be used if data is transformed with generalization and suppression. As optimal data utility and scalability are important for anonymizing biomedical data, we had to develop a novel method. METHODS: In this article, we first confirm experimentally that combining generalization with suppression significantly increases data utility. Next, we prove that, within this coding model, the outcome of data transformations regarding privacy and utility cannot be predicted. As a consequence, existing algorithms fail to deliver optimal data utility. We confirm this finding experimentally. The limitation of previous work can be overcome at the cost of increased computational complexity. However, scalability is important for anonymizing data with user feedback. Consequently, we identify properties of datasets that may be predicted in our context and propose a novel and efficient algorithm. Finally, we evaluate our solution with multiple datasets and privacy models. RESULTS: This work presents the first thorough investigation of which properties of datasets can be predicted when data is anonymized with generalization and suppression.
Our novel approach adapts existing optimization strategies to our context and combines different search methods. The experiments show that our method is able to efficiently solve a broad spectrum of anonymization problems. CONCLUSION: Our work shows that implementing syntactic privacy models is challenging and that existing algorithms are not well suited for anonymizing data with transformation models which are more complex than generalization alone. As such models have been recommended for use in the biomedical domain, our results are of general relevance for de-identifying structured biomedical data.
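As a hedged illustration of the coding model discussed above, generalization of attribute values followed by suppression of records in groups smaller than k can be sketched as follows. The age hierarchy, the two-attribute schema and all function names are assumptions made for the example, not ARX's API:

```python
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year interval; level 2: fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def anonymize(records, level, k):
    """Apply (a) generalization to the age attribute, then (b) suppress whole
    records whose quasi-identifier combination occurs fewer than k times."""
    generalized = [(generalize_age(age, level), zip3) for age, zip3 in records]
    counts = Counter(generalized)
    return [r for r in generalized if counts[r] >= k]
```

The interaction sketched here is the crux of the abstract: how many records suppression removes depends on the generalization level chosen, which is why the outcome of a transformation can no longer be predicted from other transformations alone.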