RESUMEN
Multisite medical data sharing is critical in modern clinical practice and medical research. The challenge is to conduct data sharing that preserves individual privacy and data utility. The shortcomings of traditional privacy-enhancing technologies mean that institutions rely upon bespoke data sharing contracts. The lengthy process and administration induced by these contracts increases the inefficiency of data sharing and may disincentivize important clinical treatment and medical research. This paper provides a synthesis between 2 novel advanced privacy-enhancing technologies-homomorphic encryption and secure multiparty computation (defined together as multiparty homomorphic encryption). These privacy-enhancing technologies provide a mathematical guarantee of privacy, with multiparty homomorphic encryption providing a performance advantage over separately using homomorphic encryption or secure multiparty computation. We argue multiparty homomorphic encryption fulfills legal requirements for medical data sharing under the European Union's General Data Protection Regulation which has set a global benchmark for data protection. Specifically, the data processed and shared using multiparty homomorphic encryption can be considered anonymized data. We explain how multiparty homomorphic encryption can reduce the reliance upon customized contractual measures between institutions. The proposed approach can accelerate the pace of medical research while offering additional incentives for health care and research institutes to employ common data interoperability standards.
Asunto(s)
Seguridad Computacional/ética , Difusión de la Información/ética , Privacidad/legislación & jurisprudencia , Tecnología/métodos , HumanosRESUMEN
In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients' complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data.
Asunto(s)
Compresión de Datos/métodos , Genómica/métodos , Almacenamiento y Recuperación de la Información/métodos , Algoritmos , Biología Computacional/métodos , Seguridad Computacional/normas , Privacidad Genética , Genoma Humano , HumanosRESUMEN
MOTIVATION: Due to the limited power of small-scale genome-wide association studies (GWAS), researchers tend to collaborate and establish a larger consortium in order to perform large-scale GWAS. Genome-wide association meta-analysis (GWAMA) is a statistical tool that aims to synthesize results from multiple independent studies to increase the statistical power and reduce false-positive findings of GWAS. However, it has been demonstrated that the aggregate data of individual studies are subject to inference attacks, hence privacy concerns arise when researchers share study data in GWAMA. RESULTS: In this article, we propose a secure quality control (SQC) protocol, which enables checking the quality of data in a privacy-preserving way without revealing sensitive information to a potential adversary. SQC employs state-of-the-art cryptographic and statistical techniques for privacy protection. We implement the solution in a meta-analysis pipeline with real data to demonstrate the efficiency and scalability on commodity machines. The distributed execution of SQC on a cluster of 128 cores for one million genetic variants takes less than one hour, which is a modest cost considering the 10-month time span usually observed for the completion of the QC procedure that includes timing of logistics. AVAILABILITY AND IMPLEMENTATION: SQC is implemented in Java and is publicly available at https://github.com/acs6610987/secureqc. CONTACT: jean-pierre.hubaux@epfl.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Confidencialidad , Estudio de Asociación del Genoma Completo/métodos , Metaanálisis como Asunto , Control de Calidad , Estudio de Asociación del Genoma Completo/normas , HumanosRESUMEN
PURPOSE: Protecting patient privacy is a major obstacle for the implementation of genomic-based medicine. Emerging privacy-enhancing technologies can become key enablers for managing sensitive genetic data. We studied physicians' attitude toward this kind of technology in order to derive insights that might foster their future adoption for clinical care. METHODS: We conducted a questionnaire-based survey among 55 physicians of the Swiss HIV Cohort Study who tested the first implementation of a privacy-preserving model for delivering genomic test results. We evaluated their feedback on three different aspects of our model: clinical utility, ability to address privacy concerns and system usability. RESULTS: 38/55 (69%) physicians participated in the study. Two thirds of them acknowledged genetic privacy as a key aspect that needs to be protected to help building patient trust and deploy new-generation medical information systems. All of them successfully used the tool for evaluating their patients' pharmacogenomics risk and 90% were happy with the user experience and the efficiency of the tool. Only 8% of physicians were unsatisfied with the level of information and wanted to have access to the patient's actual DNA sequence. CONCLUSION: This survey, although limited in size, represents the first evaluation of privacy-preserving models for genomic-based medicine. It has allowed us to derive unique insights that will improve the design of these new systems in the future. In particular, we have observed that a clinical information system that uses homomorphic encryption to provide clinicians with risk information based on sensitive genetic test results can offer information that clinicians feel sufficient for their needs and appropriately respectful of patients' privacy. The ability of this kind of systems to ensure strong security and privacy guarantees and to provide some analytics on encrypted data has been assessed as a key enabler for the management of sensitive medical information in the near future. Providing clinically relevant information to physicians while protecting patients' privacy in order to comply with regulations is crucial for the widespread use of these new technologies.
Asunto(s)
Seguridad Computacional , Confidencialidad , Infecciones por VIH/genética , Infecciones por VIH/terapia , Adulto , Anciano , Estudios de Cohortes , Registros Electrónicos de Salud , Femenino , Variación Genética , Genómica , Genotipo , Humanos , Masculino , Informática Médica , Persona de Mediana Edad , Médicos , Programas Informáticos , Encuestas y Cuestionarios , Suiza , Interfaz Usuario-ComputadorRESUMEN
PURPOSE: The implementation of genomic-based medicine is hindered by unresolved questions regarding data privacy and delivery of interpreted results to health-care practitioners. We used DNA-based prediction of HIV-related outcomes as a model to explore critical issues in clinical genomics. METHODS: We genotyped 4,149 markers in HIV-positive individuals. Variants allowed for prediction of 17 traits relevant to HIV medical care, inference of patient ancestry, and imputation of human leukocyte antigen (HLA) types. Genetic data were processed under a privacy-preserving framework using homomorphic encryption, and clinical reports describing potentially actionable results were delivered to health-care providers. RESULTS: A total of 230 patients were included in the study. We demonstrated the feasibility of encrypting a large number of genetic markers, inferring patient ancestry, computing monogenic and polygenic trait risks, and reporting results under privacy-preserving conditions. The average execution time of a multimarker test on encrypted data was 865 ms on a standard computer. The proportion of tests returning potentially actionable genetic results ranged from 0 to 54%. CONCLUSIONS: The model of implementation presented herein informs on strategies to deliver genomic test results for clinical care. Data encryption to ensure privacy helps to build patient trust, a key requirement on the road to genomic-based medicine.Genet Med 18 8, 814-822.
Asunto(s)
Seguridad Computacional , Privacidad Genética , Infecciones por VIH/genética , Variación Genética , Genómica/ética , Humanos , Modelos TeóricosRESUMEN
Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While the computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.
RESUMEN
Principal component analysis (PCA) is an essential algorithm for dimensionality reduction in many data science domains. We address the problem of performing a federated PCA on private data distributed among multiple data providers while ensuring data confidentiality. Our solution, SF-PCA, is an end-to-end secure system that preserves the confidentiality of both the original data and all intermediate results in a passive-adversary model with up to all-but-one colluding parties. SF-PCA jointly leverages multiparty homomorphic encryption, interactive protocols, and edge computing to efficiently interleave computations on local cleartext data with operations on collectively encrypted data. SF-PCA obtains results as accurate as non-secure centralized solutions, independently of the data distribution among the parties. It scales linearly or better with the dataset dimensions and with the number of data providers. SF-PCA is more precise than existing approaches that approximate the solution by combining local analysis results, and between 3x and 250x faster than privacy-preserving alternatives based solely on secure multiparty computation or homomorphic encryption. Our work demonstrates the practical applicability of secure and federated PCA on private distributed datasets.
RESUMEN
Training accurate and robust machine learning models requires a large amount of data that is usually scattered across data silos. Sharing or centralizing the data of different healthcare institutions is, however, unfeasible or prohibitively difficult due to privacy regulations. In this work, we address this problem by using a privacy-preserving federated learning-based approach, PriCell, for complex models such as convolutional neural networks. PriCell relies on multiparty homomorphic encryption and enables the collaborative training of encrypted neural networks with multiple healthcare institutions. We preserve the confidentiality of each institutions' input data, of any intermediate values, and of the trained model parameters. We efficiently replicate the training of a published state-of-the-art convolutional neural network architecture in a decentralized and privacy-preserving manner. Our solution achieves an accuracy comparable with the one obtained with the centralized non-secure solution. PriCell guarantees patient privacy and ensures data utility for efficient multi-center studies involving complex healthcare data.
RESUMEN
Providing provenance in scientific workflows is essential for reproducibility and auditability purposes. In this work, we propose a framework that verifies the correctness of the aggregate statistics obtained as a result of a genome-wide association study (GWAS) conducted by a researcher while protecting individuals' privacy in the researcher's dataset. In GWAS, the goal of the researcher is to identify highly associated point mutations (variants) with a given phenotype. The researcher publishes the workflow of the conducted study, its output, and associated metadata. They keep the research dataset private while providing, as part of the metadata, a partial noisy dataset (that achieves local differential privacy). To check the correctness of the workflow output, a verifier makes use of the workflow, its metadata, and results of another GWAS (conducted using publicly available datasets) to distinguish between correct statistics and incorrect ones. For evaluation, we use real genomic data and show that the correctness of the workflow output can be verified with high accuracy even when the aggregate statistics of a small number of variants are provided. We also quantify the privacy leakage due to the provided workflow and its associated metadata and show that the additional privacy risk due to the provided metadata does not increase the existing privacy risk due to sharing of the research results. Thus, our results show that the workflow output (i.e., research results) can be verified with high confidence in a privacy-preserving way. We believe that this work will be a valuable step towards providing provenance in a privacy-preserving way while providing guarantees to the users about the correctness of the results.
RESUMEN
Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using our system, we accurately and efficiently reproduce two published centralized studies in a federated setting, enabling biomedical insights that are not possible from individual institutions alone. Our work represents a necessary key step towards overcoming the privacy hurdle in enabling multi-centric scientific collaborations.
Asunto(s)
Medicina de Precisión , Privacidad , Algoritmos , Seguridad Computacional , Atención a la Salud , Estudio de Asociación del Genoma Completo , Humanos , Estimación de Kaplan-Meier , Análisis de SupervivenciaRESUMEN
The growing number of health-data breaches, the use of genomic databases for law enforcement purposes and the lack of transparency of personal genomics companies are raising unprecedented privacy concerns. To enable a secure exploration of genomic datasets with controlled and transparent data access, we propose a citizen-centric approach that combines cryptographic privacy-preserving technologies, such as homomorphic encryption and secure multi-party computation, with the auditability of blockchains. Our open-source implementation supports queries on the encrypted genomic data of hundreds of thousands of individuals, with minimal overhead. We show that real-world adoption of our system alleviates widespread privacy concerns and encourages data access sharing with researchers.
RESUMEN
Genotype imputation is a fundamental step in genomic data analysis, where missing variant genotypes are predicted using the existing genotypes of nearby "tag" variants. Although researchers can outsource genotype imputation, privacy concerns may prohibit genetic data sharing with an untrusted imputation service. Here, we developed secure genotype imputation using efficient homomorphic encryption (HE) techniques. In HE-based methods, the genotype data are secure while it is in transit, at rest, and in analysis. It can only be decrypted by the owner. We compared secure imputation with three state-of-the-art non-secure methods and found that HE-based methods provide genetic data security with comparable accuracy for common variants. HE-based methods have time and memory requirements that are comparable or lower than those for the non-secure methods. Our results provide evidence that HE-based methods can practically perform resource-intensive computations for high-throughput genetic data analysis. The source code is freely available for download at https://github.com/K-miran/secure-imputation.
Asunto(s)
Servicios Externos , Seguridad Computacional , Estudio de Asociación del Genoma Completo , Genotipo , PrivacidadRESUMEN
Medical studies are usually time consuming, cumbersome and extremely costly to perform, and for exploratory research, their results are also difficult to predict a priori. This is particularly the case for rare diseases, for which finding enough patients is difficult and usually requires an international-scale research. In this case, the process can be even more difficult due to the heterogeneity of data-protection regulations, making the data sharing process particularly hard. In this short paper, we propose MedCo2 (pronounced MedCo square), a distributed system that streamlines the process of a medical study by bridging and enabling both data discovery and data analysis among multiple databases, while protecting data confidentiality and patients' privacy. MedCo2 relies on interactive protocols, homomorphic encryption and differential privacy. It enables the privacy-preserving computations of multiple statistics such as cosine similarity and variance, and the training of machine learning models, on patients that are obliviously selected according to specific criteria among multiple databases.
Asunto(s)
Privacidad , Estudios de Cohortes , Seguridad Computacional , Confidencialidad , Humanos , Aprendizaje AutomáticoRESUMEN
Personalised medicine can improve both public and individual health by providing targeted preventative and therapeutic healthcare. However, patient health data must be shared between institutions and across jurisdictions for the benefits of personalised medicine to be realised. Whilst data protection, privacy, and research ethics laws protect patient confidentiality and safety they also may impede multisite research, particularly across jurisdictions. Accordingly, we compare the concept of data accessibility in data protection and research ethics laws across seven jurisdictions. These jurisdictions include Switzerland, Italy, Spain, the United Kingdom (which have implemented the General Data Protection Regulation), the United States, Canada, and Australia. Our paper identifies the requirements for consent, the standards for anonymisation or pseudonymisation, and adequacy of protection between jurisdictions as barriers for sharing. We also identify differences between the European Union and other jurisdictions as a significant barrier for data accessibility in cross jurisdictional multisite research. Our paper concludes by considering solutions to overcome these legislative differences. These solutions include data transfer agreements and organisational collaborations designed to `front load' the process of ethics approval, so that subsequent research protocols are standardised. We also allude to technical solutions, such as distributed computing, secure multiparty computation and homomorphic encryption.
RESUMEN
One major obstacle to developing precision medicine to its full potential is the privacy concerns related to genomic-data sharing. Even though the academic community has proposed many solutions to protect genomic privacy, these so far have not been adopted in practice, mainly due to their impact on the data utility. We introduce GenoShare, a framework that enables individual citizens to understand and quantify the risks of revealing genome-related privacy-sensitive attributes (e.g., health status, kinship, physical traits) from sharing their genomic data with (potentially untrusted) third parties. GenoShare enables informed decision-making about sharing exact genomic data, by jointly simulating genome-based inference attacks and quantifying the risk stemming from a potential data disclosure.
Asunto(s)
Bases de Datos Genéticas/ética , Privacidad Genética , Genómica/ética , Difusión de la Información/ética , Consentimiento Informado , Confidencialidad , Revelación , Genoma , Humanos , Registro Médico CoordinadoRESUMEN
MedCo is the first operational system that makes sensitive medical-data available for research in a simple, privacy-conscious and secure way. It enables a consortium of clinical sites to collectively protect their data and to securely share them with investigators, without single points of failure. In this short paper, we report on our ongoing effort for the operational deployment of MedCo within the context of the Swiss Personalized Health Network (SPHN) for the Swiss Molecular Tumor Board.
Asunto(s)
Neoplasias , Privacidad , Seguridad Computacional , Confidencialidad , Registros Electrónicos de Salud , Humanos , Poder Psicológico , SuizaRESUMEN
The increasing number of health-data breaches is creating a complicated environment for medical-data sharing and, consequently, for medical progress. Therefore, the development of new solutions that can reassure clinical sites by enabling privacy-preserving sharing of sensitive medical data in compliance with stringent regulations (e.g., HIPAA, GDPR) is now more urgent than ever. In this work, we introduce MedCo, the first operational system that enables a group of clinical sites to federate and collectively protect their data in order to share them with external investigators without worrying about security and privacy concerns. MedCo uses (a) collective homomorphic encryption to provide trust decentralization and end-to-end confidentiality protection, and (b) obfuscation techniques to achieve formal notions of privacy, such as differential privacy. A critical feature of MedCo is that it is fully integrated within the i2b2 (Informatics for Integrating Biology and the Bedside) framework, currently used in more than 300 hospitals worldwide. Therefore, it is easily adoptable by clinical sites. We demonstrate MedCo's practicality by testing it on data from The Cancer Genome Atlas in a simulated network of three institutions. Its performance is comparable to the ones of SHRINE (networked i2b2), which, in contrast, does not provide any data protection guarantee.
Asunto(s)
Seguridad Computacional , Registros Electrónicos de Salud , Genómica , Informática Médica/métodos , Algoritmos , Confidencialidad , Genoma Humano , Hospitales , Humanos , Internet , Mutación , Neoplasias/genética , Proteínas Proto-Oncogénicas B-raf/genética , Programas InformáticosRESUMEN
The biomedical community is lagging in the adoption of cloud computing for the management of medical data. The primary obstacles are concerns about privacy and security. In this paper, we explore the feasibility of using advanced privacy-enhancing technologies in order to enable the sharing of sensitive clinical data in a public cloud. Our goal is to facilitate sharing of clinical data in the cloud by minimizing the risk of unintended leakage of sensitive clinical information. In particular, we focus on homomorphic encryption, a specific type of encryption that offers the ability to run computation on the data while the data remains encrypted. This paper demonstrates that homomorphic encryption can be used efficiently to compute aggregating queries on the ciphertexts, along with providing end-to-end confidentiality of aggregate-level data from the i2b2 data model.
RESUMEN
Re-use of patients' health records can provide tremendous benefits for clinical research. Yet, when researchers need to access sensitive/identifying data, such as genomic data, in order to compile cohorts of well-characterized patients for specific studies, privacy and security concerns represent major obstacles that make such a procedure extremely difficult if not impossible. In this paper, we address the challenge of designing and deploying in a real operational setting an efficient privacy-preserving explorer for genetic cohorts. Our solution is built on top of the i2b2 (Informatics for Integrating Biology and the Bedside) framework and leverages cutting-edge privacy-enhancing technologies such as homomorphic encryption and differential privacy. Solutions involving homomorphic encryption are often believed to be costly and immature for use in operational environments. Here, we show that, for specific applications, homomorphic encryption is actually a very efficient enabler. Indeed, our solution outperforms prior work by enabling a researcher to securely compute simple statistics on more than 3,000 encrypted genetic variants simultaneously for a cohort of 5,000 individuals in less than 5 seconds with commodity hardware. To the best of our knowledge, our privacy-preserving solution is the first to also be successfully deployed and tested in a operation setting (Lausanne University Hospital).
Asunto(s)
Seguridad Computacional/normas , Registros Electrónicos de Salud , Privacidad Genética/normas , Genómica , Computación en Informática Médica , HumanosRESUMEN
BACKGROUND: Cloud computing is becoming the preferred solution for efficiently dealing with the increasing amount of genomic data. Yet, outsourcing storage and processing sensitive information, such as genomic data, comes with important concerns related to privacy and security. This calls for new sophisticated techniques that ensure data protection from untrusted cloud providers and that still enable researchers to obtain useful information. METHODS: We present a novel privacy-preserving algorithm for fully outsourcing the storage of large genomic data files to a public cloud and enabling researchers to efficiently search for variants of interest. In order to protect data and query confidentiality from possible leakage, our solution exploits optimal encoding for genomic variants and combines it with homomorphic encryption and private information retrieval. Our proposed algorithm is implemented in C++ and was evaluated on real data as part of the 2016 iDash Genome Privacy-Protection Challenge. RESULTS: Results show that our solution outperforms the state-of-the-art solutions and enables researchers to search over millions of encrypted variants in a few seconds. CONCLUSIONS: As opposed to prior beliefs that sophisticated privacy-enhancing technologies (PETs) are unpractical for real operational settings, our solution demonstrates that, in the case of genomic data, PETs are very efficient enablers.