Privacy preserving identification of population stratification for collaborative genomic research.

Dervishi, Leonard; Li, Wenbiao; Halimi, Anisa; Jiang, Xiaoqian; Vaidya, Jaideep; Ayday, Erman.

Bioinformatics ; 39(39 Suppl 1): i168-i176, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387172

ABSTRACT

The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality control process is population stratification: identifying the presence of genetic difference in individuals due to subpopulations. One of the common methods used to group genomes of individuals based on ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework which utilizes PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our proposed client-server-based scheme, we initially let the server train a global PCA model on a publicly available genomic dataset which contains individuals from multiple populations. The global PCA model is later used to reduce the dimensionality of the local data by each collaborator (client). After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (in the form of their local PCA outputs) about their research datasets to the server, which then aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.

Subject(s)

Genomics , Privacy , Humans , Chromosome Mapping , Metadata , Principal Component Analysis
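The client-server flow described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it keeps only one principal component, finds it by power iteration, and assumes the Laplace mechanism on a clipped projection as the LDP step (the abstract does not say which mechanism is used). All function names, the `clip` bound, and the toy genotype matrix are hypothetical, and the server-side alignment step is omitted.

```python
import random

def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows, mean):
    return [[x - m for x, m in zip(row, mean)] for row in rows]

def top_component(rows, iters=200, seed=0):
    """Leading principal direction of centered rows, via power iteration."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) * v
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def train_global_pca(public_rows):
    """Server side: fit a 1-component PCA model on a public dataset."""
    mean = column_means(public_rows)
    return mean, top_component(center(public_rows, mean))

def project(rows, model):
    """Client side: reduce local genotypes with the global model."""
    mean, direction = model
    return [sum(x * v for x, v in zip(row, direction))
            for row in center(rows, mean)]

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(scores, epsilon, clip, rng):
    """Assumed LDP step: clip each score to [-clip, clip], add Laplace noise.

    Clipping bounds the sensitivity by 2 * clip, so the noise scale is
    2 * clip / epsilon for a single reported value per individual.
    """
    scale = 2 * clip / epsilon
    return [max(-clip, min(clip, s)) + laplace(scale, rng) for s in scores]

# Toy public panel: rows are individuals, columns are SNPs coded 0/1/2.
public = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
          [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
model = train_global_pca(public)

# A collaborator projects its local data and sends only the noisy scores.
local = [[0, 0, 2, 2], [2, 2, 0, 0]]
noisy_scores = privatize(project(local, model), epsilon=1.0, clip=5.0,
                         rng=random.Random(42))
```

In this sketch the server only ever receives `noisy_scores`. A smaller `epsilon` gives stronger privacy but blurs the separation between populations.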

Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering.

Nedoshivina, Liubov; Halimi, Anisa; Bettencourt-Silva, Joao; Braghin, Stefano.

AMIA Jt Summits Transl Sci Proc ; 2024: 85-94, 2024.

Article in English | MEDLINE | ID: mdl-38827069

ABSTRACT

The volume of information, and in particular personal information, generated each day is increasing at a staggering rate. The ability to leverage such information depends greatly on being able to satisfy the many compliance and privacy regulations that are appearing all over the world. We present READI (Relation Extraction-based Approach for De-Identification), a utility-preserving framework for de-identifying unstructured documents. READI leverages Named Entity Recognition and Relation Extraction technology to improve the quality of entity detection, thus improving the overall quality of the data de-identification process. In this proof-of-concept study, we evaluate the proposed approach on two different datasets and compare it with existing state-of-the-art approaches. We show that READI notably reduces the number of false positives and improves the utility of the de-identified text.

Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy.

Dervishi, Leonard; Wang, Xinyue; Li, Wentao; Halimi, Anisa; Vaidya, Jaideep; Jiang, Xiaoqian; Ayday, Erman.

AMIA Annu Symp Proc ; 2022: 395-404, 2022.

Article in English | MEDLINE | ID: mdl-37128365

ABSTRACT

With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off of accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.

Subject(s)

Computer Security , Privacy , Humans , Genomics , Data Collection

Privacy-Preserving and Efficient Verification of the Outcome in Genome-Wide Association Studies.

Halimi, Anisa; Dervishi, Leonard; Ayday, Erman; Pyrgelis, Apostolos; Troncoso-Pastoriza, Juan Ramón; Hubaux, Jean-Pierre; Jiang, Xiaoqian; Vaidya, Jaideep.

Proc Priv Enhanc Technol ; 2022(3): 732-753, 2022.

Article in English | MEDLINE | ID: mdl-36212774

ABSTRACT

Providing provenance in scientific workflows is essential for reproducibility and auditability purposes. In this work, we propose a framework that verifies the correctness of the aggregate statistics obtained as a result of a genome-wide association study (GWAS) conducted by a researcher while protecting individuals' privacy in the researcher's dataset. In GWAS, the goal of the researcher is to identify highly associated point mutations (variants) with a given phenotype. The researcher publishes the workflow of the conducted study, its output, and associated metadata. They keep the research dataset private while providing, as part of the metadata, a partial noisy dataset (that achieves local differential privacy). To check the correctness of the workflow output, a verifier makes use of the workflow, its metadata, and results of another GWAS (conducted using publicly available datasets) to distinguish between correct statistics and incorrect ones. For evaluation, we use real genomic data and show that the correctness of the workflow output can be verified with high accuracy even when the aggregate statistics of a small number of variants are provided. We also quantify the privacy leakage due to the provided workflow and its associated metadata and show that the additional privacy risk due to the provided metadata does not increase the existing privacy risk due to sharing of the research results. Thus, our results show that the workflow output (i.e., research results) can be verified with high confidence in a privacy-preserving way. We believe that this work will be a valuable step towards providing provenance in a privacy-preserving way while providing guarantees to the users about the correctness of the results.
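As an illustration of what a "partial noisy dataset" satisfying local differential privacy might look like, the sketch below applies generalized (k-ary) randomized response to a subset of SNP columns. This is an assumption made for illustration only: the abstract does not specify the mechanism, and the function names, genotype coding (0/1/2), and toy data are hypothetical.

```python
import math
import random

GENOTYPES = (0, 1, 2)  # assumed coding: number of minor alleles at a SNP

def randomized_response(value, epsilon, rng):
    """Report the true genotype with probability e^eps / (e^eps + k - 1),
    otherwise one of the other k - 1 genotypes uniformly at random."""
    k = len(GENOTYPES)
    keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < keep:
        return value
    return rng.choice([g for g in GENOTYPES if g != value])

def noisy_partial_dataset(rows, snp_indices, epsilon, rng):
    """Perturb only the selected SNP columns of each individual's record.

    epsilon applies per reported value; releasing several SNPs for the
    same individual consumes a correspondingly larger privacy budget.
    """
    return [[randomized_response(row[j], epsilon, rng) for j in snp_indices]
            for row in rows]

# Toy research dataset: rows are individuals, columns are SNPs.
dataset = [[0, 1, 2, 0], [2, 2, 1, 0], [1, 0, 0, 2]]
shared = noisy_partial_dataset(dataset, snp_indices=[0, 2], epsilon=1.0,
                               rng=random.Random(7))
```

Here only SNP columns 0 and 2 are released, each in perturbed form. A verifier working from such data would need to account for the known flip probability when recomputing statistics.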
