Results 1 - 20 of 30
1.
Genome Res ; 2024 Aug 07.
Article in English | MEDLINE | ID: mdl-39111815

ABSTRACT

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.
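
The bucketing idea in this abstract can be illustrated with a minimal plaintext sketch. It is not the SF-Relate hash function or protocol: the windowed exact-match key, the function names (`bucket_keys`, `build_buckets`, `candidate_pairs`) and the toy haplotypes are all assumptions made for illustration, and the encrypted comparison step is omitted entirely. The sketch only shows how hashing individuals into buckets shrinks the set of cross-party pairs that need a kinship test.

```python
from collections import defaultdict
from itertools import product


def bucket_keys(haplotype, window=8, step=8):
    """Return one key per window: (window start, alleles in the window).

    Hypothetical stand-in for an LSH function: two individuals who share an
    identical segment at the same position produce the same key.
    """
    return {
        (start, tuple(haplotype[start:start + window]))
        for start in range(0, len(haplotype) - window + 1, step)
    }


def build_buckets(cohort, window=8, step=8):
    """Map each key to the set of individuals in one party that produce it."""
    buckets = defaultdict(set)
    for name, haplotype in cohort.items():
        for key in bucket_keys(haplotype, window, step):
            buckets[key].add(name)
    return buckets


def candidate_pairs(cohort_a, cohort_b, window=8, step=8):
    """Return cross-party pairs that fall into at least one matching bucket."""
    buckets_a = build_buckets(cohort_a, window, step)
    buckets_b = build_buckets(cohort_b, window, step)
    pairs = set()
    for key in buckets_a.keys() & buckets_b.keys():
        pairs.update(product(buckets_a[key], buckets_b[key]))
    return pairs


# Toy data (invented): a1 and b1 share their first 8-site window.
party_a = {
    "a1": [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    "a2": [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
}
party_b = {
    "b1": [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "b2": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}

pairs = candidate_pairs(party_a, party_b)
print(pairs)  # {('a1', 'b1')}: 1 of the 4 possible cross-party pairs
```

In this toy case only one of four cross-party pairs survives bucketing. In a real deployment the keys would have to tolerate genotyping errors, and the follow-up comparison would run under encryption, as the abstract describes.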

2.
Res Comput Mol Biol ; 14758: 308-313, 2024.
Article in English | MEDLINE | ID: mdl-39027313

ABSTRACT

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the significant burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 94.9% of third-degree relatives, and 99.9% of second-degree or closer relatives, within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.

3.
Genome Biol ; 24(1): 271, 2023 Dec 06.
Article in English | MEDLINE | ID: mdl-38053191

ABSTRACT

BACKGROUND: Genotype imputation is an essential step in genetic studies to improve data quality and statistical power. Public imputation servers are widely used by researchers to impute their data using otherwise access-controlled reference panels of high-fidelity genomes held by these servers. RESULTS: We report evidence against the prevailing assumption that providing access to panels only indirectly via imputation servers poses a negligible privacy risk to individuals in the panels. To this end, we present algorithmic strategies for adaptively constructing artificial input samples and interpreting their imputation results that lead to the accurate reconstruction of reference panel haplotypes. We illustrate this possibility on three reference panels of real genomes for a range of imputation tools and output settings. Moreover, we demonstrate that reconstructed haplotypes from the same individual could be linked via their genetic relatives using our Bayesian linking algorithm, which allows a substantial portion of the individual's diploid genome to be reassembled. We also provide population genetic estimates of the proportion of a panel that could be linked when an adversary holds a varying number of genomes from the same population. CONCLUSIONS: Our results show that genomes in imputation server reference panels can be vulnerable to reconstruction, implying that additional safeguards may need to be considered. We suggest possible mitigation measures based on our findings. Our work illustrates the value of adversarial algorithms in uncovering new privacy risks to help inform the genomics community towards secure data sharing practices.


Subject(s)
Genome-Wide Association Study, Genome, Humans, Bayes Theorem, Genome-Wide Association Study/methods, Genotype, Haplotypes, Single Nucleotide Polymorphism
4.
Genome Res ; 33(7): 1101-1112, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37541758

ABSTRACT

Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works demonstrating such a risk could analyze only the fraction of eQTLs that are independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.


Subject(s)
Genome-Wide Association Study, Transcriptome, Humans, Gene Expression Profiling, Genotype, Quantitative Trait Loci, Single Nucleotide Polymorphism
5.
Nucleic Acids Res ; 51(W1): W535-W541, 2023 Jul 05.
Article in English | MEDLINE | ID: mdl-37246709

ABSTRACT

Advances in genomics increasingly depend on the ability to analyze large and diverse genomic data collections, which are often difficult to amass due to privacy concerns. Recent works have shown that it is possible to jointly analyze datasets held by multiple parties, while provably preserving the privacy of each party's dataset using cryptographic techniques. However, these tools have been challenging to use in practice due to the complexities of the required setup and coordination among the parties. We present sfkit, a secure and federated toolkit for collaborative genomic studies, to allow groups of collaborators to easily perform joint analyses of their datasets without compromising privacy. sfkit consists of a web server and a command-line interface, which together support a range of use cases including both auto-configured and user-supplied computational environments. sfkit provides collaborative workflows for the essential tasks of genome-wide association study (GWAS) and principal component analysis (PCA). We envision sfkit becoming a one-stop server for secure collaborative tools for a broad range of genomic analyses. sfkit is open-source and available at: https://sfkit.org.


Subject(s)
Genome-Wide Association Study, Genomics, Software, Genome-Wide Association Study/methods, Genomics/methods, Internet, Privacy, Workflow
6.
Genome Biol ; 24(1): 5, 2023 Jan 11.
Article in English | MEDLINE | ID: mdl-36631897

ABSTRACT

Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of the Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks, showing speedups of up to 3-4 times over existing pipelines with 7-fold reductions in codebase size.


Subject(s)
Computational Biology, Information Dissemination
7.
Proc IEEE Symp Secur Priv ; 2023: 1908-1925, 2023 May.
Article in English | MEDLINE | ID: mdl-38665901

ABSTRACT

Principal component analysis (PCA) is an essential algorithm for dimensionality reduction in many data science domains. We address the problem of performing a federated PCA on private data distributed among multiple data providers while ensuring data confidentiality. Our solution, SF-PCA, is an end-to-end secure system that preserves the confidentiality of both the original data and all intermediate results in a passive-adversary model with up to all-but-one colluding parties. SF-PCA jointly leverages multiparty homomorphic encryption, interactive protocols, and edge computing to efficiently interleave computations on local cleartext data with operations on collectively encrypted data. SF-PCA obtains results as accurate as non-secure centralized solutions, independently of the data distribution among the parties. It scales linearly or better with the dataset dimensions and with the number of data providers. SF-PCA is more precise than existing approaches that approximate the solution by combining local analysis results, and between 3x and 250x faster than privacy-preserving alternatives based solely on secure multiparty computation or homomorphic encryption. Our work demonstrates the practical applicability of secure and federated PCA on private distributed datasets.

9.
IEEE Trans Inf Theory ; 68(6): 4090-4105, 2022 Jun.
Article in English | MEDLINE | ID: mdl-37283781

ABSTRACT

Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.
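
The claim that simple masking leaks through correlated neighbors can be checked with a small worked example. The sketch below is not the paper's erasure mechanism: it assumes a hypothetical two-state first-order Markov chain over genotypes with an invented 9/10 "stay" probability, and the function name `posterior_masked` is made up for illustration. It computes what an observer can infer about an erased position from its two released neighbors.

```python
from fractions import Fraction


def posterior_masked(transition, left, right):
    """P(X2 = s | X1 = left, X3 = right) for a first-order Markov chain.

    By the Markov property this is proportional to
    P(X2 = s | X1 = left) * P(X3 = right | X2 = s).
    """
    weights = {
        state: transition[left][state] * transition[state][right]
        for state in transition
    }
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


# Hypothetical chain: adjacent positions agree with probability 9/10.
stay, switch = Fraction(9, 10), Fraction(1, 10)
transition = {0: {0: stay, 1: switch}, 1: {0: switch, 1: stay}}

# Position 2 is erased, but both neighbors are released with genotype 1.
posterior = posterior_masked(transition, left=1, right=1)
print(posterior[1])  # 81/82, far from the 1/2 stationary prior
```

Under these assumed parameters the erased genotype is recovered with probability 81/82, which is why a mechanism aiming for statistical independence must sometimes erase additional, non-sensitive positions as well.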

10.
Comput Vis ECCV ; 13681: 661-678, 2022 Oct.
Article in English | MEDLINE | ID: mdl-37525827

ABSTRACT

The application of modern machine learning to retinal image analyses offers valuable insights into a broad range of human health conditions beyond ophthalmic diseases. Additionally, data sharing is key to fully realizing the potential of machine learning models by providing a rich and diverse collection of training data. However, the personally identifying nature of retinal images, encompassing the unique vascular structure of each individual, often prevents this data from being shared openly. While prior works have explored image de-identification strategies based on synthetic averaging of images in other domains (e.g. facial images), existing techniques face difficulty in preserving both privacy and clinical utility in retinal images, as we demonstrate in our work. We therefore introduce k-SALSA, a generative adversarial network (GAN)-based framework for synthesizing retinal fundus images that summarize a given private dataset while satisfying the privacy notion of k-anonymity. k-SALSA brings together state-of-the-art techniques for training and inverting GANs to achieve practical performance on retinal images. Furthermore, k-SALSA leverages a new technique, called local style alignment, to generate a synthetic average that maximizes the retention of fine-grained visual patterns in the source images, thus improving the clinical utility of the generated images. On two benchmark datasets of diabetic retinopathy (EyePACS and APTOS), we demonstrate our improvement upon existing methods with respect to image fidelity, classification performance, and mitigation of membership inference attacks. Our work represents a step toward broader sharing of retinal images for scientific collaboration. Code is available at https://github.com/hcholab/k-salsa.

13.
Nat Commun ; 12(1): 5910, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-34635645

ABSTRACT

Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using our system, we accurately and efficiently reproduce two published centralized studies in a federated setting, enabling biomedical insights that are not possible from individual institutions alone. Our work represents a necessary key step towards overcoming the privacy hurdle in enabling multi-centric scientific collaborations.


Subject(s)
Precision Medicine, Privacy, Algorithms, Computer Security, Delivery of Health Care, Genome-Wide Association Study, Humans, Kaplan-Meier Estimate, Survival Analysis
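
The federated Kaplan-Meier analysis mentioned in the abstract of this record can be sketched in plaintext, as an illustration of why pooling works rather than as the FAMHE protocol. The sketch assumes the sites agree on a shared time grid and exchange only per-time counts. The function names (`local_counts`, `pool_counts`, `km_from_counts`) and the toy data are hypothetical, and the encryption of the exchanged counts is omitted.

```python
from fractions import Fraction


def local_counts(times, events, grid):
    """Per-site (number at risk, number of events) at each grid time."""
    return [
        (
            sum(1 for u in times if u >= t),
            sum(1 for u, e in zip(times, events) if u == t and e),
        )
        for t in grid
    ]


def pool_counts(counts_by_site):
    """Sum the per-time counts across sites (done under encryption in practice)."""
    return [
        (sum(n for n, _ in column), sum(d for _, d in column))
        for column in zip(*counts_by_site)
    ]


def km_from_counts(grid, counts):
    """Kaplan-Meier estimate S(t) = product over t_i <= t of (1 - d_i / n_i)."""
    survival = Fraction(1)
    curve = []
    for t, (at_risk, deaths) in zip(grid, counts):
        if at_risk:
            survival *= Fraction(at_risk - deaths, at_risk)
        curve.append((t, survival))
    return curve


# Toy data (invented): event flag 1 = event observed, 0 = censored.
grid = [1, 2, 3]
site_a = local_counts([1, 2], [1, 0], grid)
site_b = local_counts([2, 3], [1, 1], grid)

federated = km_from_counts(grid, pool_counts([site_a, site_b]))
centralized = km_from_counts(grid, local_counts([1, 2, 2, 3], [1, 0, 1, 1], grid))
print(federated)  # survival 3/4 at t=1, 1/2 at t=2, 0 at t=3
```

Because the estimator depends on the data only through additive counts, the pooled curve matches the centralized one exactly in this sketch.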
14.
Cell Syst ; 12(10): 983-993.e7, 2021 Oct 20.
Article in English | MEDLINE | ID: mdl-34450045

ABSTRACT

Genotype imputation is an essential tool in genomics research, whereby missing genotypes are inferred using reference genomes to enhance downstream analyses. Recently, public imputation servers have allowed researchers to leverage large-scale genomic data resources for imputation. However, privacy concerns about uploading one's genetic data to a server limit the utility of these services. We introduce a secure hardware-based solution for privacy-preserving genotype imputation, which keeps the input genomes private by processing them within Intel SGX's trusted execution environment. Our solution features SMac, an efficient and secure imputation algorithm designed for Intel SGX, which employs a state-of-the-art imputation strategy also utilized by existing imputation servers. SMac achieves imputation accuracy equivalent to existing tools and provides protection against known side-channel attacks on SGX while maintaining scalability. We also show the necessity of our enhanced security by identifying vulnerabilities in existing imputation software. Our work represents a step toward privacy-preserving genomic analysis services.


Subject(s)
Genomics, Privacy, Algorithms, Genotype, Software
15.
Bioinformatics ; 37(Suppl_1): i349-i357, 2021 Jul 12.
Article in English | MEDLINE | ID: mdl-34252956

ABSTRACT

MOTIVATION: Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies promise to enable the study of gene regulatory associations at unprecedented resolution in diverse cellular contexts. However, identifying unique regulatory associations observed only in specific cell types or conditions remains a key challenge; this is particularly so for rare transcriptional states whose sample sizes are too small for existing gene regulatory network inference methods to be effective. RESULTS: We present ShareNet, a Bayesian framework for boosting the accuracy of cell type-specific gene regulatory networks by propagating information across related cell types via an information sharing structure that is adaptively optimized for a given single-cell dataset. The techniques we introduce can be used with a range of general network inference algorithms to enhance the output for each cell type. We demonstrate the enhanced accuracy of our approach on three benchmark scRNA-seq datasets. We find that our inferred cell type-specific networks also uncover key changes in gene associations that underpin the complex rewiring of regulatory networks across cell types, tissues and dynamic biological processes. Our work presents a path toward extracting deeper insights about cell type-specific gene regulation in the rapidly growing compendium of scRNA-seq datasets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AVAILABILITY AND IMPLEMENTATION: The code for ShareNet is available at http://sharenet.csail.mit.edu and https://github.com/alexw16/sharenet.


Subject(s)
Gene Expression Profiling, Single-Cell Analysis, Bayes Theorem, Information Dissemination, RNA Sequence Analysis, Software
16.
Nat Biotechnol ; 39(6): 765-774, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33462509

ABSTRACT

Nonlinear data visualization methods, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), summarize the complex transcriptomic landscape of single cells in two dimensions or three dimensions, but they neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subsets of cells are given more visual space than warranted by their transcriptional diversity in the dataset. Here we present den-SNE and densMAP, which are density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to accurately incorporate information about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data. Applied to recently published datasets, our methods reveal significant changes in transcriptomic variability in a range of biological processes, including heterogeneity in transcriptomic variability of immune cells in blood and tumor, human immune cell specialization and the developmental trajectory of Caenorhabditis elegans. Our methods are readily applicable to visualizing high-dimensional data in other scientific domains.


Subject(s)
Data Visualization, Single-Cell Analysis, Transcriptome, Algorithms, Gene Expression Profiling/methods, Humans, Principal Component Analysis
17.
Cell Syst ; 10(5): 408-416.e9, 2020 May 20.
Article in English | MEDLINE | ID: mdl-32359425

ABSTRACT

Sharing data across research groups is an essential driver of biomedical research. While interactive query-answering systems for biomedical databases aim to facilitate the sharing of aggregate insights without divulging sensitive individual-level data, query answers can still leak private information about the individuals in the database. Here, we draw upon recent advances in differential privacy to introduce query-answering mechanisms that provably maximize the utility (e.g., accuracy) of the system while achieving formal privacy guarantees. We demonstrate our accuracy improvement over existing approaches for a range of use cases, including cohort discovery, variant lookup, and association testing. Our new theoretical results extend the proof of optimality of the underlying mechanism, previously known only for count queries with symmetric utility functions, to more general utility functions needed for key biomedical research workflows. Our work presents a path toward interactive biomedical databases that achieve the optimal privacy-utility trade-offs permitted by the theory of differential privacy.


Subject(s)
Information Dissemination/ethics, Information Dissemination/methods, Factual Databases, Genomics/methods, Humans, Medical Records, Privacy
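
For readers unfamiliar with differentially private query answering, as discussed in the abstract of this record, the following is a minimal sketch of a private count query such as cohort discovery. It uses the textbook Laplace mechanism, which is a swapped-in standard technique and not necessarily the mechanism studied in the paper. The names (`laplace_noise`, `private_count`) and the toy cohort are hypothetical, and Python's `random` module is used only for illustration, not as a secure noise source.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng):
    """Answer a count query with epsilon-differential privacy.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1 / epsilon suffices.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def has_variant(record):
    return record["has_variant"]


# Toy cohort (invented): three of five individuals carry the variant.
cohort = [
    {"id": 1, "has_variant": True},
    {"id": 2, "has_variant": False},
    {"id": 3, "has_variant": True},
    {"id": 4, "has_variant": True},
    {"id": 5, "has_variant": False},
]

rng = random.Random(0)  # seeded only to make the illustration repeatable
noisy_answer = private_count(cohort, has_variant, epsilon=1.0, rng=rng)
print(noisy_answer)  # the true count 3 plus Laplace noise
```

A smaller `epsilon` gives stronger privacy at the cost of noisier answers, which is the privacy-utility trade-off the abstract refers to.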
18.
Genome Biol ; 20(1): 128, 2019 Jul 02.
Article in English | MEDLINE | ID: mdl-31262363

ABSTRACT

As the scale of genomic and health-related data explodes and our understanding of these data matures, the privacy of the individuals behind the data is increasingly at stake. Traditional approaches to protect privacy have fundamental limitations. Here we discuss emerging privacy-enhancing technologies that can enable broader data sharing and collaboration in genomics research.


Subject(s)
Genetic Privacy, Genomics/ethics, Information Dissemination/ethics, Human Genome, Humans
19.
Cell Syst ; 8(6): 483-493.e7, 2019 Jun 26.
Article in English | MEDLINE | ID: mdl-31176620

ABSTRACT

Large-scale single-cell RNA sequencing (scRNA-seq) studies that profile hundreds of thousands of cells are becoming increasingly common, overwhelming existing analysis pipelines. Here, we describe how to enhance and accelerate single-cell data analysis by summarizing the transcriptomic heterogeneity within a dataset using a small subset of cells, which we refer to as a geometric sketch. Our sketches provide more comprehensive visualization of transcriptional diversity, capture rare cell types with high sensitivity, and reveal biological cell types via clustering. Our sketch of umbilical cord blood cells uncovers a rare subpopulation of inflammatory macrophages, which we experimentally validated. The construction of our sketches is extremely fast, which enabled us to accelerate other crucial resource-intensive tasks, such as scRNA-seq data integration, while maintaining accuracy. We anticipate our algorithm will become an increasingly essential step when sharing and analyzing the rapidly growing volume of scRNA-seq data and help enable the democratization of single-cell omics.


Subject(s)
Single-Cell Analysis/methods, Transcriptome, Algorithms, Animals, Data Analysis, Datasets as Topic, Genetic Heterogeneity, Humans, Macrophages, RNA-Seq
20.
Proc Mach Learn Res ; 89: 1832-1840, 2019 Apr.
Article in English | MEDLINE | ID: mdl-32832915

ABSTRACT

Representing data in hyperbolic space can effectively capture latent hierarchical relationships. To enable accurate classification of points in hyperbolic space while respecting their hyperbolic geometry, we introduce hyperbolic SVM, a hyperbolic formulation of support vector machine classifiers, and describe its theoretical connection to the Euclidean counterpart. We also generalize Euclidean kernel SVM to hyperbolic space, allowing nonlinear hyperbolic decision boundaries and providing a geometric interpretation for a certain class of indefinite kernels. Hyperbolic SVM improves classification accuracy in simulation and in real-world problems involving complex networks and word embeddings. Our work enables end-to-end analyses based on the inherent hyperbolic geometry of the data without resorting to ill-fitting tools developed for Euclidean space.
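
The geometry behind this abstract can be made concrete with a short sketch of the hyperboloid model of hyperbolic space. The sign convention for the Minkowski product, the function names, and the sign-based `classify` rule are assumptions chosen for illustration. The paper's exact formulation and training procedure are not reproduced here.

```python
import math


def minkowski_dot(u, v):
    """Minkowski product with signature (-, +, ..., +) (assumed convention)."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))


def poincare_to_hyperboloid(p):
    """Lift a point of the Poincare ball (norm < 1) onto the hyperboloid."""
    r2 = sum(c * c for c in p)
    if r2 >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    scale = 1.0 - r2
    return [(1.0 + r2) / scale] + [2.0 * c / scale for c in p]


def hyperbolic_distance(x, y):
    """Geodesic distance between two hyperboloid points."""
    # Clamp to 1.0 to guard against rounding slightly below acosh's domain.
    return math.acosh(max(1.0, -minkowski_dot(x, y)))


def classify(w, x):
    """Hypothetical linear rule: the sign of the Minkowski product with w."""
    return 1 if minkowski_dot(w, x) >= 0 else -1


origin = poincare_to_hyperboloid([0.0, 0.0])
x = poincare_to_hyperboloid([0.5, 0.0])
y = poincare_to_hyperboloid([-0.5, 0.0])

print(minkowski_dot(x, x))             # approximately -1: x lies on the hyperboloid
print(hyperbolic_distance(origin, x))  # approximately log(3)
print(classify([0.0, 1.0, 0.0], x), classify([0.0, 1.0, 0.0], y))  # 1 -1
```

With `w = [0, 1, 0]` the decision boundary is the geodesic through the origin that separates points by the sign of their first spatial coordinate, a hyperbolic analogue of a Euclidean hyperplane through the origin.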
