Search | Nursing VHL Search Portal

Deep convolutional and conditional neural networks for large-scale genomic data generation.

Yelmen, Burak; Decelle, Aurélien; Boulos, Leila Lea; Szatkownik, Antoine; Furtlehner, Cyril; Charpiat, Guillaume; Jay, Flora.

PLoS Comput Biol ; 19(10): e1011584, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37903158

ABSTRACT

Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.

Subject(s)

Genomics; Learning; Databases, Factual; Haplotypes; Neural Networks, Computer

Creating artificial human genomes using generative neural networks.

Yelmen, Burak; Decelle, Aurélien; Ongaro, Linda; Marnetto, Davide; Tallec, Corentin; Montinaro, Francesco; Furtlehner, Cyril; Pagani, Luca; Jay, Flora.

PLoS Genet ; 17(2): e1009303, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33539374

ABSTRACT

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with none to little privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low frequency alleles can be improved by data augmentation to reference panels with AGs and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.
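As a rough illustration of the first check named in the abstract (whether artificial genomes replicate the allele frequencies of the source dataset), the sketch below compares per-site allele frequencies between two small haplotype panels. It is not the authors' evaluation code: the function names `allele_frequencies` and `mean_abs_frequency_gap`, the 0/1 haplotype encoding, and the toy panels are all assumptions made for this example.

```python
from typing import List, Sequence


def allele_frequencies(haplotypes: Sequence[Sequence[int]]) -> List[float]:
    """Per-site frequency of allele 1 in a panel of 0/1-coded haplotypes."""
    if not haplotypes:
        raise ValueError("panel must contain at least one haplotype")
    n_haplotypes = len(haplotypes)
    n_sites = len(haplotypes[0])
    return [
        sum(hap[site] for hap in haplotypes) / n_haplotypes
        for site in range(n_sites)
    ]


def mean_abs_frequency_gap(
    real: Sequence[Sequence[int]], artificial: Sequence[Sequence[int]]
) -> float:
    """Mean absolute difference between the two panels' site frequencies."""
    freq_real = allele_frequencies(real)
    freq_artificial = allele_frequencies(artificial)
    if len(freq_real) != len(freq_artificial):
        raise ValueError("panels must cover the same sites")
    gaps = [abs(a - b) for a, b in zip(freq_real, freq_artificial)]
    return sum(gaps) / len(gaps)


# Hypothetical toy panels: 4 haplotypes x 4 sites each (not real data).
real_panel = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0]]
artificial_panel = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0]]

print(allele_frequencies(real_panel))
print(mean_abs_frequency_gap(real_panel, artificial_panel))
```

A gap near zero only indicates that single-site frequencies agree; the other properties listed in the abstract, such as linkage disequilibrium and population structure, would need separate checks.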

Subject(s)

Computer Simulation; Genome, Human; Machine Learning; Population/genetics; Algorithms; Alleles; Chromosomes, Human, Pair 15/genetics; Databases, Factual; Databases, Genetic; Deep Learning; HapMap Project; Humans; Markov Chains; Neural Networks, Computer; Polymorphism, Single Nucleotide

Scaling analysis of affinity propagation.

Furtlehner, Cyril; Sebag, Michèle; Zhang, Xiangliang.

Phys Rev E Stat Nonlin Soft Matter Phys ; 81(6 Pt 2): 066102, 2010 Jun.

Article in English | MEDLINE | ID: mdl-20866473

ABSTRACT

We analyze and exploit some scaling properties of the affinity propagation (AP) clustering algorithm proposed by Frey and Dueck [Science 315, 972 (2007)]. Following a divide-and-conquer strategy, we set up an exact renormalization-based approach to address the question of clustering consistency, in particular how many clusters are present in a given data set. We first observe that the divide-and-conquer strategy, used hierarchically on a large data set, reduces the complexity from $O(N^2)$ to $O(N^{(h+2)/(h+1)})$ for a data set of size $N$ and a depth $h$ of the hierarchical strategy. For a data set embedded in a $d$-dimensional space, we show that this is obtained without notably damaging the precision except in dimension $d=2$. In fact, for $d$ larger than 2 the relative loss in precision scales as $N^{(2-d)/(h+1)d}$. Finally, under some conditions we observe that there is a value $s^*$ of the penalty coefficient, a free parameter used to fix the number of clusters, which separates a fragmentation phase (for $s<s^*$) from a coalescent one (for $s>s^*$) of the underlying hidden cluster structure. At this precise point a self-similarity property holds, which the hierarchical strategy can exploit to actually locate its position, as a result of an exact decimation procedure. From this observation, a strategy based on AP can be defined to find out how many clusters are present in a given data set.
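To make the complexity claim concrete, the sketch below evaluates the exponent (h+2)/(h+1) quoted in the abstract for a few hierarchy depths h. It only does the arithmetic of the stated scaling law and does not implement affinity propagation; the function names `hierarchical_exponent` and `estimated_cost`, and the choice of N = 10^6, are assumptions made for this example.

```python
from fractions import Fraction


def hierarchical_exponent(h: int) -> Fraction:
    """Exponent of N in the cost N^((h+2)/(h+1)) for hierarchy depth h."""
    if h < 0:
        raise ValueError("depth h must be non-negative")
    return Fraction(h + 2, h + 1)


def estimated_cost(n: int, h: int) -> float:
    """Order-of-magnitude cost N^((h+2)/(h+1)), ignoring constant factors."""
    return n ** float(hierarchical_exponent(h))


n_points = 10**6  # hypothetical data set size
for depth in range(4):
    exponent = hierarchical_exponent(depth)
    cost = estimated_cost(n_points, depth)
    print(f"h={depth}: exponent={exponent}, cost ~ {cost:.3g}")
```

With h = 0 the exponent is 2, which recovers the quadratic cost of running AP directly on the whole data set; as h grows, the exponent decreases toward 1.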
