ABSTRACT
Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which preserve complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data vastly increases computational complexity. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with a high SNP count. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As genetic privacy becomes an increasingly prominent concern, the need for effective privacy protection measures for genomic data grows. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases could be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
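As an illustration of how such comparative analyses can be set up, the following minimal sketch compares allele frequencies and pairwise linkage disequilibrium between a real and a generated haplotype panel. It assumes haplotypes stored as binary NumPy arrays of shape (n_haplotypes, n_snps); the placeholder data and all names are illustrative, not the pipeline used in the study.

    import numpy as np

    def allele_frequencies(haps):
        # Frequency of the alternate allele at each SNP.
        return haps.mean(axis=0)

    def ld_r2(haps, i, j):
        # Squared Pearson correlation between SNPs i and j,
        # a standard pairwise linkage-disequilibrium measure.
        return np.corrcoef(haps[:, i], haps[:, j])[0, 1] ** 2

    # Placeholder panels; in practice these would be real and generated haplotypes.
    rng = np.random.default_rng(0)
    real = rng.integers(0, 2, size=(1000, 500))
    artificial = rng.integers(0, 2, size=(1000, 500))
    af_corr = np.corrcoef(allele_frequencies(real),
                          allele_frequencies(artificial))[0, 1]
    print("allele-frequency correlation:", af_corr)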
Subject(s)
Genomics; Learning; Databases, Factual; Haplotypes; Neural Networks, Computer
ABSTRACT
Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the restricted access to many genetic databases due to concerns about violations of individual privacy, even though they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low-frequency alleles can be improved by augmenting reference panels with AGs, and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and of features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.
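For concreteness, below is a minimal contrastive-divergence (CD-1) sketch of the RBM component trained on binary haplotypes. This is not the authors' exact architecture or training schedule; all shapes and hyperparameters are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr = 500, 100, 0.01    # illustrative sizes
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    b = np.zeros(n_visible)                     # visible biases
    c = np.zeros(n_hidden)                      # hidden biases

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0):
        # One contrastive-divergence step on a batch of binary haplotypes v0.
        global W, b, c
        ph0 = sigmoid(v0 @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)

After training, artificial haplotypes are drawn by alternating Gibbs sampling between the visible and hidden layers.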
Subject(s)
Computer Simulation; Genome, Human; Machine Learning; Population/genetics; Algorithms; Alleles; Chromosomes, Human, Pair 15/genetics; Databases, Factual; Databases, Genetic; Deep Learning; HapMap Project; Humans; Markov Chains; Neural Networks, Computer; Polymorphism, Single Nucleotide
ABSTRACT
In this Letter we propose a new method to infer the topology of the interaction network in pairwise models with Ising variables. By using the pseudolikelihood method (PLM) at high temperature, it is generally possible to distinguish between zero and nonzero couplings because a clear gap separates the two groups. However, at lower temperatures the PLM is much less effective, and the result depends on subjective choices, such as the value of the ℓ1 regularizer and that of the threshold separating nonzero couplings from null ones. We introduce a decimation procedure based on the PLM that recursively sets to zero the least significant couplings until the variation of the pseudolikelihood signals that relevant couplings are being removed. The new method is fully automated and does not require any subjective choice by the user. Numerical tests have been performed on a wide class of Ising models with different topologies (from random graphs to finite-dimensional lattices) and different couplings (both diluted ferromagnets in a field and spin glasses). These numerical results show that the new algorithm performs better than the standard PLM.
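A compact sketch of the decimation idea follows, using plain gradient ascent on the pseudolikelihood and a fixed number of decimation steps instead of the automated pseudolikelihood-based stopping criterion described above; all names and hyperparameters are illustrative.

    import numpy as np

    def plm_fit(S, mask, beta=1.0, lr=0.05, epochs=200):
        # Gradient ascent on the pseudolikelihood of Ising couplings J,
        # restricted to the entries allowed by the binary mask.
        M, N = S.shape                      # M samples of N spins in {-1, +1}
        J = np.zeros((N, N))
        for _ in range(epochs):
            H = S @ J.T                     # H[m, i] = sum_j J_ij * s_j^m
            grad = beta * (S - np.tanh(beta * H)).T @ S / M
            J += lr * grad * mask
        return J

    def decimate(S, steps=10, frac=0.1):
        # Recursively zero the weakest fraction of couplings and refit.
        # The full method instead stops automatically when the drop in
        # pseudolikelihood signals that relevant couplings are removed.
        N = S.shape[1]
        mask = 1.0 - np.eye(N)
        for _ in range(steps):
            J = plm_fit(S, mask)
            cut = np.quantile(np.abs(J[mask > 0]), frac)
            mask[np.abs(J) <= cut] = 0.0
        return J * mask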
ABSTRACT
We characterize the equilibrium properties of a model of y coupled binary perceptrons in the teacher-student scenario, subject to a suitable cost function, with an explicit ferromagnetic coupling proportional to the Hamming distance between the students' weights. In contrast to recent works, we analyze a more general setting in which thermal noise is present that affects each student's generalization performance. In the nonzero temperature regime, we find that the coupling of replicas leads to a bend of the phase diagram towards smaller values of α: This suggests that the free entropy landscape gets smoother around the solution with perfect generalization (i.e., the teacher) at a fixed fraction of examples, allowing standard thermal updating algorithms such as Simulated Annealing to easily reach the teacher solution and avoid getting trapped in metastable states as happens in the unreplicated case, even in the computationally easy regime of the inference phase diagram. These results provide additional analytic and numerical evidence for the recently conjectured Bayes-optimal property of Replicated Simulated Annealing for a sufficient number of replicas. From a learning perspective, these results also suggest that multiple students working together (in this case reviewing the same data) are able to learn the same rule both significantly faster and with fewer examples, a property that could be exploited in the context of cooperative and federated learning.
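A minimal sketch of Replicated Simulated Annealing on this model is given below: y binary students are annealed under the error-counting cost plus a ferromagnetic coupling on the pairwise Hamming distances. Sizes, schedule and the coupling strength gamma are illustrative, not the values used in the analysis.

    import numpy as np

    rng = np.random.default_rng(1)
    N, P, y, gamma = 101, 150, 3, 0.5      # illustrative sizes and coupling
    teacher = rng.choice([-1, 1], size=N)
    X = rng.choice([-1, 1], size=(P, N))
    labels = np.sign(X @ teacher)          # N odd, so no zero arguments

    def energy(w):
        # Number of misclassified examples (the usual error-counting cost).
        return np.sum(np.sign(X @ w) != labels)

    W = rng.choice([-1, 1], size=(y, N))   # y replicated students
    for beta in np.linspace(0.1, 5.0, 200):    # annealing schedule
        for _ in range(N):
            a, i = rng.integers(y), rng.integers(N)
            w_new = W[a].copy()
            w_new[i] *= -1
            # Ferromagnetic term favoring agreement among the replicas.
            d_old = sum(np.sum(W[a] != W[b]) for b in range(y) if b != a)
            d_new = sum(np.sum(w_new != W[b]) for b in range(y) if b != a)
            dE = (energy(w_new) - energy(W[a])) + gamma * (d_new - d_old)
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                W[a] = w_new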
ABSTRACT
Data sets in the real world are often complex and to some degree hierarchical, with groups and subgroups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these data sets is an important task with many practical applications. To address this challenge, we present a general method for building relational data trees by exploiting the learning dynamics of the restricted Boltzmann machine. Our method is based on the mean-field approach, derived from the Plefka expansion and developed in the context of disordered systems, and is designed to be easily interpretable. We tested our method on an artificially created hierarchical data set and on three different real-world data sets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.
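A much-simplified illustration of the underlying mechanism is sketched below, using naive mean-field iteration rather than the Plefka-expansion corrections of the actual method; W, b and c denote the weights and biases of an already-trained RBM, and all names are illustrative.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mf_fixed_point(v, W, b, c, n_iter=200):
        # Iterate naive mean-field equations of a trained RBM starting
        # from one data point; data points that flow to the same fixed
        # point are assigned to the same branch of the tree.
        m = v.astype(float)
        for _ in range(n_iter):
            h = sigmoid(m @ W + c)      # hidden magnetizations
            m = sigmoid(h @ W.T + b)    # visible magnetizations
        return tuple(np.round(m, 2))

    # Repeating this grouping along the RBM learning trajectory (from
    # early to late training epochs) yields a hierarchy: fixed points
    # split as the machine learns progressively finer structure.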
ABSTRACT
A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of D-dimensional data points. In the particular case of manifold learning for ridge detection, we assume that the underlying structure can be modeled as a graph acting as a topological prior for the Gaussian clusters, turning the problem into a maximum a posteriori estimation. Parameters of the model are iteratively estimated through an Expectation-Maximization procedure, making the learning of the structure computationally efficient, with guaranteed convergence in polynomial time for any graph prior. We also embed in the formalism a natural way to make the algorithm robust to outliers and to heteroscedasticity of the manifold sampling, coherently with the graph structure. The method uses a graph prior given by the minimum spanning tree, which we extend using random sub-samplings of the dataset to account for cycles that can be observed in the spatial distribution.
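A simplified sketch of such an EM scheme with an MST graph prior follows: isotropic fixed-variance clusters, a plain MST recomputed over the current means at each iteration, and no outlier, heteroscedasticity or sub-sampling handling. The parameters lam and sigma2 are illustrative.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import cdist

    def em_principal_graph(X, K, lam=1.0, sigma2=0.05, n_iter=50):
        # EM for a Gaussian mixture whose means are regularized toward a
        # graph prior (here the MST over the current means).
        rng = np.random.default_rng(0)
        mu = X[rng.choice(len(X), K, replace=False)]
        for _ in range(n_iter):
            # E-step: responsibilities under isotropic Gaussians
            # (row-wise max subtracted for numerical stability).
            logR = -cdist(X, mu) ** 2 / (2 * sigma2)
            logR -= logR.max(axis=1, keepdims=True)
            R = np.exp(logR)
            R /= R.sum(axis=1, keepdims=True)
            # Graph prior: MST over the current means, symmetrized.
            A = minimum_spanning_tree(cdist(mu, mu)).toarray()
            A = ((A + A.T) > 0).astype(float)
            # M-step: data pull plus Laplacian smoothing along the graph.
            L = np.diag(A.sum(axis=1)) - A
            Nk = R.sum(axis=0)
            mu = np.linalg.solve(np.diag(Nk) + lam * L, R.T @ X)
        return mu, A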
ABSTRACT
We study the probability distribution of the pseudocritical temperature in a mean-field and in a short-range spin-glass model: the Sherrington-Kirkpatrick (SK) and the Edwards-Anderson (EA) model. In both cases, we put in evidence the underlying connection between the fluctuations of the pseudocritical point and the extreme value statistics of random variables. For the SK model, both with Gaussian and binary couplings, the distribution of the pseudocritical temperature is found to be the Tracy-Widom distribution. For the EA model, the distribution is found to be the Gumbel distribution. Since the EA model is representative of uniaxial magnetic materials with quenched disorder, such as Fe(0.5)Mn(0.5)TiO(3) or Eu(0.5)Ba(0.5)MnO(3), its pseudocritical point distribution should in principle be experimentally accessible.
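As a toy numerical illustration of the extreme-value connection (block maxima of i.i.d. variables converging to a Gumbel law), and not a computation on the spin-glass models themselves:

    import numpy as np
    from scipy import stats

    # Maxima of large blocks of i.i.d. variables follow the Gumbel
    # distribution, the extreme-value class found for the EA model.
    rng = np.random.default_rng(0)
    maxima = rng.standard_normal((10000, 500)).max(axis=1)
    loc, scale = stats.gumbel_r.fit(maxima)
    ks = stats.kstest(maxima, 'gumbel_r', args=(loc, scale))
    print(f"Gumbel fit: loc={loc:.3f}, scale={scale:.3f}, KS p={ks.pvalue:.2f}")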
ABSTRACT
We present an asymptotically exact analysis of the problem of detecting communities in sparse random networks generated by stochastic block models. Using the cavity method of statistical physics and its relationship to belief propagation, we unveil a phase transition from a regime where we can infer the correct group assignments of the nodes to one where these groups are undetectable. Our approach yields an optimal inference algorithm for detecting modules, including both assortative and disassortative functional modules, assessing their significance, and learning the parameters of the underlying block model. Our algorithm is scalable and applicable to real-world networks, as long as they are well described by the block model.
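A compact sketch of the belief-propagation update for the sparse stochastic block model, including the external field that accounts on average for the non-edges, is given below. Conventions follow the abstract; variable names are illustrative and no convergence test or learning step is included.

    import numpy as np

    def sbm_bp(edges, N, c, n, n_iter=50):
        # Belief propagation for the sparse stochastic block model.
        # edges: undirected (i, j) pairs; c: q x q affinity matrix with
        # c[r, s] = N * p_rs; n: length-q array of prior group fractions.
        q = len(n)
        rng = np.random.default_rng(0)
        adj = [[] for _ in range(N)]
        for i, j in edges:
            adj[i].append(j)
            adj[j].append(i)
        msg = {(i, j): rng.dirichlet(np.ones(q))
               for u, v in edges for i, j in ((u, v), (v, u))}
        marg = np.full((N, q), 1.0 / q)
        for _ in range(n_iter):
            # External field accounting (on average) for the non-edges.
            h = c @ marg.sum(axis=0) / N
            for i in range(N):
                incoming = {j: np.log(c @ msg[(j, i)]) for j in adj[i]}
                total = np.log(n) - h + sum(incoming.values())
                for j in adj[i]:
                    t = total - incoming[j]    # exclude j's own message
                    m = np.exp(t - t.max())
                    msg[(i, j)] = m / m.sum()
                p = np.exp(total - total.max())
                marg[i] = p / p.sum()
        return marg   # marg[i, r]: posterior probability node i is in group r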
ABSTRACT
We propose an efficient algorithm to solve inverse problems in the presence of binary clustered datasets. We consider the paradigmatic Hopfield model in a teacher-student scenario, where this situation is found in the retrieval phase. This problem has been widely analyzed through various methods such as mean-field approaches or pseudo-likelihood optimization. Our approach is based on the estimation of the posterior using the Thouless-Anderson-Palmer (TAP) equations in a parallel updating scheme. Unlike other methods, it allows one to retrieve the original patterns of the teacher dataset, and thanks to the parallel update it can be applied to large system sizes. We tackle the same problem using a restricted Boltzmann machine (RBM) and discuss analogies and differences between our algorithm and RBM learning.
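For orientation, here is a minimal sketch of the parallel (damped) TAP fixed-point iteration on which such posterior estimation builds; it computes TAP magnetizations for given couplings J and is not the full pattern-retrieval scheme of the paper.

    import numpy as np

    def tap_magnetizations(J, beta, n_iter=200, damping=0.5):
        # Parallel damped iteration of the TAP equations:
        # m_i = tanh( beta * sum_j J_ij m_j
        #             - beta^2 * m_i * sum_j J_ij^2 * (1 - m_j^2) )
        N = J.shape[0]
        rng = np.random.default_rng(0)
        m = rng.uniform(-0.1, 0.1, N)
        for _ in range(n_iter):
            onsager = beta ** 2 * m * (J ** 2 @ (1 - m ** 2))
            m_new = np.tanh(beta * (J @ m) - onsager)
            m = damping * m + (1 - damping) * m_new
        return m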
ABSTRACT
We present a framework exploiting the cascade of phase transitions occurring during a simulated annealing of the expectation-maximization algorithm to cluster datasets with multiscale structures. Using the weighted local covariance, we can extract, a posteriori and without any prior knowledge, information on the number of clusters at different scales together with their sizes. We also study the linear stability of the iterative scheme to derive the threshold at which the first transition occurs and show how to approximate the subsequent ones. Finally, we combine simulated annealing with recent developments of regularized Gaussian mixture models to learn a principal graph from spatially structured datasets that can also exhibit many scales.
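A minimal deterministic-annealing EM sketch illustrating the mechanism: the temperature T plays the role of the cluster variance, and clusters split at a cascade of critical temperatures as T is lowered. The schedule and parameters are illustrative, and the weighted-local-covariance analysis and graph regularization are omitted.

    import numpy as np
    from scipy.spatial.distance import cdist

    def annealed_em(X, K, T0=10.0, Tmin=0.05, cool=0.9, n_inner=20):
        # EM for an isotropic Gaussian mixture with annealed temperature.
        rng = np.random.default_rng(0)
        # Start all means near the global centroid; they separate as T drops.
        mu = X.mean(axis=0) + 1e-3 * rng.standard_normal((K, X.shape[1]))
        T = T0
        while T > Tmin:
            for _ in range(n_inner):
                logR = -cdist(X, mu) ** 2 / (2 * T)
                logR -= logR.max(axis=1, keepdims=True)
                R = np.exp(logR)
                R /= R.sum(axis=1, keepdims=True)
                mu = (R.T @ X) / (R.sum(axis=0)[:, None] + 1e-12)
            T *= cool
        return mu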
ABSTRACT
We introduce a random energy model on a hierarchical lattice where the interaction strength between variables is a decreasing function of their mutual hierarchical distance, making it a non-mean-field model. Through a small-coupling series expansion and a direct numerical solution of the model, we provide evidence for a spin-glass condensation transition similar to the one occurring in the usual mean-field random energy model. In contrast with the mean-field case, the high-temperature branch of the free energy is nonanalytic at the transition point.
ABSTRACT
In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, when many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the possible clustered structure of the input configurations, which may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered, and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of the Curie-Weiss and Hopfield models).
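A minimal sketch of the idea, assuming configurations have already been assigned cluster labels: apply the naive mean-field inversion within each cluster separately rather than on the pooled data. Averaging the per-cluster estimates is a simplification of the combination rule; names are illustrative.

    import numpy as np

    def nmf_couplings(S):
        # Naive mean-field inversion: off-diagonal of -C^{-1}, where C is
        # the connected correlation matrix of the configurations S.
        C = np.cov(S.T)
        J = -np.linalg.pinv(C)      # pseudo-inverse for numerical safety
        np.fill_diagonal(J, 0.0)
        return J

    def clustered_nmf(S, labels):
        # Apply the inversion within each cluster (state) separately and
        # average, instead of mixing configurations across states.
        Js = [nmf_couplings(S[labels == k]) for k in np.unique(labels)]
        return np.mean(Js, axis=0)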
ABSTRACT
In this paper we study the inference of the kinetic Ising model on sparse graphs by the decimation method. The decimation method, first proposed in Decelle and Ricci-Tersenghi [Phys. Rev. Lett. 112, 070603 (2014)] for the static inverse Ising problem, tries to recover the topology of the inferred system by iteratively setting the weakest couplings to zero. During the decimation process the likelihood function is maximized over the remaining couplings. Unlike ℓ1-optimization-based methods, the decimation method does not use the Laplace distribution as a heuristic choice of prior to select a sparse solution. In our case, the whole process can be done automatically without fixing any parameters by hand. We show that in the dynamical inference problem, where the task is to reconstruct the couplings of an Ising model given the data, the decimation process can be applied naturally within a maximum-likelihood optimization algorithm, as opposed to the static case where the pseudolikelihood method needs to be adopted. We also use extensive numerical studies to validate the accuracy of our method in dynamical inference problems. Our results illustrate that, on various topologies and with different distributions of couplings, the decimation method outperforms the widely used ℓ1-optimization-based methods.
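A sketch of the exact maximum-likelihood step for parallel Glauber dynamics, which is the objective the decimation wraps around in the kinetic case; the stopping criterion is omitted and all names are illustrative.

    import numpy as np

    def kinetic_ising_ml(S, beta=1.0, lr=0.05, epochs=300, mask=None):
        # Maximum-likelihood couplings for parallel Glauber dynamics:
        # the exact log-likelihood gradient is
        # dL/dJ_ij = < (s_i(t+1) - tanh(beta * h_i(t))) * s_j(t) >_t ,
        # with h_i(t) = sum_j J_ij s_j(t).
        T, N = S.shape                      # T time steps of N spins
        J = np.zeros((N, N))
        if mask is None:
            mask = np.ones((N, N))
        for _ in range(epochs):
            H = S[:-1] @ J.T
            grad = (S[1:] - np.tanh(beta * H)).T @ S[:-1] * beta / (T - 1)
            J += lr * grad * mask
        return J

    # Decimation then refits while recursively zeroing the weakest
    # couplings (shrinking the mask), keeping the sparsest model whose
    # likelihood has not significantly dropped.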
ABSTRACT
Renormalization group (RG) methods are still far from being completely understood in quenched disordered systems. In order to gain insight into the nature of the phase transition of these systems, it is common to investigate simple models. In this work we study a real-space RG transformation on the Dyson hierarchical lattice with a random field, which leads to a reconstruction of the RG flow and to an evaluation of the critical exponents of the model at T=0. We show that this method gives very accurate estimates of the critical exponents, by comparing our results with those obtained by some of us using an independent method.
ABSTRACT
In this paper we extend our previous work on the stochastic block model, a commonly used generative model for social and biological networks, and the problem of inferring functional groups or communities from the topology of the network. We use the cavity method of statistical physics to obtain an asymptotically exact analysis of the phase diagram. We describe in detail properties of the detectability-undetectability phase transition and the easy-hard phase transition for the community detection problem. Our analysis translates naturally into a belief propagation algorithm for inferring the group memberships of the nodes in an optimal way, i.e., that maximizes the overlap with the underlying group memberships, and learning the underlying parameters of the block model. Finally, we apply the algorithm to two examples of real-world networks and discuss its performance.
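As a rough sketch of the parameter-learning step mentioned above, the update below re-estimates the group sizes and affinities from the BP marginals; two-point edge marginals are approximated by products of one-point marginals, which is a simplification of the exact BP-based update, and all names are illustrative.

    import numpy as np

    def update_parameters(marg, edges, N):
        # EM-style update of block-model parameters from BP marginals.
        Nr = marg.sum(axis=0)                  # expected group sizes
        q = marg.shape[1]
        m = np.zeros((q, q))
        for i, j in edges:
            outer = np.outer(marg[i], marg[j])
            m += outer + outer.T               # expected edges between groups
        c = N * m / np.outer(Nr, Nr)           # affinities c_rs = N * p_rs
        return Nr / N, c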
ABSTRACT
We use a power grid model with M generators and N consumption units to optimize the grid and its control. Each consumer demand is drawn from a predefined finite-size-support distribution, thus simulating the instantaneous load fluctuations. Each generator has a maximum power capability. A generator is not overloaded if the sum of the loads of consumers connected to it does not exceed its maximum production. In the standard grid each consumer is connected only to its designated generator, while we consider a more general organization of the grid that allows each consumer to select, depending on the load, one generator from a predefined, consumer-dependent, and sufficiently small set of generators which can all serve the load. The model grid is interconnected in a graph with loops, drawn from an ensemble of random bipartite graphs, while each allowed configuration of loaded links represents a set of graph covering trees. Losses, the reactive character of the grid and the transmission-level connections between generators (and many other details relevant to realistic power grids) are ignored in this proof-of-principle study. We focus on the asymptotic limit N → ∞ with N/M → D = O(1) > 1, and we show that the interconnects allow a significant expansion of the parameter domains for which the probability of a generator overload is asymptotically zero. Our construction explores the formal relation between the problem of grid optimization and the modern theory of sparse graphical models. We also design heuristic algorithms that achieve the asymptotically optimal selection of loaded links. We conclude by discussing the ability of this approach to include other effects, such as a more realistic modeling of the power grid and related optimization and control algorithms.
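A simple greedy stand-in for the link-selection heuristics is sketched below: each consumer picks, from its small allowed set, the generator whose fractional load would stay lowest. This ignores the covering-tree constraint on loaded links and is only a baseline illustration of the assignment problem; all names are illustrative.

    import numpy as np

    def assign_loads(demands, allowed, capacity):
        # demands: per-consumer loads; allowed: per-consumer list of
        # admissible generator indices; capacity: per-generator maxima.
        # Returns the chosen generator per consumer, or None if some
        # generator would have to overload under this greedy rule.
        load = np.zeros(len(capacity))
        choice = []
        for d, gens in zip(demands, allowed):
            best = min(gens, key=lambda g: (load[g] + d) / capacity[g])
            if load[best] + d > capacity[best]:
                return None
            load[best] += d
            choice.append(best)
        return choice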