Results 1 - 15 of 15
1.
PLoS Comput Biol ; 19(10): e1011584, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37903158

ABSTRACT

Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
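The convolutional WGAN and conditional RBM used in the study are too large to reproduce here, but the basic generative building block, a Bernoulli restricted Boltzmann machine trained with one-step contrastive divergence, can be sketched on toy binary "haplotypes". Everything below (network sizes, learning rate, the synthetic SNP matrix) is illustrative and not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Bernoulli-Bernoulli RBM trained with CD-1 (illustrative only)."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # hidden biases
        self.lr = lr

    def p_h(self, v):
        return sigmoid(v @ self.W + self.c)

    def p_v(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        # one positive/negative phase of contrastive divergence
        ph0 = self.p_h(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = self.p_v(h0)
        ph1 = self.p_h(pv1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

    def sample(self, n_samples, gibbs_steps=100):
        # generate new binary sequences by alternating Gibbs sampling
        v = (rng.random((n_samples, self.b.size)) < 0.5).astype(float)
        for _ in range(gibbs_steps):
            h = (rng.random((n_samples, self.c.size)) < self.p_h(v)).astype(float)
            v = (rng.random((n_samples, self.b.size)) < self.p_v(h)).astype(float)
        return v

# Toy "haplotypes": 300 binary sequences of 20 SNPs, a shared correlated
# block plus 5% flip noise -- a stand-in for real phased genotype data.
core = (rng.random((300, 1)) < 0.7).astype(float)
data = np.repeat(core, 20, axis=1)
flips = (rng.random(data.shape) < 0.05).astype(float)
data = np.abs(data - flips)

rbm = RBM(n_visible=20, n_hidden=10)
for _ in range(300):
    rbm.cd1_step(data)

synthetic = rbm.sample(300)
print("real allele freq.:", data.mean().round(3),
      "synthetic:", synthetic.mean().round(3))
```

On real data the visible layer would hold thousands of SNPs per haplotype, which is exactly the scalability problem the convolutional WGAN and conditional RBM of the paper are designed to address.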


Subject(s)
Genomics , Learning , Databases, Factual , Haplotypes , Neural Networks, Computer
2.
PLoS Genet ; 17(2): e1009303, 2021 02.
Article in English | MEDLINE | ID: mdl-33539374

ABSTRACT

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although such databases would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low frequency alleles can be improved by data augmentation to reference panels with AGs and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.


Subject(s)
Computer Simulation , Genome, Human , Machine Learning , Population/genetics , Algorithms , Alleles , Chromosomes, Human, Pair 15/genetics , Databases, Factual , Databases, Genetic , Deep Learning , HapMap Project , Humans , Markov Chains , Neural Networks, Computer , Polymorphism, Single Nucleotide
3.
Phys Rev Lett ; 112(7): 070603, 2014 Feb 21.
Article in English | MEDLINE | ID: mdl-24579583

ABSTRACT

In this Letter we propose a new method to infer the topology of the interaction network in pairwise models with Ising variables. By using the pseudolikelihood method (PLM) at high temperature, it is generally possible to distinguish between zero and nonzero couplings because a clear gap separates the two groups. However, at lower temperatures the PLM is much less effective and the result depends on subjective choices, such as the value of the ℓ1 regularizer and that of the threshold used to separate nonzero couplings from null ones. We introduce a decimation procedure based on the PLM that recursively sets the least significant couplings to zero, until the variation of the pseudolikelihood signals that relevant couplings are being removed. The new method is fully automated and does not require any subjective choice by the user. Numerical tests have been performed on a wide class of Ising models with different topologies (from random graphs to finite-dimensional lattices) and different couplings (both diluted ferromagnets in a field and spin glasses). These numerical results show that the new algorithm performs better than the standard PLM.
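A minimal sketch of the decimation idea described above, assuming nothing beyond the abstract: fit couplings by gradient ascent on the pseudolikelihood, then repeatedly zero the weakest remaining coupling and refit while monitoring the pseudolikelihood. All sizes and hyperparameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8  # number of Ising spins

# Sparse ground-truth couplings: ~30% of pairs carry J_ij = +/- 0.8.
J_true = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        if rng.random() < 0.3:
            J_true[i, j] = J_true[j, i] = rng.choice([-0.8, 0.8])

def gibbs_samples(J, n_samples, burn=500):
    """Single-site Gibbs sampler for P(s) ~ exp(sum_ij J_ij s_i s_j)."""
    s = rng.choice([-1.0, 1.0], size=N)
    out = np.empty((n_samples, N))
    for t in range(burn + n_samples):
        for i in range(N):
            p_up = 1.0 / (1.0 + np.exp(-2.0 * (J[i] @ s)))
            s[i] = 1.0 if rng.random() < p_up else -1.0
        if t >= burn:
            out[t - burn] = s
    return out

S = gibbs_samples(J_true, 3000)

def plm_fit(S, mask, lr=0.2, steps=400):
    """Gradient ascent on the pseudolikelihood, restricted to the
    active couplings given by `mask`."""
    M, n = S.shape
    J = np.zeros((n, n))
    for _ in range(steps):
        H = S @ J.T                       # local fields for every sample
        grad = ((S - np.tanh(H)).T @ S) / M
        np.fill_diagonal(grad, 0.0)
        J += lr * grad * mask
    return J

def pseudo_loglik(S, J):
    H = S @ J.T
    return np.mean(np.log(1.0 / (1.0 + np.exp(-2.0 * S * H))))

mask = 1.0 - np.eye(N)
J_hat = plm_fit(S, mask)
J_full = J_hat.copy()                     # undecimated estimate

# Decimation: repeatedly zero the weakest remaining coupling and refit.
# (The paper stops when the pseudolikelihood drops sharply; here we
# simply record the curve.)
curve = [pseudo_loglik(S, J_hat)]
for _ in range(10):
    active = np.abs(J_hat) * mask
    active[active == 0] = np.inf          # ignore decimated pairs
    i, j = np.unravel_index(np.argmin(active), active.shape)
    mask[i, j] = mask[j, i] = 0.0
    J_hat = plm_fit(S, mask)
    curve.append(pseudo_loglik(S, J_hat))
print("pseudolikelihood after each decimation step:", np.round(curve, 4))
```

As long as only spurious (near-zero) couplings are being decimated, the pseudolikelihood barely moves; the sharp drop once real couplings start being removed is what makes the stopping criterion automatic.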

4.
Phys Rev E ; 108(1-1): 014110, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37583157

ABSTRACT

Data sets in the real world are often complex and to some degree hierarchical, with groups and subgroups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these data sets is an important task that has many practical applications. To address this challenge, we present a general method for building relational data trees by exploiting the learning dynamics of the restricted Boltzmann machine. Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method on an artificially created hierarchical data set and on three different real-world data sets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.

5.
IEEE Trans Pattern Anal Mach Intell ; 44(12): 9119-9130, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34757901

ABSTRACT

A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of D-dimensional datapoints. In the particular case of manifold learning for ridge detection, we assume that the underlying structure can be modeled as a graph acting like a topological prior for the Gaussian clusters, turning the problem into a maximum a posteriori estimation. Parameters of the model are iteratively estimated through an Expectation-Maximization procedure, making the learning of the structure computationally efficient with guaranteed convergence for any graph prior in polynomial time. We also embed in the formalism a natural way to make the algorithm robust to outliers and to heteroscedasticity of the manifold sampling, coherently with the graph structure. The method uses a graph prior given by the minimum spanning tree, which we extend using random sub-samplings of the dataset to account for cycles that can be observed in the spatial distribution.
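The minimum-spanning-tree prior mentioned in the last sentence can be built directly with SciPy. This sketch constructs only the MST over a noisy circle (the EM fitting and the sub-sampling extension that recovers cycles are not reproduced; the data set is invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(2)

# Noisy circle in 2-D: a manifold whose principal graph contains a cycle.
theta = rng.uniform(0.0, 2.0 * np.pi, 100)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (100, 2))

# MST over pairwise Euclidean distances: the graph prior used to
# initialise the structure. Note the MST itself is always a tree, so the
# circle's cycle can only be recovered by aggregating MSTs over random
# sub-samples, as described in the abstract.
D = squareform(pdist(points))
mst = minimum_spanning_tree(D)

print("nodes:", len(points), "MST edges:", mst.nnz,
      "total length:", round(mst.sum(), 3))
```

A spanning tree over n points always has exactly n - 1 edges, which is why a plain MST prior cannot represent loops on its own.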

6.
Phys Rev Lett ; 107(6): 065701, 2011 Aug 05.
Article in English | MEDLINE | ID: mdl-21902340

ABSTRACT

We present an asymptotically exact analysis of the problem of detecting communities in sparse random networks generated by stochastic block models. Using the cavity method of statistical physics and its relationship to belief propagation, we unveil a phase transition from a regime where we can infer the correct group assignments of the nodes to one where these groups are undetectable. Our approach yields an optimal inference algorithm for detecting modules, including both assortative and disassortative functional modules, assessing their significance, and learning the parameters of the underlying block model. Our algorithm is scalable and applicable to real-world networks, as long as they are well described by the block model.
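The detectability transition referred to above has a closed form for the symmetric stochastic block model with q equal-sized groups; a small sketch of the condition (function and parameter names are ours):

```python
import math

def detectable(c_in, c_out, q=2):
    """Detectability condition for the symmetric stochastic block model
    with q equal-sized groups: communities can be inferred (e.g. by
    belief propagation) iff |c_in - c_out| > q * sqrt(c), where c is the
    average degree, c_in the mean within-group degree contribution and
    c_out the between-group one."""
    c = (c_in + (q - 1) * c_out) / q   # average degree
    return abs(c_in - c_out) > q * math.sqrt(c)

# Strongly assortative network: inside the detectable phase.
print(detectable(8.0, 2.0))   # c = 5,   |6| > 2*sqrt(5)   ~ 4.47
# Weakly assortative network: below the transition, undetectable.
print(detectable(5.0, 4.0))   # c = 4.5, |1| < 2*sqrt(4.5) ~ 4.24
```

Below the threshold no algorithm can find group assignments correlated with the truth, so the transition is a property of the inference problem itself, not of any particular method.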

7.
Phys Rev Lett ; 107(27): 275701, 2011 Dec 30.
Article in English | MEDLINE | ID: mdl-22243317

ABSTRACT

We study the probability distribution of the pseudocritical temperature in a mean-field and in a short-range spin-glass model: the Sherrington-Kirkpatrick and the Edwards-Anderson (EA) model. In both cases, we put in evidence the underlying connection between the fluctuations of the pseudocritical point and the extreme value statistics of random variables. For the Sherrington-Kirkpatrick model, both with Gaussian and binary couplings, the distribution of the pseudocritical temperature is found to be the Tracy-Widom distribution. For the EA model, the distribution is found to be the Gumbel distribution. Since the EA model is representative of uniaxial magnetic materials with quenched disorder, such as Fe(0.5)Mn(0.5)TiO(3) or Eu(0.5)Ba(0.5)MnO(3), its pseudocritical point distribution should in principle be experimentally accessible.

8.
Sci Rep ; 11(1): 19990, 2021 Oct 07.
Article in English | MEDLINE | ID: mdl-34620934

ABSTRACT

We propose an efficient algorithm to solve inverse problems in the presence of binary clustered datasets. We consider the paradigmatic Hopfield model in a teacher-student scenario, where this situation is found in the retrieval phase. This problem has been widely analyzed through various methods such as mean-field approaches or pseudolikelihood optimization. Our approach is based on the estimation of the posterior using the Thouless-Anderson-Palmer (TAP) equations in a parallel updating scheme. Unlike other methods, it allows one to retrieve the original patterns of the teacher dataset and, thanks to the parallel update, it can be applied to large system sizes. We tackle the same problem using a restricted Boltzmann machine (RBM) and discuss analogies and differences between our algorithm and RBM learning.
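The TAP equations themselves can be iterated in a few lines. This is not the paper's teacher-student posterior scheme; it is only a sketch of the damped parallel TAP update, checked on a toy Curie-Weiss ferromagnet where the fixed point is known (all sizes and parameters are ours):

```python
import numpy as np

def tap_magnetisations(J, beta, n_iter=500, damping=0.5):
    """Damped parallel iteration of the TAP equations
    m_i = tanh( beta * sum_j J_ij m_j
                - beta^2 * m_i * sum_j J_ij^2 (1 - m_j^2) ),
    where the second term is the Onsager reaction correction."""
    N = J.shape[0]
    m = np.full(N, 0.5)
    for _ in range(n_iter):
        onsager = beta**2 * m * (J**2 @ (1.0 - m**2))
        m_new = np.tanh(beta * (J @ m) - onsager)
        m = damping * m + (1.0 - damping) * m_new
    return m

# Sanity check on a fully connected ferromagnet (Curie-Weiss, J_ij = 1/N):
# below T_c (beta > 1) the N -> infinity magnetisation solves
# m = tanh(beta * m); for beta = 1.5 the nontrivial root is ~0.8586,
# and the finite-N iteration should land slightly below it.
N = 50
J = np.full((N, N), 1.0 / N)
np.fill_diagonal(J, 0.0)
m = tap_magnetisations(J, beta=1.5)
print("TAP magnetisation:", round(m.mean(), 3))
```

The parallel update touched on in the abstract is what makes this cheap for large systems: each sweep is a couple of matrix-vector products rather than a sequential pass over sites.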

9.
Phys Rev E ; 103(1-1): 012105, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33601533

ABSTRACT

We present a framework exploiting the cascade of phase transitions occurring during a simulated annealing of the expectation-maximization algorithm to cluster datasets with multiscale structures. Using the weighted local covariance, we can extract, a posteriori and without any prior knowledge, information on the number of clusters at different scales together with their size. We also study the linear stability of the iterative scheme to derive the threshold at which the first transition occurs and show how to approximate the next ones. Finally, we combine simulated annealing together with recent developments of regularized Gaussian mixture models to learn a principal graph from spatially structured datasets that can also exhibit many scales.
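A toy version of the annealing-driven cascade of transitions, using soft k-means (EM with a fixed responsibility temperature) on one-dimensional data: at low inverse temperature beta both centres collapse onto the global mean, and they split as beta is annealed past a critical value. The data, schedule, and jitter are all illustrative, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D data with two well-separated clusters at +/- 2.
x = np.concatenate([rng.normal(-2.0, 0.3, 200), rng.normal(2.0, 0.3, 200)])

def annealed_soft_kmeans(x, betas, mu_init, jitter=1e-3):
    """Soft k-means annealed over an increasing schedule of beta.
    A tiny jitter is added at each temperature so the symmetric
    (collapsed) solution can break once beta crosses the transition."""
    mu = np.array(mu_init, dtype=float)
    history = []
    for beta in betas:
        mu = mu + rng.normal(0.0, jitter, mu.shape)  # symmetry breaking
        for _ in range(50):                          # EM at this beta
            logits = -beta * (x[:, None] - mu[None, :]) ** 2
            logits -= logits.max(axis=1, keepdims=True)
            r = np.exp(logits)
            r /= r.sum(axis=1, keepdims=True)        # responsibilities
            mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        history.append((beta, mu.copy()))
    return mu, history

betas = np.geomspace(0.01, 5.0, 12)
mu, history = annealed_soft_kmeans(x, betas, mu_init=[-0.1, 0.1])
for beta, m in history:
    print(f"beta={beta:6.3f}  centres={np.round(m, 2)}")
```

Watching where in the schedule the centres separate is the a-posteriori scale information the abstract describes: each splitting event reveals a cluster scale present in the data.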

10.
Phys Rev Lett ; 104(12): 127206, 2010 Mar 26.
Article in English | MEDLINE | ID: mdl-20366564

ABSTRACT

We introduce a random energy model on a hierarchical lattice where the interaction strength between variables is a decreasing function of their mutual hierarchical distance, making it a non-mean-field model. Through a small-coupling series expansion and a direct numerical solution of the model, we provide evidence for a spin-glass condensation transition similar to the one occurring in the usual mean-field random energy model. In contrast with the mean-field case, the high-temperature branch of the free energy is nonanalytic at the transition point.

11.
Phys Rev E ; 94(1-1): 012112, 2016 Jul.
Article in English | MEDLINE | ID: mdl-27575082

ABSTRACT

In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, when many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the possibly clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of the Curie-Weiss and Hopfield models).

12.
Article in English | MEDLINE | ID: mdl-26066148

ABSTRACT

In this paper we study the inference of the kinetic Ising model on sparse graphs by the decimation method. The decimation method, first proposed in Decelle and Ricci-Tersenghi [Phys. Rev. Lett. 112, 070603 (2014)] for the static inverse Ising problem, tries to recover the topology of the inferred system by iteratively setting the weakest couplings to zero. During the decimation process the likelihood function is maximized over the remaining couplings. Unlike ℓ1-optimization-based methods, the decimation method does not use the Laplace distribution as a heuristic choice of prior to select a sparse solution. In our case, the whole process can be done automatically without fixing any parameters by hand. We show that in the dynamical inference problem, where the task is to reconstruct the couplings of an Ising model from the data, the decimation process can be incorporated naturally into a maximum-likelihood optimization algorithm, as opposed to the static case, where the pseudolikelihood method needs to be adopted. We also use extensive numerical studies to validate the accuracy of our methods on dynamical inference problems. Our results illustrate that, on various topologies and with different distributions of couplings, the decimation method outperforms the widely used ℓ1-optimization-based methods.

13.
Article in English | MEDLINE | ID: mdl-24730815

ABSTRACT

Renormalization group (RG) methods are still far from being completely understood in quenched disordered systems. In order to gain insight into the nature of the phase transition of these systems, it is common to investigate simple models. In this work we study a real-space RG transformation on the Dyson hierarchical lattice with a random field, which leads to a reconstruction of the RG flow and to an evaluation of the critical exponents of the model at T=0. We show that this method gives very accurate estimations of the critical exponents by comparing our results with those obtained by some of us using an independent method.

14.
Phys Rev E Stat Nonlin Soft Matter Phys ; 84(6 Pt 2): 066106, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22304154

ABSTRACT

In this paper we extend our previous work on the stochastic block model, a commonly used generative model for social and biological networks, and the problem of inferring functional groups or communities from the topology of the network. We use the cavity method of statistical physics to obtain an asymptotically exact analysis of the phase diagram. We describe in detail properties of the detectability-undetectability phase transition and the easy-hard phase transition for the community detection problem. Our analysis translates naturally into a belief propagation algorithm for inferring the group memberships of the nodes in an optimal way, i.e., that maximizes the overlap with the underlying group memberships, and learning the underlying parameters of the block model. Finally, we apply the algorithm to two examples of real-world networks and discuss its performance.

15.
Phys Rev E Stat Nonlin Soft Matter Phys ; 80(4 Pt 2): 046112, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19905395

ABSTRACT

We use a power grid model with M generators and N consumption units to optimize the grid and its control. Each consumer demand is drawn from a predefined finite-support distribution, thus simulating instantaneous load fluctuations. Each generator has a maximum power capability. A generator is not overloaded if the sum of the loads of the consumers connected to it does not exceed its maximum production. In the standard grid each consumer is connected only to its designated generator, while we consider a more general organization of the grid allowing each consumer to select one generator, depending on the load, from a predefined, consumer-dependent and sufficiently small set of generators which can all serve the load. The model grid is interconnected in a graph with loops, drawn from an ensemble of random bipartite graphs, while each allowed configuration of loaded links represents a set of graph covering trees. Losses, the reactive character of the grid, and the transmission-level connections between generators (and many other details relevant to realistic power grids) are ignored in this proof-of-principle study. We focus on the asymptotic limit N → ∞ with N/M → D = O(1) > 1, and we show that the interconnects allow significant expansion of the parameter domains for which the probability of a generator overload is asymptotically zero. Our construction explores the formal relation between the problem of grid optimization and the modern theory of sparse graphical models. We also design heuristic algorithms that achieve the asymptotically optimal selection of loaded links. We conclude by discussing the ability of this approach to include other effects, such as more realistic modeling of the power grid and related optimization and control algorithms.
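A hedged sketch of why interconnects help, using a greedy "choose the less-loaded of two allowed generators" rule in place of the paper's message-passing optimization. All numbers (M, N, the load distribution) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

M, N = 10, 200          # generators, consumers (D = N/M = 20)
loads = rng.uniform(0.5, 1.5, N)

# Designated generator plus one alternative per consumer, standing in for
# the "predefined, consumer-dependent and sufficiently small set".
primary = rng.integers(0, M, N)
backup = (primary + rng.integers(1, M, N)) % M   # guaranteed != primary

# Standard grid: every consumer on its designated generator only.
fixed = np.zeros(M)
for g, w in zip(primary, loads):
    fixed[g] += w

# Interconnected grid: greedily send each consumer to whichever of its
# two allowed generators currently carries less load.
flex = np.zeros(M)
for p, b, w in zip(primary, backup, loads):
    g = p if flex[p] <= flex[b] else b
    flex[g] += w

print("max generator load, standard grid:     ", round(fixed.max(), 2))
print("max generator load, with interconnects:", round(flex.max(), 2))
```

Even this naive two-choice rule flattens the load fluctuations across generators, which is the mechanism behind the expanded no-overload parameter domain; the paper's algorithms go further by selecting loaded links asymptotically optimally.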


Subject(s)
Algorithms , Electric Power Supplies , Electricity , Information Storage and Retrieval/methods , Models, Theoretical , Computer Simulation