ABSTRACT
MOTIVATION: The Wright-Fisher diffusion is widely used in population genetics to model the evolution of allele frequencies over time under the influence of biological phenomena such as selection, mutation and genetic drift. Simulating paths of the process is challenging due to the form of the transition density. We present EWF, a robust and efficient sampler which returns exact draws for the diffusion and diffusion bridge processes, accounting for general models of selection including those with frequency dependence. RESULTS: Given a configuration of selection, mutation and endpoints, EWF returns draws at the requested sampling times from the law of the corresponding Wright-Fisher process. Output was validated by comparison to approximations of the transition density via the Kolmogorov-Smirnov test and QQ plots. AVAILABILITY AND IMPLEMENTATION: All software is available at https://github.com/JaroSant/EWF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
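The Kolmogorov-Smirnov comparison mentioned above can be illustrated with a minimal sketch. It is not the EWF validation code: the reference distribution here is a stand-in uniform CDF rather than an approximation of the Wright-Fisher transition density, and the names `ks_statistic` and `uniform_cdf` are hypothetical.

```python
import random

def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_n = sup_x |F_n(x) - F(x)|."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

def uniform_cdf(x):
    """CDF of the Uniform(0, 1) distribution (stand-in reference law)."""
    return min(1.0, max(0.0, x))

rng = random.Random(1)
draws = [rng.random() for _ in range(2000)]

# Draws from the reference law give a small statistic ...
d_good = ks_statistic(draws, uniform_cdf)
# ... whereas draws from a different law (squared uniforms) give a large one.
d_bad = ks_statistic([u * u for u in draws], uniform_cdf)
```

A small statistic indicates that the sampler's output is consistent with the reference distribution; a formal test would compare it against the Kolmogorov distribution to obtain a p-value.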
Subjects
Population Genetics, Genetic Models, Gene Frequency, Genetic Drift, Mutation, Genetic Selection
ABSTRACT
Mathematical models of genetic evolution often come in pairs, connected by a so-called duality relation. The seminal example is the pair formed by the Wright-Fisher diffusion and the Kingman coalescent, where the former describes the stochastic evolution of neutral allele frequencies in a large population forwards in time, and the latter describes the genetic ancestry of randomly sampled individuals from the population backwards in time. As well as providing a richer description than either model in isolation, duality often yields equations satisfied by quantities of interest. We employ the so-called Bernoulli factory - a celebrated tool in simulation-based computing - to derive duality relations for broad classes of genetics models. As concrete examples, we present Wright-Fisher diffusions with general drift functions, and Allen-Cahn equations with general, nonlinear forcing terms. The drift and forcing functions can be interpreted as the action of frequency-dependent selection. To our knowledge, this work is the first time a connection has been drawn between Bernoulli factories and duality in models of population genetics.
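For readers unfamiliar with the term, a Bernoulli factory is an algorithm that turns flips of a coin with unknown success probability p into a flip of a coin with success probability f(p), without ever learning p. The sketch below shows the classical von Neumann construction, for which f(p) = 1/2. It is only an illustration of the concept, not the construction used in the paper, and the names `von_neumann_fair_coin` and `biased` are hypothetical.

```python
import random

def von_neumann_fair_coin(flip):
    """Return a fair bit using only calls to flip(), a coin of unknown bias."""
    while True:
        a, b = flip(), flip()
        # Outcomes (heads, tails) and (tails, heads) each have probability
        # p * (1 - p), so conditioning on a != b yields a fair coin.
        if a != b:
            return a

rng = random.Random(42)
p_unknown = 0.8  # hidden from the factory; used only to build the input coin

def biased():
    """Input coin: returns True with probability p_unknown."""
    return rng.random() < p_unknown

n = 20000
freq = sum(von_neumann_fair_coin(biased) for _ in range(n)) / n
```

Here `freq` should be close to 0.5 even though the input coin lands heads 80% of the time.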
Subjects
Genetic Drift, Genetic Models, Humans, Population Genetics, Gene Frequency, Computer Simulation, Genetic Selection
ABSTRACT
We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
Subjects
Genetic Models, Software, Algorithms, Bayes Theorem, Markov Chains, Phylogeny, Genetic Recombination/genetics
ABSTRACT
We derive statistical tools to analyze the patterns of genetic variability produced by models related to seed banks; in particular the Kingman coalescent, its time-changed counterpart describing so-called weak seed banks, the strong seed bank coalescent, and the two-island structured coalescent. As (strong) seed banks stratify a population, we expect them to produce a signal comparable to population structure. We present tractable formulas for Wright's FST and the expected site frequency spectrum for these models, and show that they can distinguish between some models for certain ranges of parameters. We then use pseudo-marginal MCMC to show that the full likelihood can reliably distinguish between all models in the presence of parameter uncertainty under moderate stratification, and point out statistical pitfalls arising from stratification that is either too strong or too weak. We further show that it is possible to infer parameters, and in particular determine whether mutation is taking place in the (strong) seed bank.
Subjects
Genetic Models, Seed Bank, Mutation, Probability
ABSTRACT
We introduce a low dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these model classes. The null and alternative sampling distributions of the statistic are intractable, but its low dimensionality renders them amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. A key reason for this improvement is the use of multi-locus data, in particular averaging observed site frequency spectra across unlinked loci to reduce sampling variance. We also demonstrate the robustness of our method to nuisance and tuning parameters. Finally we show that the same kernel density estimates can be used to conduct parameter estimation, and argue that our method is readily generalisable for applications in model selection, parameter inference and experimental design.
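The averaging of site frequency spectra across unlinked loci described above can be sketched as follows. This is a minimal illustration assuming each locus is given as a list of 0/1 haplotypes (1 marking the derived allele); the function names and toy data are hypothetical, and the specific test statistic built from the spectrum is not reproduced here.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS of one locus.

    haplotypes: list of n equal-length 0/1 lists (1 = derived allele).
    Returns a list whose entry i - 1 counts the sites at which exactly
    i of the n samples carry the derived allele, for i = 1, ..., n - 1.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):  # iterate over columns, i.e. sites
        k = sum(site)
        if 0 < k < n:              # skip monomorphic sites
            sfs[k - 1] += 1
    return sfs

def mean_sfs(loci):
    """Entrywise average of the per-locus spectra (same sample size n)."""
    spectra = [site_frequency_spectrum(h) for h in loci]
    return [sum(col) / len(spectra) for col in zip(*spectra)]

# Toy data: two loci, four sampled haplotypes each.
locus_a = [[1, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
locus_b = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
averaged = mean_sfs([locus_a, locus_b])
```

Averaging over independent loci reduces the sampling variance of each entry, which is the effect the abstract credits for the gain in power.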
Subjects
Statistical Models, Mutation Rate, Population Growth, Diploidy, Likelihood Functions, Monte Carlo Method, Population Density
ABSTRACT
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Subjects
Genetic Models, Genetic Recombination, Molecular Evolution, Software, Genome, Humans
ABSTRACT
Highly fecund natural populations characterized by high early mortality abound, yet our knowledge about their recruitment dynamics is somewhat rudimentary. This knowledge gap has implications for our understanding of genetic variation, population connectivity, local adaptation, and the resilience of highly fecund populations. The concept of sweepstakes reproductive success, which posits a considerable variance and skew in individual reproductive output, is key to understanding the distribution of individual reproductive success. However, it still needs to be determined whether highly fecund organisms reproduce through sweepstakes and, if they do, the relative roles of neutral and selective sweepstakes. Here, we use coalescent-based statistical analysis of population genomic data to show that selective sweepstakes likely explain recruitment dynamics in the highly fecund Atlantic cod. We show that the Kingman coalescent (modelling no sweepstakes) and the Xi-Beta coalescent (modelling random sweepstakes), including complex demography and background selection, do not provide an adequate fit for the data. The Durrett-Schweinsberg coalescent, in which selective sweepstakes result from recurrent and pervasive selective sweeps of new mutations, offers greater explanatory power. Our results show that models of sweepstakes reproduction and multiple-merger coalescents are relevant and necessary for understanding genetic diversity in highly fecund natural populations. These findings have fundamental implications for understanding the recruitment variation of fish stocks and general evolutionary genomics of high-fecundity organisms.
Subjects
Fertility, Reproduction, Animals, Reproduction/genetics, Fertility/genetics, Acclimatization, Biological Evolution, Genomics
ABSTRACT
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). New developments have made it possible to infer ARGs at scale, enabling many new applications in population and statistical genetics. This rapid progress, however, has led to a substantial gap opening between theory and practice. Standard mathematical formalisms, based on exhaustively detailing the "events" that occur in the history of a sample, are insufficient to describe the outputs of current methods. Moreover, we argue that the underlying assumption that all events can be known and precisely estimated is fundamentally unsuited to the realities of modern, population-scale datasets. We propose an alternative mathematical formulation that encompasses the outputs of recent methods and can capture the full richness of modern large-scale datasets. By defining this ARG encoding in terms of specific genomes and their intervals of genetic inheritance, we avoid the need to exhaustively list (and estimate) all events. The effects of multiple events can be aggregated in different ways, providing a natural way to express many forms of approximate and partial knowledge about the recombinant ancestry of a sample.
ABSTRACT
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Subjects
Algorithms, Genetic Models, Computer Simulation, Population Genetics, Mutation, Software
ABSTRACT
We study the effect of biological confounders on the model selection problem between Kingman coalescents with population growth and Ξ-coalescents involving simultaneous multiple mergers. We use a low dimensional, computationally tractable summary statistic, dubbed the singleton-tail statistic, to carry out approximate likelihood ratio tests between these model classes. The singleton-tail statistic has been shown to distinguish between them with high power in the simple setting of neutrally evolving, panmictic populations without recombination. We extend this work by showing that cryptic recombination and selection do not diminish the power of the test, but that misspecifying population structure does. Furthermore, we demonstrate that the singleton-tail statistic can also address the more challenging model selection problem between multiple mergers due to selective sweeps and multiple mergers due to high fecundity, with moderate power of up to 30%.