ABSTRACT
We study the time to the most recent common ancestor (TMRCA) of a sample of finite size in a wide class of genealogical models for populations with variable size. This is made possible by recently developed results on inhomogeneous phase-type random variables, allowing us to obtain the density and the moments of the TMRCA of time-dependent coalescent processes in terms of matrix formulas. We also provide matrix simplifications permitting a more straightforward calculation. With these results, the TMRCA provides an explanatory variable to distinguish different evolutionary scenarios.
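As an elementary illustration of how a variable population size enters the TMRCA distribution, the sketch below treats only the standard (Kingman) coalescent, where a variable-size genealogy is a deterministic time change of the constant-size one. It is not the matrix formula described in the abstract. The function names, the piecewise-constant size history, and the time scale (pairs of lineages coalesce at rate 1/nu(t)) are assumptions made for this example.

```python
import math

def cumulative_intensity(t, breakpoints, sizes):
    """Integrate 1/nu(s) from 0 to t for a piecewise-constant relative size nu.

    sizes[i] applies on [breakpoints[i], breakpoints[i + 1]); the last size
    extends to infinity. breakpoints[0] is assumed to be 0.
    """
    total = 0.0
    for i, size in enumerate(sizes):
        start = breakpoints[i]
        end = breakpoints[i + 1] if i + 1 < len(breakpoints) else math.inf
        if t <= start:
            break
        total += (min(t, end) - start) / size
    return total

def tmrca_survival(t, n, breakpoints=(0.0,), sizes=(1.0,)):
    """P(TMRCA > t) for a sample of size n under the time-changed Kingman coalescent.

    With constant size, the TMRCA is a sum of independent exponentials with
    distinct rates k(k-1)/2 for k = 2..n, so its survival function is a
    weighted sum of exponentials. Variable size replaces t by the cumulative
    coalescent intensity.
    """
    lam = cumulative_intensity(t, breakpoints, sizes)
    rates = [k * (k - 1) / 2 for k in range(2, n + 1)]
    survival = 0.0
    for rate_k in rates:
        coeff = 1.0
        for rate_j in rates:
            if rate_j != rate_k:
                coeff *= rate_j / (rate_j - rate_k)
        survival += coeff * math.exp(-rate_k * lam)
    return survival
```

The alternating sum becomes numerically unstable as n grows, so this form is only suitable for small samples; matrix-based formulations avoid that limitation.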
ABSTRACT
Uncovering the fundamental processes that shape genomic variation in natural populations is a primary objective of population genetics. These processes include demographic effects such as past changes in effective population size or gene flow between structured populations. Furthermore, genomic variation is affected by selection on nonneutral genetic variants, for example, through the adaptation of beneficial alleles or balancing selection that maintains genetic variation. In this article, we discuss the characterization of these processes using population genetic models, and we review methods developed on the basis of these models to unravel the underlying processes from modern population genomic data sets. We briefly discuss the conditions in which these approaches can be used to infer demography or identify specific nonneutral genetic variants and cases in which caution is warranted. Moreover, we summarize the challenges of jointly inferring demography and selective processes that affect neutral variation genome-wide.
ABSTRACT
Detecting and quantifying the strength of selection is a main objective in population genetics. Since selection acts over multiple generations, many approaches have been developed to detect and quantify selection using genetic data sampled at multiple points in time. Such time series genetic data is commonly analyzed using Hidden Markov Models, but in most cases, under the assumption of additive selection. However, many examples of genetic variation exhibiting non-additive mechanisms exist, making it critical to develop methods that can characterize selection in more general scenarios. Thus, we extend a previously introduced expectation-maximization algorithm for the inference of additive selection coefficients to the case of general diploid selection, in which heterozygote and homozygote fitnesses are parameterized independently. We furthermore introduce a framework to identify bespoke modes of diploid selection from given data, as well as a procedure for aggregating data across linked loci to increase power and robustness. Using extensive simulation studies, we find that our method accurately and efficiently estimates selection coefficients for different modes of diploid selection across a wide range of scenarios; however, power to classify the mode of selection is low unless selection is very strong. We apply our method to ancient DNA samples from Great Britain in the last 4,450 years, and detect evidence for selection in six genomic regions, including the well-characterized LCT locus. Our work is the first genome-wide scan characterizing signals of general diploid selection.
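The independent parameterization of heterozygote and homozygote fitness can be made concrete with the textbook deterministic allele-frequency recursion. The sketch below ignores genetic drift and is not the HMM or expectation-maximization procedure from the abstract. The function names and the fitness convention (aa = 1, Aa = 1 + s_het, AA = 1 + s_hom) are assumptions chosen for illustration.

```python
def next_frequency(p, s_het, s_hom):
    """One generation of deterministic change in the frequency p of allele A.

    Assumed relative fitnesses: aa = 1, Aa = 1 + s_het, AA = 1 + s_hom.
    """
    q = 1.0 - p
    # Mean fitness: p^2 (1 + s_hom) + 2pq (1 + s_het) + q^2
    mean_fitness = 1.0 + 2.0 * p * q * s_het + p * p * s_hom
    return (p * p * (1.0 + s_hom) + p * q * (1.0 + s_het)) / mean_fitness

def trajectory(p0, s_het, s_hom, generations):
    """Iterate the recursion and return the list of frequencies, including p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s_het, s_hom))
    return freqs
```

Under this convention, additive selection corresponds to s_het = s_hom / 2, full dominance to s_het = s_hom, recessive selection to s_het = 0, and heterozygote advantage to s_het > max(0, s_hom).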
ABSTRACT
Summary: Whole-genome time-series allele frequency data are becoming more prevalent as ancient DNA (aDNA) sequences and data from evolve-and-resequence (E&R) experiments are generated at a rapid pace. Such data present unprecedented opportunities to elucidate the dynamics of adaptive genetic variation. However, although many methods to infer parameters of selection models from allele frequency trajectories are available in the literature, few provide user-friendly implementations for large-scale empirical applications. Here, we present diplo-locus, an open-source Python package that provides functionality to simulate and perform inference from time-series data under the Wright-Fisher diffusion with general diploid selection. The package includes Python modules as well as command-line tools. Availability: Python package and command-line tool available at: https://github.com/steinrue/diplo_locus or https://pypi.org/project/diplo-locus/.
ABSTRACT
Barton et al. [1] raise several statistical concerns regarding our original analyses [2] that highlight the challenge of inferring natural selection using ancient genomic data. We show here that these concerns have limited impact on our original conclusions. Specifically, we recover the same signature of enrichment for high FST values at the immune loci relative to putatively neutral sites after switching the allele frequency estimation method to a maximum likelihood approach, filtering to only consider known human variants, and down-sampling our data to the same mean coverage across sites. Furthermore, using permutations, we show that the rs2549794 variant near ERAP2 continues to emerge as the strongest candidate for selection (p = 1.2 × 10⁻⁵), falling below the Bonferroni-corrected significance threshold recommended by Barton et al. Importantly, the evidence for selection on ERAP2 is further supported by functional data demonstrating the impact of the ERAP2 genotype on the immune response to Y. pestis and by epidemiological data from an independent group showing that the putatively selected allele during the Black Death protects against severe respiratory infection in contemporary populations.
ABSTRACT
Infectious diseases are among the strongest selective pressures driving human evolution [1,2]. This includes the single greatest mortality event in recorded history, the first outbreak of the second pandemic of plague, commonly called the Black Death, which was caused by the bacterium Yersinia pestis [3]. This pandemic devastated Afro-Eurasia, killing up to 30-50% of the population [4]. To identify loci that may have been under selection during the Black Death, we characterized genetic variation around immune-related genes from 206 ancient DNA extracts, stemming from two different European populations before, during and after the Black Death. Immune loci are strongly enriched for highly differentiated sites relative to a set of non-immune loci, suggesting positive selection. We identify 245 variants that are highly differentiated within the London dataset, four of which were replicated in an independent cohort from Denmark, and represent the strongest candidates for positive selection. The selected allele for one of these variants, rs2549794, is associated with the production of a full-length (versus truncated) ERAP2 transcript, variation in cytokine response to Y. pestis and increased ability to control intracellular Y. pestis in macrophages. Finally, we show that protective variants overlap with alleles that are today associated with increased susceptibility to autoimmune diseases, providing empirical evidence for the role played by past pandemics in shaping present-day susceptibility to disease.
Subject(s)
Ancient DNA, Genetic Predisposition to Disease, Immunity, Plague, Genetic Selection, Yersinia pestis, Humans, Aminopeptidases/genetics, Aminopeptidases/immunology, Plague/genetics, Plague/immunology, Plague/microbiology, Plague/mortality, Yersinia pestis/immunology, Yersinia pestis/pathogenicity, Genetic Selection/immunology, Europe/epidemiology, Europe/ethnology, Immunity/genetics, Datasets as Topic, London/epidemiology, Denmark/epidemiology
ABSTRACT
Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest, but is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. An important class of tools for inferring past population size changes from genomic sequence data are Coalescent Hidden Markov Models (CHMMs). These models make efficient use of the linkage information in population genomic datasets by using the local genealogies relating sampled individuals as latent states that evolve along the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly. Here, we present our method CHIMP (CHMM History-Inference Maximum-Likelihood Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and only requires unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving certain systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme to allow the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods. 
Specifically, our implementation using TMRCA as the latent variable shows comparable performance and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high quality data is not available, and has potential applications for pseudo-haploid data.
Subject(s)
Population Genetics, Genetic Models, Algorithms, Computer Simulation, Genomics, Humans, Markov Chains, Population Density
ABSTRACT
Polygenic scores link the genotypes of ancient individuals to their phenotypes, which are often unobservable, offering a tantalizing opportunity to reconstruct complex trait evolution. In practice, however, interpretation of ancient polygenic scores is subject to numerous assumptions. For one, the genome-wide association (GWA) studies from which polygenic scores are derived, can only estimate effect sizes for loci segregating in contemporary populations. Therefore, a GWA study may not correctly identify all loci relevant to trait variation in the ancient population. In addition, the frequencies of trait-associated loci may have changed in the intervening years. Here, we devise a theoretical framework to quantify the effect of this allelic turnover on the statistical properties of polygenic scores as functions of population genetic dynamics, trait architecture, power to detect significant loci, and the age of the ancient sample. We model the allele frequencies of loci underlying trait variation using the Wright-Fisher diffusion, and employ the spectral representation of its transition density to find analytical expressions for several error metrics, including the expected sample correlation between the polygenic scores of ancient individuals and their true phenotypes, referred to as polygenic score accuracy. Our theory also applies to a two-population scenario and demonstrates that allelic turnover alone may explain a substantial percentage of the reduced accuracy observed in cross-population predictions, akin to those performed in human genetics. Finally, we use simulations to explore the effects of recent directional selection, a bias-inducing process, on the statistics of interest. We find that even in the presence of bias, weak selection induces minimal deviations from our neutral expectations for the decay of polygenic score accuracy. 
By quantifying the limitations of polygenic scores in an explicit evolutionary context, our work lays the foundation for the development of more sophisticated statistical procedures to analyze both temporally and geographically resolved polygenic scores.
Subject(s)
Genome-Wide Association Study, Multifactorial Inheritance, Alleles, Gene Frequency/genetics, Genome-Wide Association Study/methods, Genetic Models, Multifactorial Inheritance/genetics, Genetic Selection
ABSTRACT
Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10⁻⁴ per locus per generation. Recurrent inversions exhibit a sex-chromosomal bias and co-localize with genomic disorder critical regions. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes specific haplotypes to disease-causing CNVs.
Subject(s)
Chromosome Inversion, Genomic Segmental Duplications, Chromosome Inversion/genetics, DNA Copy Number Variations/genetics, Human Genome, Genomics, Humans
ABSTRACT
Archeogenetics has been revolutionary, revealing insights into demographic history and recent positive selection. However, most studies to date have ignored the nonrandom association of genetic variants at different loci (i.e. linkage disequilibrium). This may be in part because basic properties of linkage disequilibrium in samples from different times are still not well understood. Here, we derive several results for summary statistics of haplotypic variation under a model with time-stratified sampling: (1) The correlation between the number of pairwise differences observed between time-staggered samples (πΔt) in models with and without strict population continuity; (2) The product of the linkage disequilibrium coefficient, D, between ancient and modern samples, which is a measure of haplotypic similarity between modern and ancient samples; and (3) The expected switch rate in the Li and Stephens haplotype copying model. The latter has implications for genotype imputation and phasing in ancient samples with modern reference panels. Overall, these results provide a characterization of how haplotype patterns are affected by sample age, recombination rates, and population sizes. We expect these results will help guide the interpretation and analysis of haplotype data from ancient and modern samples.
Subject(s)
Archaeology/methods, Population Genetics/methods, Genotype, Haplotypes, Humans, Linkage Disequilibrium, Population Density
ABSTRACT
Natural selection on beneficial or deleterious alleles results in an increase or decrease, respectively, of their frequency within the population. Due to chromosomal linkage, the dynamics of the selected site affect the genetic variation at nearby neutral loci in a process commonly referred to as genetic hitchhiking. Changes in population size, however, can yield patterns in genomic data that mimic the effects of selection. Accurately modeling these dynamics is thus crucial to understanding how selection and past population size changes impact observed patterns of genetic variation. Here, we model the evolution of haplotype frequencies with the Wright-Fisher diffusion to study the impact of selection on linked neutral variation. Explicit solutions are not known for the dynamics of this diffusion when selection and recombination act simultaneously. Thus, we present a method for numerically evaluating the Wright-Fisher diffusion dynamics of 2 linked loci separated by a certain recombination distance when selection is acting. We can account for arbitrary population size histories explicitly using this approach. A key step in the method is to express the moments of the associated transition density, or sampling probabilities, as solutions to ordinary differential equations. Numerically solving these differential equations relies on a novel accurate and numerically efficient technique to estimate higher order moments from lower order moments. We demonstrate how this numerical framework can be used to quantify the reduction and recovery of genetic diversity around a selected locus over time and elucidate distortions in the site-frequency-spectra of neutral variation linked to loci under selection in various demographic settings. The method can be readily extended to more general modes of selection and applied in likelihood frameworks to detect loci under selection and infer the strength of the selective pressure.
Subject(s)
Genetic Models, Genetic Selection, Alleles, Genetic Linkage, Genetic Variation, Population Genetics
ABSTRACT
Parental relatedness of present-day humans varies substantially across the globe, but little is known about the past. Here we analyze ancient DNA, leveraging that parental relatedness leaves genomic traces in the form of runs of homozygosity. We present an approach to identify such runs in low-coverage ancient DNA data aided by haplotype information from a modern phased reference panel. Simulation and experiments show that this method robustly detects runs of homozygosity longer than 4 centimorgans for ancient individuals with at least 0.3× coverage. Analyzing genomic data from 1,785 ancient humans who lived in the last 45,000 years, we detect low rates of first cousin or closer unions across most ancient populations. Moreover, we find a marked decay in background parental relatedness co-occurring with or shortly after the advent of sedentary agriculture. We observe this signal, likely linked to increasing local population sizes, across several geographic transects worldwide.
Subject(s)
Ancient DNA/analysis, Human Genome, Haplotypes, Homozygote, Inheritance Patterns, Population Dynamics/history, Agriculture/history, Female, Ancient History, Humans, Male
ABSTRACT
Analyzing ancient DNA of the central Andes, Ringbauer and colleagues identify a markedly elevated rate of unions of closely related parents after ca. 1000 CE. This change of mating preferences sheds new light on a unique system of social organization based on ancestry ("ayllu") whereby within-group unions were preferred to facilitate sharing of resources.
Subject(s)
Ancient DNA/analysis, Inbreeding/history, Inbreeding/methods, Reproduction, Ancient History, Medieval History, Humans, South America
ABSTRACT
There has been much interest in analyzing genome-scale DNA sequence data to infer population histories, but inference methods developed hitherto are limited in model complexity and computational scalability. Here we present an efficient, flexible statistical method, diCal2, that can use whole-genome sequence data from multiple populations to infer complex demographic models involving population size changes, population splits, admixture, and migration. Applying our method to data from Australian, East Asian, European, and Papuan populations, we find that the population ancestral to Australians and Papuans started separating from East Asians and Europeans about 100,000 y ago, and that the separation of East Asians and Europeans started about 50,000 y ago, with pervasive gene flow between all pairs of populations.
Subject(s)
Gene Flow, Genome-Wide Association Study, Human Migration, Genetic Models, Native Hawaiians and Other Pacific Islanders/genetics, Whole Genome Sequencing, Australia, Population Genetics, Ancient History, Humans, Native Hawaiians and Other Pacific Islanders/history
ABSTRACT
Studying how diverse human populations are related is of historical and anthropological interest, in addition to providing a realistic null model for testing for signatures of natural selection or disease associations. Furthermore, understanding the demographic histories of other species is playing an increasingly important role in conservation genetics. A number of statistical methods have been developed to infer population demographic histories using whole-genome sequence data, with recent advances focusing on allowing for more flexible modeling choices, scaling to larger data sets, and increasing statistical power. Here we review coalescent hidden Markov models, a powerful class of population genetic inference methods that can utilize linkage disequilibrium information effectively. We highlight recent advances, give advice for practitioners, point out potential pitfalls, and present possible future research directions.
Subject(s)
Molecular Evolution, Population Genetics, Genetic Selection/genetics, Human Genome/genetics, Humans, Markov Chains, Whole Genome Sequencing
ABSTRACT
Genetic evidence has revealed that the ancestors of modern human populations outside Africa and their hominin sister groups, notably Neanderthals, exchanged genetic material in the past. The distribution of these introgressed sequence tracts along modern-day human genomes provides insight into the selective forces acting on them and the role of introgression in the evolutionary history of hominins. Studying introgression patterns on the X-chromosome is of particular interest, as sex chromosomes are thought to play a special role in speciation. Recent studies have developed methods to localize introgressed ancestries, reporting long regions that are depleted of Neanderthal introgression and enriched in genes, suggesting negative selection against the Neanderthal variants. On the other hand, enriched Neanderthal ancestry in hair- and skin-related genes suggests that some introgressed variants facilitated adaptation to new environments. Here, we present a model-based introgression detection method called dical-admix. We demonstrate its efficiency and accuracy through extensive simulations and apply it to detect tracts of Neanderthal introgression in modern human individuals from the 1000 Genomes Project. Our findings are largely concordant with previous studies, consistent with weak selection against Neanderthal ancestry. We find evidence that selection against Neanderthal ancestry was due to higher genetic load in Neanderthals resulting from small effective population size, rather than widespread Dobzhansky-Müller incompatibilities (DMIs) that could contribute to reproductive isolation. Moreover, we confirm the previously reported low level of introgression on the X-chromosome, but find little evidence that DMIs contributed to this pattern.
Subject(s)
Population Genetics, Human Genome, Genetic Models, Neanderthals/genetics, Animals, Human X Chromosomes/genetics, Computer Simulation, Genetic Load, Humans, Genetic Hybridization, Markov Chains, Population Density, Genetic Selection
ABSTRACT
Despite broad agreement that the Americas were initially populated via Beringia, the land bridge that connected far northeast Asia with northwestern North America during the Pleistocene epoch, when and how the peopling of the Americas occurred remains unresolved. Analyses of human remains from Late Pleistocene Alaska are important to resolving the timing and dispersal of these populations. The remains of two infants were recovered at Upward Sun River (USR), and have been dated to around 11.5 thousand years ago (ka). Here, by sequencing the USR1 genome to an average coverage of approximately 17 times, we show that USR1 is most closely related to Native Americans, but falls basal to all previously sequenced contemporary and ancient Native Americans. As such, USR1 represents a distinct Ancient Beringian population. Using demographic modelling, we infer that the Ancient Beringian population and ancestors of other Native Americans descended from a single founding population that initially split from East Asians around 36 ± 1.5 ka, with gene flow persisting until around 25 ± 1.1 ka. Gene flow from ancient north Eurasians into all Native Americans took place 25-20 ka, with Ancient Beringians branching off around 22-18.1 ka. Our findings support a long-term genetic structure in ancestral Native Americans, consistent with the Beringian 'standstill model'. We show that the basal northern and southern Native American branches, to which all other Native Americans belong, diverged around 17.5-14.6 ka, and that this probably occurred south of the North American ice sheets. We also show that after 11.5 ka, some of the northern Native American populations received gene flow from a Siberian population most closely related to Koryaks, but not Palaeo-Eskimos, Inuits or Kets, and that Native American gene flow into Inuits was through northern and not southern Native American groups. 
Our findings further suggest that the far-northern North American presence of northern Native Americans is from a back migration that replaced or absorbed the initial founding population of Ancient Beringians.
Subject(s)
Founder Effect, Human Genome/genetics, Indigenous North Americans/genetics, Genetic Models, Phylogeny, Alaska, East Asia/ethnology, Gene Flow, Population Genetics, Ancient History, Human Migration, Humans, Infant, Rivers, Siberia/ethnology, Time Factors
ABSTRACT
In recent years, a number of methods have been developed to infer complex demographic histories, especially historical population size changes, from genomic sequence data. Coalescent Hidden Markov Models have proven to be particularly useful for this type of inference. Due to the Markovian structure of these models, an essential building block is the joint distribution of local genealogical trees, or statistics of these genealogies, at two neighboring loci in populations of variable size. Here, we present a novel method to compute the marginal and the joint distribution of the total length of the genealogical trees at two loci separated by at most one recombination event for samples of arbitrary size. To our knowledge, no method to compute these distributions has been presented in the literature to date. We show that they can be obtained from the solution of certain hyperbolic systems of partial differential equations. We present a numerical algorithm, based on the method of characteristics, that can be used to efficiently and accurately solve these systems and compute the marginal and the joint distributions. We demonstrate its utility to study the properties of the joint distribution. Our flexible method can be straightforwardly extended to handle an arbitrary fixed number of recombination events, to include the distributions of other statistics of the genealogies as well, and can also be applied in structured populations.
Subject(s)
Pedigree, Population Density, Humans, Markov Chains, Genetic Recombination
ABSTRACT
Many approaches have been developed for inferring selection coefficients from time series data while accounting for genetic drift. These approaches have been motivated by the intuition that properly accounting for the population size history can significantly improve estimates of selective strengths. However, the improvement in inference accuracy that can be attained by modeling drift has not been characterized. Here, by comparing maximum likelihood estimates of selection coefficients that account for the true population size history with estimates that ignore drift by assuming allele frequencies evolve deterministically in a population of infinite size, we address the following questions: how much can modeling the population size history improve estimates of selection coefficients? How much can mis-inferred population sizes hurt inferences of selection coefficients? We conduct our analysis under the discrete Wright-Fisher model by deriving the exact probability of an allele frequency trajectory in a population of time-varying size and we replicate our results under the diffusion model. For both models, we find that ignoring drift leads to estimates of selection coefficients that are nearly as accurate as estimates that account for the true population history, even when population sizes are small and drift is high. This result is of interest because inference methods that ignore drift are widely used in evolutionary studies and can be many orders of magnitude faster than methods that account for population sizes.
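The exact probability of an allele frequency trajectory in a discrete Wright-Fisher population of time-varying size can be sketched as a product of binomial transition probabilities. The code below is a minimal illustration under stated assumptions, not the authors' derivation: it assumes genic selection (fitness 1 + s for the focal allele), a fully observed trajectory of allele counts, and `sizes` given as the number of gene copies per generation. All function names are hypothetical.

```python
import math

def selected_frequency(p, s):
    """Expected frequency after selection, assuming genic fitness 1 + s for the focal allele."""
    return p * (1.0 + s) / (1.0 + s * p)

def transition_log_prob(count_next, size_next, p_current, s):
    """Log binomial probability of count_next copies among size_next sampled gene copies."""
    p_star = selected_frequency(p_current, s)
    if p_star in (0.0, 1.0):
        # A lost or fixed allele stays lost or fixed (no mutation in this sketch).
        expected = size_next if p_star == 1.0 else 0
        return 0.0 if count_next == expected else -math.inf
    return (math.log(math.comb(size_next, count_next))
            + count_next * math.log(p_star)
            + (size_next - count_next) * math.log(1.0 - p_star))

def trajectory_log_likelihood(counts, sizes, s):
    """Sum the log transition probabilities along an observed trajectory of allele counts."""
    total = 0.0
    for t in range(len(counts) - 1):
        p_current = counts[t] / sizes[t]
        total += transition_log_prob(counts[t + 1], sizes[t + 1], p_current, s)
    return total
```

Evaluating `trajectory_log_likelihood` over a grid of values of s and taking the maximizer gives a simple drift-aware estimate, which could then be compared with an estimate obtained by fitting the deterministic recursion alone.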
Subject(s)
Population Genetics/methods, Genetic Models, Genetic Selection, Biological Evolution, Computer Simulation, Gene Frequency, Genetic Drift, Likelihood Functions, Population Density
ABSTRACT
MOTIVATION: In the Wright-Fisher diffusion, the transition density function describes the time evolution of the population-wide frequency of an allele. This function has several practical applications in population genetics and computing it for biologically realistic scenarios with selection and demography is an important problem.
RESULTS: We develop an efficient method for finding a spectral representation of the transition density function for a general model where the effective population size, selection coefficients and mutation parameters vary over time in a piecewise constant manner.
AVAILABILITY AND IMPLEMENTATION: The method, called SpectralTDF, is available at https://sourceforge.net/projects/spectraltdf/
CONTACT: yss@berkeley.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.