ABSTRACT
Despite broad agreement that Homo sapiens originated in Africa, considerable uncertainty surrounds specific models of divergence and migration across the continent1. Progress is hampered by a shortage of fossil and genomic data, as well as variability in previous estimates of divergence times1. Here we seek to discriminate among such models by considering linkage disequilibrium and diversity-based statistics, optimized for rapid, complex demographic inference2. We infer detailed demographic models for populations across Africa, including eastern and western representatives, and newly sequenced whole genomes from 44 Nama (Khoe-San) individuals from southern Africa. We infer a reticulated African population history in which present-day population structure dates back to Marine Isotope Stage 5. The earliest population divergence among contemporary populations occurred 120,000 to 135,000 years ago and was preceded by links between two or more weakly differentiated ancestral Homo populations connected by gene flow over hundreds of thousands of years. Such weakly structured stem models explain patterns of polymorphism that had previously been attributed to contributions from archaic hominins in Africa2-7. In contrast to models with archaic introgression, we predict that fossil remains from coexisting ancestral populations should be genetically and morphologically similar, and that only an inferred 1-4% of genetic differentiation among contemporary human populations can be attributed to genetic drift between stem populations. We show that model misspecification explains the variation in previous estimates of divergence times, and argue that studying a range of models is key to making robust inferences about deep history.
Subjects
Population Genetics , Human Migration , Phylogeny , Humans , Africa/ethnology , Fossils , Gene Flow , Genetic Drift , Genetic Introgression , Human Genome , Ancient History , Human Migration/history , Linkage Disequilibrium/genetics , Genetic Polymorphism , Time Factors
ABSTRACT
Latin America continues to be severely underrepresented in genomics research, and fine-scale genetic histories and complex trait architectures remain hidden owing to insufficient data1. To fill this gap, the Mexican Biobank project genotyped 6,057 individuals from 898 rural and urban localities across all 32 states in Mexico at a resolution of 1.8 million genome-wide markers with linked complex trait and disease information creating a valuable nationwide genotype-phenotype database. Here, using ancestry deconvolution and inference of identity-by-descent segments, we inferred ancestral population sizes across Mesoamerican regions over time, unravelling Indigenous, colonial and postcolonial demographic dynamics2-6. We observed variation in runs of homozygosity among genomic regions with different ancestries reflecting distinct demographic histories and, in turn, different distributions of rare deleterious variants. We conducted genome-wide association studies (GWAS) for 22 complex traits and found that several traits are better predicted using the Mexican Biobank GWAS compared to the UK Biobank GWAS7,8. We identified genetic and environmental factors associating with trait variation, such as the length of the genome in runs of homozygosity as a predictor for body mass index, triglycerides, glucose and height. This study provides insights into the genetic histories of individuals in Mexico and dissects their complex trait architectures, both crucial for making precision and preventive medicine initiatives accessible worldwide.
Subjects
Biological Specimen Banks , Medical Genetics , Human Genome , Genomics , Hispanic or Latino , Humans , Blood Glucose/genetics , Blood Glucose/metabolism , Body Height/genetics , Body Mass Index , Gene-Environment Interaction , Genetic Markers/genetics , Genome-Wide Association Study , Hispanic or Latino/classification , Hispanic or Latino/genetics , Homozygote , Mexico , Phenotype , Triglycerides/blood , Triglycerides/genetics , United Kingdom , Human Genome/genetics
ABSTRACT
Demographic models of Latin American populations often fail to fully capture their complex evolutionary history, which has been shaped by both recent admixture and deeper-in-time demographic events. To address this gap, we used high-coverage whole-genome data from Indigenous American ancestries in present-day Mexico and existing genomes from across Latin America to infer multiple demographic models that capture the impact of different timescales on genetic diversity. Our approach, which combines analyses of allele frequencies and ancestry tract length distributions, represents a significant improvement over current models in predicting patterns of genetic variation in admixed Latin American populations. We jointly modeled the contribution of European, African, East Asian, and Indigenous American ancestries into present-day Latin American populations. We infer that the ancestors of Indigenous Americans and East Asians diverged ~30 thousand years ago, and we characterize genetic contributions of recent migrations from East and Southeast Asia to Peru and Mexico. Our inferred demographic histories are consistent across different genomic regions and annotations, suggesting that our inferences are robust to the potential effects of linked selection. In conjunction with published distributions of fitness effects for new nonsynonymous mutations in humans, we show in large-scale simulations that our models recover important features of both neutral and deleterious variation. By providing a more realistic framework for understanding the evolutionary history of Latin American populations, our models can help address the historical under-representation of admixed groups in genomics research and can be a valuable resource for future studies of populations with complex admixture and demographic histories.
Subjects
Population Genetics , Human Genome , Humans , Latin America , Human Genome/genetics , Demography , White People
ABSTRACT
Wang et al. (2023) recently proposed an approach to infer the history of human generation intervals from changes in mutation profiles over time. As the relative proportions of different mutation types depend on the ages of parents, binning variants by the time they arose allows for the inference of changes in average paternal and maternal generation intervals. Applying this approach to published allele age estimates, Wang et al. (2023) inferred long-lasting sex differences in average generation times and surprisingly found that ancestral generation times of West African populations remained substantially higher than those of Eurasian populations extending tens of thousands of generations into the past. Here, we argue that the results and interpretations in Wang et al. (2023) are primarily driven by noise and biases in input data and a lack of validation using independent approaches for estimating allele ages. With the recent development of methods to reconstruct genome-wide gene genealogies, coalescence times, and allele ages, we caution that downstream analyses may be strongly influenced by uncharacterized biases in their output.
Subjects
Uncertainty , Humans , Female , Male , Mutation , Alleles
ABSTRACT
As populations boom and bust, the accumulation of genetic diversity is modulated, encoding histories of living populations in present-day variation. Many methods exist to decode these histories, and all must make strong model assumptions. It is typical to assume that mutations accumulate uniformly across the genome at a constant rate that does not vary between closely related populations. However, recent work shows that mutational processes in human and great ape populations vary across genomic regions and evolve over time. This perturbs the mutation spectrum (relative mutation rates in different local nucleotide contexts). Here, we develop theoretical tools in the framework of Kingman's coalescent to accommodate mutation spectrum dynamics. We present mutation spectrum history inference (mushi), a method to perform nonparametric inference of demographic and mutation spectrum histories from allele frequency data. We use mushi to reconstruct trajectories of effective population size and mutation spectrum divergence between human populations, identify mutation signatures and their dynamics in different human populations, and calibrate the timing of a previously reported mutational pulse in the ancestors of Europeans. We show that mutation spectrum histories can be placed in a well-studied theoretical setting and rigorously inferred from genomic variation data, like other features of evolutionary history.
Subjects
Gene Frequency/genetics , Population Genetics/statistics & numerical data , Genetic Models , Mutation/genetics , Animals , Genetic Variation/genetics , Genomics , Hominidae/genetics , Humans , Mutation Rate , Population Density
ABSTRACT
Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.
Subjects
Population Genetics/trends , Human Genome , Genetic Models , Software , Algorithms , Computer Simulation , Demography , Genetic Variation , Population Genetics/history , Ancient History , Human Migration/history , Human Migration/statistics & numerical data , Humans
ABSTRACT
Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
Subjects
Algorithms , Base Sequence/physiology , Population Genetics/methods , Genome-Wide Association Study/methods , Genetic Models , Cohort Studies , Computer Simulation , Molecular Evolution , Genome/genetics , Genome-Wide Association Study/statistics & numerical data , Humans , Linkage Disequilibrium , Genetic Recombination/physiology , Sample Size
ABSTRACT
Linkage-Disequilibrium Score Regression (LDSC) is a popular framework for analyzing Genome-wide Association Studies (GWAS) summary statistics that allows for estimating single nucleotide polymorphism heritability, confounding, and functional enrichment of genetic variants with different annotations. Recent work has highlighted the influence of implicit and explicit assumptions of the model on the biological interpretation of the results. In this study, we explored a formulation of LDSC that replaces the r² measure of LD with a recently proposed unbiased estimator of the D² statistic. In addition to modest statistical difference across estimators, this derivation highlighted implicit and unrealistic assumptions about the relationship between allele frequency, effect size, and annotation status. We carry out a systematic comparison of alternative LDSC formulations by applying them to summary statistics from 47 GWAS traits. Our results show that commonly used models likely underestimate functional enrichment. These results highlight the importance of calibrating the LDSC model to achieve a more robust understanding of polygenic traits.
Subjects
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Linkage Disequilibrium , Genetic Models , Single Nucleotide Polymorphism
ABSTRACT
The study of domestication contributes to our knowledge of evolution and crop genetic resources. Human selection has shaped wild Brassica rapa into diverse turnip, leafy, and oilseed crops. Despite its worldwide economic importance and potential as a model for understanding diversification under domestication, insights into the number of domestication events and initial crop(s) domesticated in B. rapa have been limited due to a lack of clarity about the wild or feral status of conspecific noncrop relatives. To address this gap and reconstruct the domestication history of B. rapa, we analyzed 68,468 genotyping-by-sequencing-derived single nucleotide polymorphisms for 416 samples in the largest diversity panel of domesticated and weedy B. rapa to date. To further understand the center of origin, we modeled the potential range of wild B. rapa during the mid-Holocene. Our analyses of genetic diversity across B. rapa morphotypes suggest that noncrop samples from the Caucasus, Siberia, and Italy may be truly wild, whereas those occurring in the Americas and much of Europe are feral. Clustering, tree-based analyses, and parameterized demographic inference further indicate that turnips were likely the first crop type domesticated, from which leafy types in East Asia and Europe were selected from distinct lineages. These findings clarify the domestication history and nature of wild crop genetic resources for B. rapa, which provides the first step toward investigating cases of possible parallel selection, the domestication and feralization syndrome, and novel germplasm for Brassica crop improvement.
Subjects
Brassica rapa/genetics , Agricultural Crops/genetics , Domestication , Genetic Models , Plant Weeds/genetics , Genetic Introgression , Genetic Variation , Genotyping Techniques , Phylogeography , Genetic Selection
ABSTRACT
The effect of a mutation on fitness may differ between populations depending on environmental and genetic context, but little is known about the factors that underlie such differences. To quantify genome-wide correlations in mutation fitness effects, we developed a novel concept called a joint distribution of fitness effects (DFE) between populations. We then proposed a new statistic w to measure the DFE correlation between populations. Using simulation, we showed that inferring the DFE correlation from the joint allele frequency spectrum is statistically precise and robust. Using population genomic data, we inferred DFE correlations of populations in humans, Drosophila melanogaster, and wild tomatoes. In these species, we found that the overall correlation of the joint DFE was inversely related to genetic differentiation. In humans and D. melanogaster, deleterious mutations had a lower DFE correlation than tolerated mutations, indicating a complex joint DFE. Altogether, the DFE correlation can be reliably inferred, and it offers extensive insight into the genetics of population divergence.
Subjects
Drosophila melanogaster , Genetic Fitness , Animals , Drosophila melanogaster/genetics , Gene Frequency , Genome , Genetic Models , Mutation
ABSTRACT
We learn about population history and underlying evolutionary biology through patterns of genetic polymorphism. Many approaches to reconstruct evolutionary histories focus on a limited number of informative statistics describing distributions of allele frequencies or patterns of linkage disequilibrium. We show that many commonly used statistics are part of a broad family of two-locus moments whose expectation can be computed jointly and rapidly under a wide range of scenarios, including complex multi-population demographies with continuous migration and admixture events. A full inspection of these statistics reveals that widely used models of human history fail to predict simple patterns of linkage disequilibrium. To jointly capture the information contained in classical and novel statistics, we implemented a tractable likelihood-based inference framework for demographic history. Using this approach, we show that human evolutionary models that include archaic admixture in Africa, Asia, and Europe provide a much better description of patterns of genetic diversity across the human genome. We estimate that an unidentified, deeply diverged population admixed with modern humans within Africa both before and after the split of African and Eurasian populations, contributing 4-8% genetic ancestry to individuals in world-wide populations.
Subjects
Molecular Evolution , Population Genetics , Human Genome/genetics , Hominidae/genetics , Africa/epidemiology , Animals , Asia/epidemiology , Black People/genetics , Europe/epidemiology , Gene Flow/genetics , Gene Frequency , Humans , Likelihood Functions , Linkage Disequilibrium , Genetic Models , Genetic Polymorphism/genetics
ABSTRACT
Linkage disequilibrium (LD) is used to infer evolutionary history, to identify genomic regions under selection, and to dissect the relationship between genotype and phenotype. In each case, we require accurate estimates of LD statistics from sequencing data. Unphased data present a challenge because multilocus haplotypes cannot be inferred exactly. Widely used estimators for the common statistics r² and D² exhibit large and variable upward biases that complicate interpretation and comparison across cohorts. Here, we show how to find unbiased estimators for a wide range of two-locus statistics, including D², for both single and multiple randomly mating populations. These unbiased statistics are particularly well suited to estimate effective population sizes from unlinked loci in small populations. We develop a simple inference pipeline and use it to refine estimates of recent effective population sizes of the threatened Channel Island Fox populations.
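As an illustration of why plug-in LD estimates are biased by finite sampling, the following minimal sketch compares a naive and an unbiased estimate of the covariance D in the simplest setting: phased haplotype counts at two biallelic loci, assumed to be a multinomial sample. It does not reproduce the estimators of the abstract above, which cover D² and unphased data; the function name `ld_estimates` and its arguments are hypothetical. The unbiased form relies only on the multinomial identity E[n_i n_j] = n(n-1) p_i p_j for i ≠ j.

```python
from fractions import Fraction


def ld_estimates(n_AB, n_Ab, n_aB, n_ab):
    """Return (naive D, unbiased D, naive r^2) from phased haplotype counts.

    Hypothetical helper for illustration; counts are the numbers of sampled
    haplotypes carrying each allele combination at two biallelic loci.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")

    # Sample haplotype frequencies, kept exact with Fraction.
    f_AB, f_Ab, f_aB, f_ab = (Fraction(c, n) for c in (n_AB, n_Ab, n_aB, n_ab))

    # Naive plug-in estimate: D = f_AB * f_ab - f_Ab * f_aB.
    d_naive = f_AB * f_ab - f_Ab * f_aB

    # Unbiased under multinomial sampling, because
    # E[n_AB * n_ab - n_Ab * n_aB] = n * (n - 1) * D.
    d_unbiased = Fraction(n_AB * n_ab - n_Ab * n_aB, n * (n - 1))

    # Naive r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)); undefined if a locus
    # is monomorphic in the sample.
    p_A = f_AB + f_Ab
    p_B = f_AB + f_aB
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2_naive = d_naive ** 2 / denom if denom != 0 else None

    return d_naive, d_unbiased, r2_naive
```

For counts (4, 1, 1, 4), the naive estimate of D is 3/20 and the unbiased estimate is 1/6; the two differ by the factor n/(n-1) = 10/9, which matters most in small samples.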
Subjects
Computational Biology/methods , Foxes/genetics , Animals , Gene Frequency , Population Genetics , Genotype , Haplotypes , Linkage Disequilibrium , Genetic Models , Phenotype , Single Nucleotide Polymorphism , Population Density , Genetic Selection
ABSTRACT
PREMISE: Polyploid species often have complex evolutionary histories that have, until recently, been intractable due to limitations of genomic resources. While recent work has further uncovered the evolutionary history of the octoploid strawberry (Fragaria L.), there are still open questions. Much is unknown about the evolutionary relationship of the wild octoploid species, Fragaria virginiana and Fragaria chiloensis, and gene flow within and among species after the formation of the octoploid genome. METHODS: We leveraged a collection of wild octoploid ecotypes of strawberry representing the recognized subspecies and ranging from Alaska to southern Chile, and a high-density SNP array to investigate wild octoploid strawberry evolution. Evolutionary relationships were interrogated with phylogenetic analysis and genetic clustering algorithms. Additionally, admixture among and within species was assessed with model-based and tree-based approaches. RESULTS: Phylogenetic analysis revealed that the two octoploid strawberry species are monophyletic sister lineages. The genetic clustering results show substructure between North and South American F. chiloensis populations. Additionally, model-based and tree-based methods support gene flow within and among the two octoploid species, including newly identified admixture in the Hawaiian F. chiloensis subsp. sandwicensis population. CONCLUSIONS: F. virginiana and F. chiloensis are supported as monophyletic and sister lineages. All but one of the subspecies show extensive paraphyly. Furthermore, phylogenetic relationships among F. chiloensis populations support a single population range expansion southward from North America. The inter- and intraspecific relationships of octoploid strawberry are complex and suggest substantial gene flow between sympatric populations among and within species.
Subjects
Fragaria , Americas , Fragaria/genetics , Plant Genome , Phylogeny , Polyploidy
ABSTRACT
Demographic modelling is often used with population genomic data to infer the relationships and ages among populations. However, relatively few analyses are able to validate these inferences with independent data. Here, we leverage written records that describe distinct Brassica rapa crops to corroborate demographic models of domestication. Brassica rapa crops are renowned for their outstanding morphological diversity, but the relationships and order of domestication remain unclear. We generated genomewide SNPs from 126 accessions collected globally using high-throughput transcriptome data. Analyses of more than 31,000 SNPs across the B. rapa genome revealed evidence for five distinct genetic groups and supported a European-Central Asian origin of B. rapa crops. Our results supported the traditionally recognized South Asian and East Asian B. rapa groups with evidence that pak choi, Chinese cabbage and yellow sarson are likely monophyletic groups. In contrast, the oil-type B. rapa subsp. oleifera and brown sarson were polyphyletic. We also found no evidence to support the contention that rapini is the wild type or the earliest domesticated subspecies of B. rapa. Demographic analyses suggested that B. rapa was introduced to Asia 2,400-4,100 years ago, and that Chinese cabbage originated 1,200-2,100 years ago via admixture of pak choi and European-Central Asian B. rapa. We also inferred significantly different levels of founder effect among the B. rapa subspecies. Written records from antiquity that document these crops are consistent with these inferences. The concordance between our age estimates of domestication events and historical records provides unique support for our demographic inferences.
Subjects
Brassica rapa/genetics , Domestication , Plant Breeding , Asia , Documentation , Founder Effect , Single Nucleotide Polymorphism , Transcriptome
ABSTRACT
Many phenotypic traits have a polygenic genetic basis, making it challenging to learn their genetic architectures and predict individual phenotypes. One promising avenue to resolve the genetic basis of complex traits is through evolve-and-resequence experiments, in which laboratory populations are exposed to some selective pressure and trait-contributing loci are identified by extreme frequency changes over the course of the experiment. However, small laboratory populations will experience substantial random genetic drift, and it is difficult to determine whether selection played a role in a given allele frequency change. Predicting how much allele frequencies change under drift and selection had remained an open problem well into the 21st century, even for alleles contributing to simple, monogenic traits. Recently, there have been efforts to apply the path integral, a method borrowed from physics, to solve this problem. So far, this approach has been limited to genic selection, and is therefore inadequate to capture the complexity of quantitative, highly polygenic traits that are commonly studied. Here we extend one of these path integral methods, the perturbation approximation, to selection scenarios that are of interest to quantitative genetics. In particular, we derive analytic expressions for the transition probability (i.e., the probability that an allele will change in frequency from x to y in time t) of an allele contributing to a trait subject to stabilizing selection, as well as that of an allele contributing to a trait rapidly adapting to a new phenotypic optimum. We use these expressions to characterize the use of allele frequency change to test for selection, as well as explore optimal design choices for evolve-and-resequence experiments to uncover the genetic architecture of polygenic traits under selection.
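To make the notion of a transition probability concrete, the sketch below computes it by brute force in the textbook discrete Wright-Fisher model with genic selection, the simpler case the abstract above says earlier work was limited to. It is not the path integral or perturbation approximation described there, and it does not model stabilizing selection; the function names, the parameter `two_n` (number of gene copies) and the selection coefficient `s` are assumptions chosen for illustration.

```python
from math import comb


def wf_transition(i, j, two_n, s=0.0):
    """One-generation probability that an allele count goes from i to j.

    Assumes binomial sampling of two_n gene copies and genic selection in
    which the focal allele has relative fitness 1 + s (with s > -1).
    """
    x = i / two_n
    # Frequency after selection, before drift.
    x_sel = x * (1 + s) / (1 + s * x)
    # Drift: binomial sampling of two_n copies at frequency x_sel.
    return comb(two_n, j) * x_sel ** j * (1 - x_sel) ** (two_n - j)


def transition_after(i, two_n, t, s=0.0):
    """Distribution over allele counts after t generations, starting at i."""
    dist = [0.0] * (two_n + 1)
    dist[i] = 1.0
    for _ in range(t):
        new = [0.0] * (two_n + 1)
        for k, p_k in enumerate(dist):
            if p_k == 0.0:
                continue
            for j in range(two_n + 1):
                new[j] += p_k * wf_transition(k, j, two_n, s)
        dist = new
    return dist
```

Here `transition_after(i, two_n, t, s)[j]` plays the role of the probability of moving from frequency x = i/two_n to y = j/two_n in time t. The cost grows quickly with population size and time, which is one reason analytic approximations are attractive.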
ABSTRACT
Human and Neanderthal populations met and mixed on multiple occasions over evolutionary time, resulting in the exchange of genetic material. New genomic analyses of diverse African populations reveal a history of bidirectional gene flow and selection acting on introgressed alleles.
Subjects
Molecular Evolution , Human Genome , Neanderthals , Animals , Humans , Alleles , Gene Flow , Genomics , Neanderthals/genetics , Genetic Selection , African People
ABSTRACT
Twentieth-century industrial whaling pushed several species to the brink of extinction, with fin whales being the most impacted. However, a small, resident population in the Gulf of California was not targeted by whaling. Here, we analyzed 50 whole genomes from the Eastern North Pacific (ENP) and Gulf of California (GOC) fin whale populations to investigate their demographic history and the genomic effects of natural and human-induced bottlenecks. We show that the two populations diverged ~16,000 years ago, after which the ENP population expanded and then suffered a 99% reduction in effective size during the whaling period. In contrast, the GOC population remained small and isolated, receiving less than one migrant per generation. However, this low level of migration has been crucial for maintaining its viability. Our study exposes the severity of whaling, emphasizes the importance of migration, and demonstrates the use of genome-based analyses and simulations to inform conservation strategies.
Subjects
Fin Whale , Humans , Animals , Genomics , Industry
ABSTRACT
Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.
Subjects
Genome , Software , Computer Simulation , Population Genetics , Genomics
ABSTRACT
Selected mutations interfere and interact with evolutionary processes at nearby loci, distorting allele frequency trajectories and creating correlations between pairs of mutations. Recent studies have used patterns of linkage disequilibrium between selected variants to test for selective interference and epistatic interactions, with some disagreement over interpreting observations from data. Interpretation is hindered by a lack of analytic or even numerical expectations for patterns of variation between pairs of loci under the combined effects of selection, dominance, epistasis, and demography. Here, I develop a numerical approach to compute the expected two-locus sampling distribution under diploid selection with arbitrary epistasis and dominance, recombination, and variable population size. I use this to explore how epistasis and dominance affect expected signed linkage disequilibrium, including for nonsteady-state demography relevant to human populations. Using whole-genome sequencing data from humans, I explore genome-wide patterns of linkage disequilibrium within protein-coding genes. I show that positive linkage disequilibrium between missense mutations within genes is driven by strong positive allele-frequency correlations between mutations that fall within the same annotated conserved domain, pointing to compensatory mutations or antagonistic epistasis as the prevailing mode of interaction within conserved genic elements. Linkage disequilibrium between missense mutations is reduced outside of conserved domains, as expected under Hill-Robertson interference. This variation in both mutational fitness effects and selective interactions within protein-coding genes calls for more refined inferences of the joint distribution of fitness and interactive effects, and the methods presented here should prove useful in that pursuit.