ABSTRACT
Across the human genome, there are large-scale fluctuations in genetic diversity caused by the indirect effects of selection. This "linked selection signal" reflects the impact of selection according to the physical placement of functional regions and recombination rates along chromosomes. Previous work has shown that purifying selection acting against the steady influx of new deleterious mutations at functional portions of the genome shapes patterns of genomic variation. To date, statistical efforts to estimate purifying selection parameters from linked selection models have relied on classic Background Selection theory, which is only applicable when new mutations are so deleterious that they cannot fix in the population. Here, we develop a statistical method based on a quantitative genetics view of linked selection that models how polygenic additive fitness variance distributed along the genome increases the rate of stochastic allele frequency change. By jointly predicting the equilibrium fitness variance and substitution rate due to both strongly and weakly deleterious mutations, we estimate the distribution of fitness effects (DFE) and mutation rate across three geographically distinct human samples. While our model can accommodate weaker selection, we find evidence of strong selection operating similarly across all human samples. Although our quantitative genetic model of linked selection fits better than previous models, substitution rates of the most constrained sites disagree with observed divergence levels. We find that a model incorporating selective interference better predicts observed divergence in conserved regions, but overall our results suggest uncertainty remains about the processes generating fitness variation in humans.
Subjects
Genetic Models , Genetic Selection , Humans , Molecular Evolution , Gene Frequency/genetics , Mutation , Human Genome/genetics , Genetic Variation , Genetic Fitness
ABSTRACT
Species distributed across heterogeneous environments often evolve locally adapted ecotypes, but understanding of the genetic mechanisms involved in their formation and maintenance in the face of gene flow is incomplete. In Burkina Faso, the major African malaria mosquito Anopheles funestus comprises two strictly sympatric and morphologically indistinguishable yet karyotypically differentiated forms reported to differ in ecology and behavior. However, knowledge of the genetic basis and environmental determinants of An. funestus diversification was impeded by lack of modern genomic resources. Here, we applied deep whole-genome sequencing and analysis to test the hypothesis that these two forms are ecotypes differentially adapted to breeding in natural swamps versus irrigated rice fields. We demonstrate genome-wide differentiation despite extensive microsympatry, synchronicity, and ongoing hybridization. Demographic inference supports a split only ~1,300 y ago, closely following the massive expansion of domesticated African rice cultivation ~1,850 y ago. Regions of highest divergence, concentrated in chromosomal inversions, were under selection during lineage splitting, consistent with local adaptation. The origin of nearly all variations implicated in adaptation, including chromosomal inversions, substantially predates the ecotype split, suggesting that rapid adaptation was fueled mainly by standing genetic variation. Sharp inversion frequency differences likely facilitated adaptive divergence between ecotypes by suppressing recombination between opposing chromosomal orientations of the two ecotypes, while permitting free recombination within the structurally monomorphic rice ecotype. Our results align with growing evidence from diverse taxa that rapid ecological diversification can arise from evolutionarily old structural genetic variants that modify genetic recombination.
Subjects
Anopheles , Malaria , Oryza , Animals , Chromosome Inversion , Ecotype , Plant Breeding , Anopheles/genetics , Oryza/genetics
ABSTRACT
Spatial genetic variation is shaped in part by an organism's dispersal ability. We present a deep learning tool, disperseNN2, for estimating the mean per-generation dispersal distance from georeferenced polymorphism data. Our neural network performs feature extraction on pairs of genotypes, and uses the geographic information that comes with each sample. These attributes led disperseNN2 to outperform a state-of-the-art deep learning method that does not use explicit spatial information: the mean relative absolute error was reduced by 33% and 48% using sample sizes of 10 and 100 individuals, respectively. disperseNN2 is particularly useful for non-model organisms or systems with sparse genomic resources, as it uses unphased, single nucleotide polymorphisms as its input. The software is open source and available from https://github.com/kr-colab/disperseNN2, with documentation located at https://dispersenn2.readthedocs.io/en/latest/.
Subjects
Computer Neural Networks , Software , Humans , Genomics/methods , Genome , Single Nucleotide Polymorphism
ABSTRACT
Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC's performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.
Subjects
Anopheles/genetics , Deep Learning , Insecticide Resistance/genetics , Genetic Selection , Animals , Insect Genome
ABSTRACT
Accurately inferring the genome-wide landscape of recombination rates in natural populations is a central aim in genomics, as patterns of linkage influence everything from genetic mapping to understanding evolutionary history. Here, we describe recombination landscape estimation using recurrent neural networks (ReLERNN), a deep learning method for estimating a genome-wide recombination map that is accurate even with small numbers of pooled or individually sequenced genomes. Rather than use summaries of linkage disequilibrium as its input, ReLERNN takes columns from a genotype alignment, which are then modeled as a sequence across the genome using a recurrent neural network. We demonstrate that ReLERNN improves accuracy and reduces bias relative to existing methods and maintains high accuracy in the face of demographic model misspecification, missing genotype calls, and genome inaccessibility. We apply ReLERNN to natural populations of African Drosophila melanogaster and show that genome-wide recombination landscapes, although largely correlated among populations, exhibit important population-specific differences. Lastly, we connect the inferred patterns of recombination with the frequencies of major inversions segregating in natural Drosophila populations.
Subjects
Deep Learning , Genomics/methods , Genetic Recombination , Animals , Drosophila melanogaster
ABSTRACT
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
Subjects
Data Mining/methods , Population Genetics , Human Genome , Supervised Machine Learning , Biological Evolution , Datasets as Topic , Humans , Genetic Selection
ABSTRACT
Hybridization and gene flow between species appear to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus, it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees), capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia.
Subjects
Drosophila simulans/genetics , Drosophila/genetics , Insect Genome , Supervised Machine Learning , Animals , Computer Simulation , Drosophila/classification , Drosophila simulans/classification , Molecular Evolution , Gene Flow , Genetic Speciation , Genetic Variation , Population Genetics , Haplotypes , Genetic Hybridization , Genetic Models , Software , Species Specificity , Supervised Machine Learning/statistics & numerical data
ABSTRACT
Evidence for adaptation to different climates in the model species Arabidopsis thaliana is seen in reciprocal transplant experiments, but the genetic basis of this adaptation remains poorly understood. Field-based quantitative trait locus (QTL) studies provide direct but low-resolution evidence for the genetic basis of local adaptation. Using high-resolution population genomic approaches, we examine local adaptation along previously identified genetic trade-off (GT) and conditionally neutral (CN) QTLs for fitness between locally adapted Italian and Swedish A. thaliana populations [Ågren J, et al. (2013) Proc Natl Acad Sci USA 110:21077-21082]. We find that genomic regions enriched in high FST SNPs colocalize with GT QTL peaks. Many of these high FST regions also colocalize with regions enriched for SNPs significantly correlated to climate in Eurasia and evidence of recent selective sweeps in Sweden. Examining unfolded site frequency spectra across genes containing high FST SNPs suggests GTs may be due to more recent adaptation in Sweden than Italy. Finally, we collapse a list of thousands of genes spanning GT QTLs to 42 genes that likely underlie the observed GTs and explore potential biological processes driving these trade-offs, from protein phosphorylation, to seed dormancy and longevity. Our analyses link population genomic analyses and field-based QTL studies of local adaptation, and emphasize that GTs play an important role in the process of local adaptation.
Subjects
Physiological Adaptation/genetics , Arabidopsis/genetics , Plant Genome , Single Nucleotide Polymorphism , Quantitative Trait Loci , Italy , Sweden
ABSTRACT
In this perspective, we evaluate the explanatory power of the neutral theory of molecular evolution, 50 years after its introduction by Kimura. We argue that the neutral theory was supported by unreliable theoretical and empirical evidence from the beginning, and that in light of modern, genome-scale data, we can firmly reject its universality. The ubiquity of adaptive variation both within and between species means that a more comprehensive theory of molecular evolution must be sought.
Subjects
Molecular Evolution , Genetic Drift , Genetic Selection , Animals , Humans
ABSTRACT
Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover, we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus, even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally, we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample, and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.
Subjects
Genetic Drift , Population Genetics , Machine Learning , Genetic Selection/genetics , Human Chromosome Pair 18/genetics , Human Genome , Haplotypes/genetics , Humans
ABSTRACT
The degree to which adaptation in recent human evolution shapes genetic variation remains controversial. This is in part due to the limited evidence in humans for classic "hard selective sweeps", wherein a novel beneficial mutation rapidly sweeps through a population to fixation. However, positive selection may often proceed via "soft sweeps" acting on mutations already present within a population. Here, we examine recent positive selection across six human populations using a powerful machine learning approach that is sensitive to both hard and soft sweeps. We found evidence that soft sweeps are widespread and account for the vast majority of recent human adaptation. Surprisingly, our results also suggest that linked positive selection affects patterns of variation across much of the genome, and may increase the frequencies of deleterious mutations. Our results also reveal insights into the role of sexual selection, cancer risk, and central nervous system development in recent human evolution.
Subjects
Physiological Adaptation/genetics , Human Genome/genetics , Acclimatization , Biological Adaptation/genetics , Nucleic Acid Databases , Molecular Evolution , Genetic Variation/genetics , Population Genetics , Humans , Machine Learning , Mutation , Genetic Selection/genetics
ABSTRACT
Here we describe discoal, a coalescent simulator able to generate population samples that include selective sweeps in a feature-rich, flexible manner. discoal can perform simulations conditioning on the fixation of an allele due to drift or either hard or soft sweeps, even those occurring a large genetic distance away from the simulated locus. discoal can simulate sweeps with recurrent mutation to the adaptive allele, recombination, and gene conversion, under non-equilibrium demographic histories and without specifying an allele frequency trajectory in advance. AVAILABILITY AND IMPLEMENTATION: discoal is implemented in the C programming language. Source code is freely available on GitHub (https://github.com/kern-lab/discoal) under a GNU General Public License. CONTACT: kern@dls.rutgers.edu or dan.schrider@rutgers.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Alleles , Computational Biology/methods , Population Genetics/methods , Software , Computer Simulation , Gene Frequency , Genetic Models , Mutation , Programming Languages , Stochastic Processes
ABSTRACT
Ultraconserved elements (UCEs), stretches of DNA that are identical between distantly related species, are enigmatic genomic features whose function is not well understood. First identified and characterized in mammals, UCEs have been proposed to play important roles in gene regulation, RNA processing, and maintaining genome integrity. However, because all of these functions can tolerate some sequence variation, their ultraconserved and ultraselected nature is not explained. We investigated whether there are highly conserved DNA elements without genic function in distantly related plant genomes. We compared the genomes of Arabidopsis thaliana and Vitis vinifera, species that diverged ~115 million years ago (Mya). We identified 36 highly conserved elements with at least 85% similarity that are longer than 55 bp. Interestingly, these elements exhibit properties similar to mammalian UCEs, such that we named them UCE-like elements (ULEs). ULEs are located in intergenic or intronic regions and are depleted from segmental duplications. Like UCEs, ULEs are under strong purifying selection, suggesting a functional role for these elements. Like their mammalian counterparts, ULEs show a sharp drop of A+T content at their borders and are enriched close to genes encoding transcription factors and genes involved in development, the latter showing preferential expression in undifferentiated tissues. By comparing the genomes of Brachypodium distachyon and Oryza sativa, species that diverged ~50 Mya, we identified a different set of ULEs with similar properties in monocots. The identification of ULEs in plant genomes offers new opportunities to study their possible roles in genome function, integrity, and regulation.
Subjects
Computational Biology/methods , Conserved Sequence , Plant Genome , Arabidopsis/genetics , Brachypodium/genetics , DNA Methylation , Molecular Evolution , Genetic Variation , Introns , Oryza/genetics , Genetic Selection , DNA Sequence Analysis , Sorghum/genetics , Vitis/genetics , Zea mays/genetics
ABSTRACT
Here, we describe the construction of a phylogenetically deep, whole-genome alignment of 20 flowering plants, along with an analysis of plant genome conservation. Each included angiosperm genome was aligned to a reference genome, Arabidopsis thaliana, using the LASTZ/MULTIZ paradigm and tools from the University of California-Santa Cruz Genome Browser source code. In addition to the multiple alignment, we created a local genome browser displaying multiple tracks of newly generated genome annotation, as well as annotation sourced from published data of other research groups. An investigation into A. thaliana gene features present in the aligned A. lyrata genome revealed better conservation of start codons, stop codons, and splice sites within our alignments (51% of features from A. thaliana conserved without interruption in A. lyrata) when compared with previous publicly available plant pairwise alignments (34% of features conserved). The detailed view of conservation across angiosperms revealed not only high coding-sequence conservation but also a large set of previously uncharacterized intergenic conservation. From this, we annotated the collection of conserved features, revealing dozens of putative noncoding RNAs, including some with recorded small RNA expression. Comparing conservation between kingdoms revealed a faster decay of vertebrate genome features when compared with angiosperm genomes. Finally, conserved sequences were searched for folding RNA features, including but not limited to noncoding RNA (ncRNA) genes. Among these, we highlight a double hairpin in the 5'-untranslated region (5'-UTR) of the PRIN2 gene and a putative ncRNA with homology targeting the LAF3 protein.
Subjects
Arabidopsis/genetics , Codon/genetics , Conserved Sequence/genetics , Plant Genome , Animals , Genetic Databases , Magnoliopsida/genetics , Untranslated RNA/genetics , Sequence Alignment , Vertebrates
ABSTRACT
The often tight association between parasites and their hosts means that under certain scenarios, the evolutionary histories of the two species can become closely coupled both through time and across space. Using spatial genetic inference, we identify a potential signal of common dispersal patterns in the Anopheles gambiae and Plasmodium falciparum host-parasite system as seen through a between-species correlation of the differences between geographic sampling location and geographic location predicted from the genome. This correlation may be due to coupled dispersal dynamics between host and parasite but may also reflect statistical artifacts due to uneven spatial distribution of sampling locations. Using continuous-space population genetics simulations, we investigate the degree to which uneven distribution of sampling locations leads to bias in prediction of spatial location from genetic data and implement methods to counter this effect. We demonstrate that while algorithmic bias presents a problem in inference from spatio-genetic data, the correlation structure between A. gambiae and P. falciparum predictions cannot be attributed to spatial bias alone and is thus likely a genetic signal of co-dispersal in a host-parasite system.
Subjects
Anopheles , Falciparum Malaria , Parasites , Plasmodium , Animals , Parasites/genetics , Anopheles/genetics , Anopheles/parasitology , Host-Parasite Interactions/genetics , Plasmodium/genetics , Plasmodium falciparum/genetics , Geography
ABSTRACT
For at least the past 5 decades, population genetics, as a field, has worked to describe the precise balance of forces that shape patterns of variation in genomes. The problem is challenging because modeling the interactions between evolutionary processes is difficult, and different processes can impact genetic variation in similar ways. In this paper, we describe how diversity and divergence between closely related species change with time, using correlations between landscapes of genetic variation as a tool to understand the interplay between evolutionary processes. We find strong correlations between landscapes of diversity and divergence in a well-sampled set of great ape genomes, and explore how various processes such as incomplete lineage sorting, mutation rate variation, GC-biased gene conversion and selection contribute to these correlations. Through highly realistic, chromosome-scale, forward-in-time simulations, we show that the landscapes of diversity and divergence in the great apes are too well correlated to be explained via strictly neutral processes alone. Our best fitting simulation includes both deleterious and beneficial mutations in functional portions of the genome, in which 9% of fixations within those regions are driven by positive selection. This study provides a framework for modeling genetic variation in closely related species, an approach which can shed light on the complex balance of forces that have shaped genetic variation.
Subjects
Genetic Variation , Hominidae , Animals , Genetic Selection , Hominidae/genetics , Mutation , Genomics
ABSTRACT
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space, however there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity-by-descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic based methods like ours complement other, direct methods for estimating past and present demography, and we believe will serve as valuable tools for applications in conservation, ecology and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Subjects
Population Genetics , Computer Neural Networks , Single Nucleotide Polymorphism , Animals , Population Genetics/methods , Wolves/genetics , Wolves/classification , Population Density , Demography/methods , Genotype
ABSTRACT
Individual-based simulation has become an increasingly crucial tool for many fields of population biology. However, implementing realistic and stable simulations in continuous space presents a variety of difficulties, from modeling choices to computational efficiency. This paper aims to be a practical guide to spatial simulation, helping researchers to implement realistic and efficient spatial, individual-based simulations and avoid common pitfalls. To do this, we delve into mechanisms of mating, reproduction, density-dependent feedback, and dispersal, all of which may vary across the landscape, discuss how these affect population dynamics, and describe how to parameterize simulations in convenient ways (for instance, to achieve a desired population density). We also demonstrate how to implement these models using the current version of the individual-based simulator, SLiM. Since SLiM has the capacity to simulate genomes, we also discuss natural selection - in particular, how genetic variation can affect demographic processes. Finally, we provide four short vignettes: simulations of pikas that shift their range up a mountain as temperatures rise; mosquitoes that live in rivers as juveniles and experience seasonally changing habitat; cane toads that expand across Australia, reaching 120 million individuals; and monarch butterflies whose populations are regulated by an explicitly modeled resource (milkweed).
ABSTRACT
Sex chromosomes are critical elements of sexual reproduction in many animal and plant taxa; however, they show incredible diversity and rapid turnover even within clades. Until now, the mechanism of sex determination in cephalopods has been a mystery. Using a chromosome-level genome assembly generated with long read sequencing, we report the first evidence for genetic sex determination in cephalopods. We have uncovered a sex chromosome in California two-spot octopus (Octopus bimaculoides) in which males/females have ZZ/ZO karyotypes respectively. We show that the octopus Z chromosome is an evolutionary outlier with respect to divergence and repetitive element content as compared to autosomes and that it is present in all cephalopods that we have examined including Nautilus, the outgroup to squids and octopuses. Our results suggest that the cephalopod Z chromosome system originated before the split of all extant cephalopod lineages, over 480 million years ago, and has been conserved to the present, making it among the oldest conserved animal sex chromosome systems known.