RESUMO
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these 2 fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we lay out a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., genome-wide association studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur analytically and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate by showing how a standard GWAS technique-including both the genetic relatedness matrix (GRM) as well as its leading eigenvectors, corresponding to the principal components of the genotype matrix, in a regression model-can mitigate spurious correlations in phylogenetic analyses. As a case study, we re-examine an analysis testing for coevolution of expression levels between genes across a fungal phylogeny and show that including eigenvectors of the covariance matrix as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
Assuntos
Estudo de Associação Genômica Ampla , Modelos Genéticos , Fenótipo , Filogenia , Estudo de Associação Genômica Ampla/métodos , Genética Populacional/métodos , Locos de Características Quantitativas , Humanos , Simulação por Computador , Modelos EstatísticosRESUMO
As genetic sequencing costs have plummeted, datasets with sizes previously un-thinkable have begun to appear. Such datasets present new opportunities to learn about evolutionary history, particularly via rare alleles that record the very recent past. However, beyond the computational challenges inherent in the analysis of many large-scale datasets, large population-genetic datasets present theoretical problems. In particular, the majority of population-genetic tools require the assumption that each mutant allele in the sample is the result of a single mutation (the "infinite sites" assumption), which is violated in large samples. Here, we present DR EVIL, a method for estimating mutation rates and recent demographic history from very large samples. DR EVIL avoids the infinite-sites assumption by using a diffusion approximation to a branching-process model with recurrent mutation. The branching-process approach limits the method to rare alleles, but, along with recent results, renders tractable likelihoods with recurrent mutation. We show that DR EVIL performs well in simulations and apply it to rare-variant data from a million haploid samples, identifying a signal of mutation-rate heterogeneity within commonly analyzed classes and predicting that in modern sample sizes, most rare variants at sites with high mutation rates represent the descendants of multiple mutation events.
RESUMO
Many remarkable innovations have repeatedly occurred across vast evolutionary distances. When convergent traits emerge on the tree of life, they are sometimes driven by the same underlying gene families, while other times many different gene families are involved. Conversely, a gene family may be repeatedly recruited for a single trait or many different traits. To understand the general rules governing convergence at both genomic and phenotypic levels, we systematically tested associations between 56 binary metabolic traits and gene count in 14,710 gene families from 993 species of Saccharomycotina yeasts. Using a recently developed phylogenetic approach that reduces spurious correlations, we discovered that gene family expansion and contraction was significantly linked to trait gain and loss in 45/56 (80%) of traits. While 601/746 (81%) of significant gene families were associated with only one trait, we also identified several 'keystone' gene families that were significantly associated with up to 13/56 (23%) of all traits. These results indicate that metabolic innovations in yeasts are governed by a narrow set of major genetic elements and mechanisms.
RESUMO
Many phenotypic traits have a polygenic genetic basis, making it challenging to learn their genetic architectures and predict individual phenotypes. One promising avenue to resolve the genetic basis of complex traits is through evolve-and-resequence experiments, in which laboratory populations are exposed to some selective pressure and trait-contributing loci are identified by extreme frequency changes over the course of the experiment. However, small laboratory populations will experience substantial random genetic drift, and it is difficult to determine whether selection played a roll in a given allele frequency change. Predicting how much allele frequencies change under drift and selection had remained an open problem well into the 21st century, even those contributing to simple, monogenic traits. Recently, there have been efforts to apply the path integral, a method borrowed from physics, to solve this problem. So far, this approach has been limited to genic selection, and is therefore inadequate to capture the complexity of quantitative, highly polygenic traits that are commonly studied. Here we extend one of these path integral methods, the perturbation approximation, to selection scenarios that are of interest to quantitative genetics. In particular, we derive analytic expressions for the transition probability (i.e., the probability that an allele will change in frequency from x , to y in time t ) of an allele contributing to a trait subject to stabilizing selection, as well as that of an allele contributing to a trait rapidly adapting to a new phenotypic optimum. We use these expressions to characterize the use of allele frequency change to test for selection, as well as explore optimal design choices for evolve-and-resequence experiments to uncover the genetic architecture of polygenic traits under selection.
RESUMO
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique-including both the genetic relatedness matrix (GRM) as well as its leading eigenvectors, corresponding to the principal components of the genotype matrix, in a regression model-can mitigate spurious correlations in phylogenetic analyses. As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
RESUMO
Collecting genomics data across multiple heterogeneous populations (e.g., across different cancer types) has the potential to improve our understanding of disease. Despite sequencing advances, though, resources often remain a constraint when gathering data. So it would be useful for experimental design if experimenters with access to a pilot study could predict the number of new variants they might expect to find in a follow-up study: both the number of new variants shared between the populations and the total across the populations. While many authors have developed prediction methods for the single-population case, we show that these predictions can fare poorly across multiple populations that are heterogeneous. We prove that, surprisingly, a natural extension of a state-of-the-art single-population predictor to multiple populations fails for fundamental reasons. We provide the first predictor for the number of new shared variants and new total variants that can handle heterogeneity in multiple populations. We show that our proposed method works well empirically using real cancer and population genetics data.
RESUMO
Without the ability to control or randomize environments (or genotypes), it is difficult to determine the degree to which observed phenotypic differences between two groups of individuals are due to genetic vs. environmental differences. However, some have suggested that these concerns may be limited to pathological cases, and methods have appeared that seem to give-directly or indirectly-some support to claims that aggregate heritable variation within groups can be related to heritable variation among groups. We consider three families of approaches: the "between-group heritability" sometimes invoked in behavior genetics, the statistic [Formula: see text] used in empirical work in evolutionary quantitative genetics, and methods based on variation in ancestry in an admixed population, used in anthropological and statistical genetics. We take up these examples to show mathematically that information on within-group genetic and phenotypic information in the aggregate cannot separate among-group differences into genetic and environmental components, and we provide simulation results that support our claims. We discuss these results in terms of the long-running debate on this topic.
Assuntos
Evolução Biológica , Genética Populacional , Humanos , Fenótipo , Genótipo , Simulação por Computador , Variação GenéticaRESUMO
Without the ability to control or randomize environments (or genotypes), it is difficult to determine the degree to which observed phenotypic differences between two groups of individuals are due to genetic vs. environmental differences. However, some have suggested that these concerns may be limited to pathological cases, and methods have appeared that seem to give-directly or indirectly-some support to claims that aggregate heritable variation within groups can be related to heritable variation among groups. We consider three families of approaches: the "between-group heritability" sometimes invoked in behavior genetics, the statistic PST used in empirical work in evolutionary quantitative genetics, and methods based on variation in ancestry in an admixed population, used in anthropological and statistical genetics. We take up these examples to show mathematically that information on within-group genetic and phenotypic information in the aggregate cannot separate among-group differences into genetic and environmental components, and we provide simulation results that support our claims. We discuss these results in terms of the long-running debate on this topic.
RESUMO
Noncoding DNA is central to our understanding of human gene regulation and complex diseases1,2, and measuring the evolutionary sequence constraint can establish the functional relevance of putative regulatory elements in the human genome3-9. Identifying the genomic elements that have become constrained specifically in primates has been hampered by the faster evolution of noncoding DNA compared to protein-coding DNA10, the relatively short timescales separating primate species11, and the previously limited availability of whole-genome sequences12. Here we construct a whole-genome alignment of 239 species, representing nearly half of all extant species in the primate order. Using this resource, we identified human regulatory elements that are under selective constraint across primates and other mammals at a 5% false discovery rate. We detected 111,318 DNase I hypersensitivity sites and 267,410 transcription factor binding sites that are constrained specifically in primates but not across other placental mammals and validate their cis-regulatory effects on gene expression. These regulatory elements are enriched for human genetic variants that affect gene expression and complex traits and diseases. Our results highlight the important role of recent evolution in regulatory sequence elements differentiating primates, including humans, from other placental mammals.
Assuntos
Sequência Conservada , Evolução Molecular , Genoma , Primatas , Animais , Feminino , Humanos , Gravidez , Sequência Conservada/genética , Desoxirribonuclease I/metabolismo , DNA/genética , DNA/metabolismo , Genoma/genética , Mamíferos/classificação , Mamíferos/genética , Placenta , Primatas/classificação , Primatas/genética , Sequências Reguladoras de Ácido Nucleico/genética , Reprodutibilidade dos Testes , Fatores de Transcrição/metabolismo , Proteínas/genética , Regulação da Expressão Gênica/genéticaRESUMO
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Assuntos
Genética Populacional , Estudo de Associação Genômica Ampla , Locos de Características Quantitativas , Humanos , Mapeamento Cromossômico/métodos , Modelos Genéticos , Fenótipo , Locos de Características Quantitativas/genética , Havaiano Nativo ou Outro Ilhéu do Pacífico/genéticaRESUMO
We examined 454,712 exomes for genes associated with a wide spectrum of complex traits and common diseases and observed that rare, penetrant mutations in genes implicated by genome-wide association studies confer ~10-fold larger effects than common variants in the same genes. Consequently, an individual at the phenotypic extreme and at the greatest risk for severe, early-onset disease is better identified by a few rare penetrant variants than by the collective action of many common variants with weak effects. By combining rare variants across phenotype-associated genes into a unified genetic risk model, we demonstrate superior portability across diverse global populations compared with common-variant polygenic risk scores, greatly improving the clinical utility of genetic-based risk prediction.
Assuntos
Predisposição Genética para Doença , Herança Multifatorial , Penetrância , Humanos , Estudo de Associação Genômica Ampla , Mutação , Fenótipo , Fatores de RiscoRESUMO
The rich diversity of morphology and behavior displayed across primate species provides an informative context in which to study the impact of genomic diversity on fundamental biological processes. Analysis of that diversity provides insight into long-standing questions in evolutionary and conservation biology and is urgent given severe threats these species are facing. Here, we present high-coverage whole-genome data from 233 primate species representing 86% of genera and all 16 families. This dataset was used, together with fossil calibration, to create a nuclear DNA phylogeny and to reassess evolutionary divergence times among primate clades. We found within-species genetic diversity across families and geographic regions to be associated with climate and sociality, but not with extinction risk. Furthermore, mutation rates differ across species, potentially influenced by effective population sizes. Lastly, we identified extensive recurrence of missense mutations previously thought to be human specific. This study will open a wide range of research avenues for future primate genomic research.
Assuntos
Evolução Biológica , Variação Genética , Primatas , Animais , Humanos , Genoma , Taxa de Mutação , Filogenia , Primatas/genética , Densidade DemográficaRESUMO
Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.
Assuntos
Variação Genética , Primatas , Animais , Humanos , Sequência de Bases , Frequência do Gene , Primatas/genética , Sequenciamento Completo do GenomaRESUMO
We examined 454,712 exomes for genes associated with a wide spectrum of complex traits and common diseases and observed that rare, penetrant mutations in genes implicated by genome-wide association studies confer â¼10-fold larger effects than common variants in the same genes. Consequently, an individual at the phenotypic extreme and at the greatest risk for severe, early-onset disease is better identified by a few rare penetrant variants than by the collective action of many common variants with weak effects. By combining rare variants across phenotype-associated genes into a unified genetic risk model, we demonstrate superior portability across diverse global populations compared to common variant polygenic risk scores, greatly improving the clinical utility of genetic-based risk prediction. One sentence summary: Rare variant polygenic risk scores identify individuals with outlier phenotypes in common human diseases and complex traits.
RESUMO
Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole genome sequencing data for 809 individuals from 233 primate species, and identified 4.3 million common protein-altering variants with orthologs in human. We show that these variants can be inferred to have non-deleterious effects in human based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases. One Sentence Summary: Deep learning classifier trained on 4.3 million common primate missense variants predicts variant pathogenicity in humans.
RESUMO
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide Association Studies (GWAS) are a powerful way to find genetic loci associated with phenotypes. GWAS are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix given the ARG (local eGRM). Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to identify a large-effect BMI locus, the CREBRF gene, in a sample of Native Hawaiians in which it was not previously detectable by GWAS because of a lack of population-specific imputation resources. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
RESUMO
Human populations have been shaped by catastrophes that may have left long-lasting signatures in their genomes. One notable example is the second plague pandemic that entered Europe in ca. 1,347 CE and repeatedly returned for over 300 years, with typical village and town mortality estimated at 10%-40%.1 It is assumed that this high mortality affected the gene pools of these populations. First, local population crashes reduced genetic diversity. Second, a change in frequency is expected for sequence variants that may have affected survival or susceptibility to the etiologic agent (Yersinia pestis).2 Third, mass mortality might alter the local gene pools through its impact on subsequent migration patterns. We explored these factors using the Norwegian city of Trondheim as a model, by sequencing 54 genomes spanning three time periods: (1) prior to the plague striking Trondheim in 1,349 CE, (2) the 17th-19th century, and (3) the present. We find that the pandemic period shaped the gene pool by reducing long distance immigration, in particular from the British Isles, and inducing a bottleneck that reduced genetic diversity. Although we also observe an excess of large FST values at multiple loci in the genome, these are shaped by reference biases introduced by mapping our relatively low genome coverage degraded DNA to the reference genome. This implies that attempts to detect selection using ancient DNA (aDNA) datasets that vary by read length and depth of sequencing coverage may be particularly challenging until methods have been developed to account for the impact of differential reference bias on test statistics.
Assuntos
Peste , Humanos , Peste/epidemiologia , Peste/genética , Pandemias/história , Metagenômica , Genoma Bacteriano , FilogeniaRESUMO
The genetic architecture of atrial fibrillation (AF) encompasses low impact, common genetic variants and high impact, rare variants. Here, we characterize a high impact AF-susceptibility allele, KCNQ1 R231H, and describe its transcontinental geographic distribution and history. Induced pluripotent stem cell-derived cardiomyocytes procured from risk allele carriers exhibit abbreviated action potential duration, consistent with a gain-of-function effect. Using identity-by-descent (IBD) networks, we estimate the broad- and fine-scale population ancestry of risk allele carriers and their relatives. Analysis of ancestral migration routes reveals ancestors who inhabited Denmark in the 1700s, migrated to the Northeastern United States in the early 1800s, and traveled across the Midwest to arrive in Utah in the late 1800s. IBD/coalescent-based allele dating analysis reveals a relatively recent origin of the AF risk allele (~5000 years). Thus, our approach broadens the scope of study for disease susceptibility alleles to the context of human migration and ancestral origins.
Assuntos
Fibrilação Atrial/genética , Predisposição Genética para Doença/genética , Canal de Potássio KCNQ1/genética , Mutação de Sentido Incorreto , Polimorfismo de Nucleotídeo Único , Potenciais de Ação , Alelos , Dinamarca , Emigrantes e Imigrantes , Feminino , Genótipo , Geografia , Humanos , Células-Tronco Pluripotentes Induzidas/citologia , Células-Tronco Pluripotentes Induzidas/metabolismo , Masculino , Pessoa de Meia-Idade , Miócitos Cardíacos/citologia , Miócitos Cardíacos/metabolismo , Miócitos Cardíacos/fisiologia , Linhagem , Fatores de Risco , UtahRESUMO
BACKGROUND: We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual's ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands, then annotating those models with population information from reference panels of known ancestry. RESULTS: The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it averages less than one minute to assign ancestry from 32 populations using 10 CPU). We test ARCHes on public data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) as well as simulated examples of known admixture. CONCLUSIONS: Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales regardless of the amount of population admixture.
Assuntos
Genética Populacional , Genoma Humano , Haplótipos , Humanos , Polimorfismo de Nucleotídeo ÚnicoRESUMO
Neanderthals and anatomically modern humans overlapped geographically for a period of over 30,000 years following human migration out of Africa. During this period, Neanderthals and humans interbred, as evidenced by Neanderthal portions of the genome carried by non-African individuals today. A key observation is that the proportion of Neanderthal ancestry is ~12-20% higher in East Asian individuals relative to European individuals. Here, we explore various demographic models that could explain this observation. These include distinguishing between a single admixture event and multiple Neanderthal contributions to either population, and the hypothesis that reduced Neanderthal ancestry in modern Europeans resulted from more recent admixture with a ghost population that lacked a Neanderthal ancestry component (the 'dilution' hypothesis). To summarize the asymmetric pattern of Neanderthal allele frequencies, we compiled the joint fragment frequency spectrum of European and East Asian Neanderthal fragments and compared it with both analytical theory and data simulated under various models of admixture. Using maximum-likelihood and machine learning, we found that a simple model of a single admixture did not fit the empirical data, and instead favour a model of multiple episodes of gene flow into both European and East Asian populations. These findings indicate a longer-term, more complex interaction between humans and Neanderthals than was previously appreciated.