Results 1 - 20 of 23
1.
bioRxiv ; 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38585733

ABSTRACT

Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. In humans, traditional approaches for describing population genetic variation often rely on discrete genetic ancestry labels, which, despite their utility, can obscure the complex, multifaceted nature of human genetic history. These labels risk oversimplifying ancestry by ignoring its temporal depth and geographic continuity, and may therefore conflate notions of race, ethnicity, geography, and genetic ancestry. Here, we present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry to describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.

2.
bioRxiv ; 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38586047

ABSTRACT

We present momi3, a new method for inferring complex demographic models using genetic variation data sampled from many populations. momi3 features many improvements over its predecessor momi2 (Kamm, Terhorst, Durbin, et al., 2020), including support for continuous migration, just-in-time compilation, and execution on GPUs; a standardized interface for specifying demographic models; and a novel importance sampling strategy that enables it to efficiently analyze data from a large number of samples. Together, these improvements lead to speedups of as much as 1000× over existing state-of-the-art methods such as ∂a∂i, moments, and momi2. We illustrate the usefulness of our method by revisiting a model of archaic admixture using a large, recent dataset containing hundreds of human genomes from many populations.

3.
bioRxiv ; 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38585997

ABSTRACT

I present PHLASH, a new Bayesian method for inferring population history from whole genome sequence data. PHLASH is population history learning by averaging sampled histories: it works by drawing random, low-dimensional projections of the coalescent intensity function from the posterior distribution of a PSMC-like model, and averaging them together to form an accurate and adaptive size history estimator. On simulated data, PHLASH tends to be faster and have lower error than several competing methods including SMC++, MSMC2, and FITCOAL. Moreover, it provides a full posterior distribution over population size history, leading to automatic uncertainty quantification of the point estimates, as well as to new Bayesian testing procedures for detecting population structure and ancient bottlenecks. On the technical side, the key advance is a novel algorithm for computing the score function (gradient of the log-likelihood) of a coalescent hidden Markov model: when there are M hidden states, the algorithm requires O(M^2) time and O(1) memory per decoded position, the same cost as evaluating the log-likelihood itself using the naïve forward algorithm. This algorithm is combined with a hand-tuned implementation that fully leverages the power of modern GPU hardware, and the entire method has been released as an easy-to-use Python software package.
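
For orientation, the baseline cost quoted in this abstract (the naïve forward algorithm) can be sketched for a generic discrete hidden Markov model. This is not PHLASH code and does not compute the score function; the function name `forward_log_likelihood` and the toy parameters are assumptions made for illustration. Each position costs one matrix-vector product over the M hidden states, and only the current forward vector is kept, so memory does not grow with sequence length.

```python
import math

def forward_log_likelihood(initial, transition, emission, observations):
    """Log-likelihood of an observation sequence under a discrete HMM.

    initial[i]       = P(first state = i)
    transition[i][j] = P(next state = j | current state = i)
    emission[i][y]   = P(observe y | state = i)
    """
    M = len(initial)
    # Forward vector for the first position (unnormalized).
    alpha = [initial[i] * emission[i][observations[0]] for i in range(M)]
    log_lik = 0.0
    for y in observations[1:]:
        # Rescale to avoid underflow and accumulate the log of the scale.
        scale = sum(alpha)
        log_lik += math.log(scale)
        # One O(M^2) update; the previous forward vector is discarded.
        alpha = [
            sum(alpha[i] / scale * transition[i][j] for i in range(M)) * emission[j][y]
            for j in range(M)
        ]
    return log_lik + math.log(sum(alpha))

# Hypothetical two-state model, used only to exercise the function.
toy_initial = [0.6, 0.4]
toy_transition = [[0.9, 0.1], [0.2, 0.8]]
toy_emission = [[0.7, 0.3], [0.1, 0.9]]
toy_log_lik = forward_log_likelihood(toy_initial, toy_transition, toy_emission, [0, 1, 1])
```

The rescaling step is a standard numerical safeguard and does not change the result: the sum of the logged scales equals the log of the sequence probability.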

4.
Algorithms Mol Biol ; 18(1): 12, 2023 Aug 09.
Article in English | MEDLINE | ID: mdl-37559098

ABSTRACT

The Li-Stephens (LS) haplotype copying model forms the basis of a number of important statistical inference procedures in genetics. LS is a probabilistic generative model which supposes that a sampled chromosome is an imperfect mosaic of other chromosomes found in a population. In the frequentist setting which is the focus of this paper, the output of LS is a "copying path" through chromosome space. The behavior of LS depends crucially on two user-specified parameters, θ and ρ, which are respectively interpreted as the rates of mutation and recombination. However, because LS is not based on a realistic model of ancestry, the precise connection between these parameters and the biological phenomena they represent is unclear. Here, we offer an alternative perspective, which considers θ and ρ as tuning parameters, and seeks to understand their impact on the LS output. We derive an algorithm which, for a given dataset, efficiently partitions the (θ, ρ) plane into regions where the output of the algorithm is constant, thereby enumerating all possible solutions to the LS model in one go. We extend this approach to the "diploid LS" model commonly used for phasing. We demonstrate the usefulness of our method by studying the effects of changing θ and ρ when using LS for common bioinformatic tasks. Our findings indicate that using the conventional (i.e., population-scaled) values for θ and ρ produces near optimal results for imputation, but may systematically inflate switch error in the case of phasing diploid genotypes.

5.
J Theor Biol ; 568: 111520, 2023 07 07.
Article in English | MEDLINE | ID: mdl-37148965

ABSTRACT

Recent theoretical work on phylogenetic birth-death models offers differing viewpoints on whether they can be estimated using lineage-through-time data. Louca and Pennell (2020) showed that the class of models with continuously differentiable rate functions is nonidentifiable: any such model is consistent with an infinite collection of alternative models, which are statistically indistinguishable regardless of how much data are collected. Legried and Terhorst (2022) qualified this grave result by showing that identifiability is restored if only piecewise constant rate functions are considered. Here, we contribute new theoretical results to this discussion, in both the positive and negative directions. Our main result is to prove that models based on piecewise polynomial rate functions of any order and with any (finite) number of pieces are statistically identifiable. In particular, this implies that spline-based models with an arbitrary number of knots are identifiable. The proof is simple and self-contained, relying mainly on basic algebra. We complement this positive result with a negative one, which shows that even when identifiability holds, rate function estimation is still a difficult problem. To illustrate this, we prove some rates-of-convergence results for hypothesis testing using birth-death models. These results are information-theoretic lower bounds which apply to all potential estimators.


Subjects
Algorithms, Phylogeny

6.
Cell ; 186(5): 923-939.e14, 2023 03 02.
Article in English | MEDLINE | ID: mdl-36868214

ABSTRACT

We conduct high coverage (>30×) whole-genome sequencing of 180 individuals from 12 indigenous African populations. We identify millions of unreported variants, many predicted to be functionally important. We observe that the ancestors of southern African San and central African rainforest hunter-gatherers (RHG) diverged from other populations >200 kya and maintained a large effective population size. We observe evidence for ancient population structure in Africa and for multiple introgression events from "ghost" populations with highly diverged genetic lineages. Although currently geographically isolated, we observe evidence for gene flow between eastern and southern Khoesan-speaking hunter-gatherer populations lasting until ∼12 kya. We identify signatures of local adaptation for traits related to skin color, immune response, height, and metabolic processes. We identify a positively selected variant in the lightly pigmented San that influences pigmentation in vitro by regulating the enhancer activity and gene expression of PDPK1.


Subjects
Acclimatization, Skin Pigmentation, Humans, Whole Genome Sequencing, Population Density, Africa, 3-Phosphoinositide-Dependent Protein Kinases

7.
Genome Res ; 32(11-12): 2057-2067, 2022.
Article in English | MEDLINE | ID: mdl-36316157

ABSTRACT

We developed a novel method for efficiently estimating time-varying selection coefficients from genome-wide ancient DNA data. In simulations, our method accurately recovers selective trajectories and is robust to misspecification of population size. We applied it to a large data set of ancient and present-day human genomes from Britain and identified seven loci with genome-wide significant evidence of selection in the past 4500 yr. Almost all of them can be related to increased vitamin D or calcium levels, suggesting strong selective pressure on these or related phenotypes. However, the strength of selection on individual loci varied substantially over time, suggesting that cultural or environmental factors moderated the genetic response. Of 28 complex anthropometric and metabolic traits, skin pigmentation was the only one with significant evidence of polygenic selection, further underscoring the importance of phenotypes related to vitamin D. Our approach illustrates the power of ancient DNA to characterize selection in human populations and illuminates the recent evolutionary history of Britain.


Subjects
Ancient DNA, Genetic Selection, Humans, United Kingdom, Skin Pigmentation, Human Genome

8.
Proc Natl Acad Sci U S A ; 119(35): e2119513119, 2022 08 30.
Article in English | MEDLINE | ID: mdl-35994663

ABSTRACT

In a striking result, Louca and Pennell [S. Louca, M. W. Pennell, Nature 580, 502-505 (2020)] recently proved that a large class of phylogenetic birth-death models is statistically unidentifiable from lineage-through-time (LTT) data: Any pair of sufficiently smooth birth and death rate functions is "congruent" to an infinite collection of other rate functions, all of which have the same likelihood for any LTT vector of any dimension. As Louca and Pennell argue, this fact has distressing implications for the thousands of studies that have utilized birth-death models to study evolution. In this paper, we qualify their finding by proving that an alternative and widely used class of birth-death models is indeed identifiable. Specifically, we show that piecewise constant birth-death models can, in principle, be consistently estimated and distinguished from one another, given a sufficiently large extant timetree and some knowledge of the present-day population. Subject to mild regularity conditions, we further show that any unidentifiable birth-death model class can be arbitrarily closely approximated by a class of identifiable models. The sampling requirements needed for our results to hold are explicit and are expected to be satisfied in many contexts such as the phylodynamic analysis of a global pandemic.


Subjects
Death, Markov Chains, Biological Models, Parturition, Phylogeny, Population Dynamics, Biological Evolution, Humans, Pandemics

9.
Theor Popul Biol ; 147: 16-27, 2022 10.
Article in English | MEDLINE | ID: mdl-36007782

ABSTRACT

A number of powerful demographic inference methods have been developed in recent years, with the goal of fitting rich evolutionary models to genetic data obtained from many populations. In this paper we investigate the statistical performance of these methods in the specific case where there is continuous migration between populations. Compared with earlier work, migration significantly complicates the theoretical analysis and requires new techniques. We employ the theories of phase-type distributions and concentration of measure in order to study the two-island and isolation-with-migration models, resulting in both upper and lower bounds on rates of convergence for parametric estimators in migration models. For the upper bounds, we consider inferring rates of coalescent and migration on the basis of directly observing pairwise coalescent times, and, more realistically, when (conditionally) Poisson-distributed mutations dropped on latent trees are observed. We complement these upper bounds with information-theoretic lower bounds which establish a limit, in terms of sample size, below which inference is effectively impossible.


Subjects
Population Genetics, Genetic Models, Biological Evolution

10.
Mol Biol Evol ; 39(8), 2022 08 03.
Article in English | MEDLINE | ID: mdl-35816422

ABSTRACT

The ongoing global pandemic has sharply increased the amount of data available to researchers in epidemiology and public health. Unfortunately, few existing analysis tools are capable of exploiting all of the information contained in a pandemic-scale data set, resulting in missed opportunities for improved surveillance and contact tracing. In this paper, we develop the variational Bayesian skyline (VBSKY), a method for fitting Bayesian phylodynamic models to very large pathogen genetic data sets. By combining recent advances in phylodynamic modeling, scalable Bayesian inference and differentiable programming, along with a few tailored heuristics, VBSKY is capable of analyzing thousands of genomes in a few minutes, providing accurate estimates of epidemiologically relevant quantities such as the effective reproduction number and overall sampling effort through time. We illustrate the utility of our method by performing a rapid analysis of a large number of SARS-CoV-2 genomes, and demonstrate that the resulting estimates closely track those derived from alternative sources of public health data.


Subjects
COVID-19, Pandemics, Bayes Theorem, COVID-19/epidemiology, Humans, SARS-CoV-2/genetics

11.
Genetics ; 220(3), 2022 03 03.
Article in English | MEDLINE | ID: mdl-35100408

ABSTRACT

Neutrality tests such as Tajima's D and Fay and Wu's H are standard implements in the population genetics toolbox. One of their most common uses is to scan the genome for signals of natural selection. However, it is well understood that D and H are confounded by other evolutionary forces (in particular, population expansion) that may be unrelated to selection. Because they are not model-based, it is not clear how to deconfound these tests in a principled way. In this article, we derive new likelihood-based methods for detecting natural selection, which are robust to fluctuations in effective population size. At the core of our method is a novel probabilistic model of tree imbalance, which generalizes Kingman's coalescent to allow certain aberrant tree topologies to arise more frequently than is expected under neutrality. We derive a frequency spectrum-based estimator that can be used in place of D, and also extend to the case where genealogies are first estimated. We benchmark our methods on real and simulated data, and provide an open source software implementation.
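
As background for the statistic this abstract sets out to replace, the classical Tajima's D can be computed directly from an unfolded frequency spectrum. The sketch below implements the standard 1989 formula, not the new estimator proposed in the article; the function name `tajimas_d` and the input convention (entry k counts sites with k derived copies, for k = 0..n) are assumptions made for illustration.

```python
import math

def tajimas_d(sfs):
    """Classical Tajima's D from an unfolded SFS of length n + 1."""
    n = len(sfs) - 1
    if n < 4:
        raise ValueError("the variance term vanishes for fewer than 4 samples")
    segregating = sfs[1:n]  # frequency classes k = 1..n-1
    S = sum(segregating)
    if S == 0:
        raise ValueError("no segregating sites")
    # Mean number of pairwise differences, computed from the spectrum.
    pairs = n * (n - 1) / 2
    pi = sum(count * k * (n - k) for k, count in enumerate(segregating, start=1)) / pairs
    # Normalizing constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference of two diversity estimators, scaled by its standard deviation.
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

A spectrum whose entries are exactly proportional to 1/k (the neutral, constant-size expectation) gives D = 0, while an excess of singletons, as produced by population expansion, pushes D negative. That sensitivity to demography is the confounding the abstract describes.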


Subjects
Genetic Models, Trees, Population Genetics, Likelihood Functions, Statistical Models, Genetic Selection

12.
J Am Stat Assoc ; 115(531): 1472-1487, 2020.
Article in English | MEDLINE | ID: mdl-33012903

ABSTRACT

The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We implement and release our method in a new open-source software package momi2.
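
To make the "histogram of allele counts" definition concrete, the observed single-population SFS can be tabulated from a matrix of derived-allele indicators. This sketch is not part of momi2 and says nothing about computing the expected SFS under a demographic model; the function name `sample_frequency_spectrum` and the toy data are assumptions made for illustration.

```python
def sample_frequency_spectrum(haplotypes):
    """Observed SFS of a 0/1 haplotype matrix (rows = haplotypes, columns = sites).

    Entry k of the result is the number of sites at which exactly k of the
    n haplotypes carry the derived allele (coded 1), for k = 0..n.
    """
    n = len(haplotypes)
    sfs = [0] * (n + 1)
    for site in zip(*haplotypes):  # iterate over columns (sites)
        sfs[sum(site)] += 1
    return sfs

# Hypothetical data: 4 haplotypes typed at 5 sites.
toy_haplotypes = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
toy_sfs = sample_frequency_spectrum(toy_haplotypes)  # [1, 3, 0, 0, 1]
```

In the toy matrix, one site is monomorphic for the ancestral allele, three sites are singletons, and one site is fixed for the derived allele. The multipopulation SFS discussed in the abstract generalizes this to a multidimensional array with one axis per population.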

13.
Proc Mach Learn Res ; 119: 7762-7771, 2020 Jul.
Article in English | MEDLINE | ID: mdl-34532709

ABSTRACT

A common workflow in data exploration is to learn a low-dimensional representation of the data, identify groups of points in that representation, and examine the differences between the groups to determine what they represent. We treat this workflow as an interpretable machine learning problem by leveraging the model that learned the low-dimensional representation to help identify the key differences between the groups. To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. TGT identifies the differences between each pair of groups using compressed sensing but constrains those pairwise differences to be consistent among all of the groups. Empirically, we demonstrate that TGT is able to identify explanations that accurately explain the model while being relatively sparse, and that these explanations match real patterns in the data.

14.
Bioinformatics ; 36(3): 974-975, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31400194

ABSTRACT

SUMMARY: Despite the availability of existing calculators for statistical power analysis in genetic association studies, there has not been a model-invariant and test-independent tool that allows for both planning of prospective studies and systematic review of reported findings. In this work, we develop a web-based application U-PASS (Unified Power analysis of ASsociation Studies), implementing a unified framework for the analysis of common association tests for binary qualitative traits. The application quantifies the shared asymptotic power limits of the common association tests, and visualizes the fundamental statistical trade-off between risk allele frequency and odds ratio. The application also addresses the applicability of asymptotics-based power calculations in finite samples, and provides guidelines for single-SNP-based association tests. In addition to designing prospective studies, U-PASS enables researchers to retrospectively assess the statistical validity of previously reported associations. AVAILABILITY AND IMPLEMENTATION: U-PASS is an open-source R Shiny application. A live instance is hosted at https://power.stat.lsa.umich.edu. Source is available on https://github.com/Pill-GZ/U-PASS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Software, Gene Frequency, Genetic Association Studies, Phenotype, Prospective Studies, Retrospective Studies

15.
Nat Genet ; 50(9): 1311-1317, 2018 09.
Article in English | MEDLINE | ID: mdl-30104759

ABSTRACT

Interest in reconstructing demographic histories has motivated the development of methods to estimate locus-specific pairwise coalescence times from whole-genome sequencing data. Here we introduce a powerful new method, ASMC, that can estimate coalescence times using only SNP array data, and is orders of magnitude faster than previous approaches. We applied ASMC to detect recent positive selection in 113,851 phased British samples from the UK Biobank, and detected 12 genome-wide significant signals, including 6 novel loci. We also applied ASMC to sequencing data from 498 Dutch individuals to detect background selection at deeper time scales. We detected strong heritability enrichment in regions of high background selection in an analysis of 20 independent diseases and complex traits using stratified linkage disequilibrium score regression, conditioned on a broad set of functional annotations (including other background selection annotations). These results underscore the widespread effects of background selection on the genetic architecture of complex traits.


Subjects
Disease/genetics, Linkage Disequilibrium/genetics, Genome-Wide Association Study/methods, High-Throughput Screening Assays/methods, Humans, Genetic Models, Molecular Sequence Annotation/methods, Multifactorial Inheritance/genetics, Single Nucleotide Polymorphism/genetics, Quantitative Trait Loci/genetics

16.
Curr Opin Genet Dev ; 53: 70-76, 2018 12.
Article in English | MEDLINE | ID: mdl-30056275

ABSTRACT

Studying how diverse human populations are related is of historical and anthropological interest, in addition to providing a realistic null model for testing for signatures of natural selection or disease associations. Furthermore, understanding the demographic histories of other species is playing an increasingly important role in conservation genetics. A number of statistical methods have been developed to infer population demographic histories using whole-genome sequence data, with recent advances focusing on allowing for more flexible modeling choices, scaling to larger data sets, and increasing statistical power. Here we review coalescent hidden Markov models, a powerful class of population genetic inference methods that can utilize linkage disequilibrium information effectively. We highlight recent advances, give advice for practitioners, point out potential pitfalls, and present possible future research directions.


Subjects
Molecular Evolution, Population Genetics, Genetic Selection/genetics, Human Genome/genetics, Humans, Markov Chains, Whole Genome Sequencing

17.
Nature ; 553(7687): 203-207, 2018 01 11.
Article in English | MEDLINE | ID: mdl-29323294

ABSTRACT

Despite broad agreement that the Americas were initially populated via Beringia, the land bridge that connected far northeast Asia with northwestern North America during the Pleistocene epoch, when and how the peopling of the Americas occurred remains unresolved. Analyses of human remains from Late Pleistocene Alaska are important to resolving the timing and dispersal of these populations. The remains of two infants were recovered at Upward Sun River (USR), and have been dated to around 11.5 thousand years ago (ka). Here, by sequencing the USR1 genome to an average coverage of approximately 17 times, we show that USR1 is most closely related to Native Americans, but falls basal to all previously sequenced contemporary and ancient Native Americans. As such, USR1 represents a distinct Ancient Beringian population. Using demographic modelling, we infer that the Ancient Beringian population and ancestors of other Native Americans descended from a single founding population that initially split from East Asians around 36 ± 1.5 ka, with gene flow persisting until around 25 ± 1.1 ka. Gene flow from ancient north Eurasians into all Native Americans took place 25-20 ka, with Ancient Beringians branching off around 22-18.1 ka. Our findings support a long-term genetic structure in ancestral Native Americans, consistent with the Beringian 'standstill model'. We show that the basal northern and southern Native American branches, to which all other Native Americans belong, diverged around 17.5-14.6 ka, and that this probably occurred south of the North American ice sheets. We also show that after 11.5 ka, some of the northern Native American populations received gene flow from a Siberian population most closely related to Koryaks, but not Palaeo-Eskimos, Inuits or Kets, and that Native American gene flow into Inuits was through northern and not southern Native American groups. Our findings further suggest that the far-northern North American presence of northern Native Americans is from a back migration that replaced or absorbed the initial founding population of Ancient Beringians.


Subjects
Founder Effect, Human Genome/genetics, North American Indians/genetics, Genetic Models, Phylogeny, Alaska, East Asia/ethnology, Gene Flow, Population Genetics, Ancient History, Human Migration, Humans, Infant, Rivers, Siberia/ethnology, Time Factors

18.
J Comput Graph Stat ; 26(1): 182-194, 2017.
Article in English | MEDLINE | ID: mdl-28239248

ABSTRACT

A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.

19.
Nat Genet ; 49(2): 303-309, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28024154

ABSTRACT

It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.


Subjects
Genome/genetics, Africa, Animals, Australia, Computer Simulation, Drosophila melanogaster/genetics, Equidae/genetics, Population Genetics/methods, Humans, Genetic Models, Population Density

20.
Proc Natl Acad Sci U S A ; 113(20): E2822-31, 2016 May 17.
Article in English | MEDLINE | ID: mdl-27140647

ABSTRACT

The genetic, epigenetic, and physiological differences among cells in clonal microbial colonies are underexplored opportunities for discovery. A recently developed genetic assay reveals that transient losses of heterochromatic repression, a heritable form of gene silencing, occur throughout the growth of Saccharomyces colonies. This assay requires analyzing two-color fluorescence patterns in yeast colonies, which is qualitatively appealing but quantitatively challenging. In this paper, we developed a suite of automated image processing, visualization, and classification algorithms (MORPHE) that facilitated the analysis of heterochromatin dynamics in the context of colonial growth and that can be broadly adapted to many colony-based assays in Saccharomyces and other microbes. Using the features that were automatically extracted from fluorescence images, our classification method distinguished loss-of-silencing patterns between mutants and wild type with unprecedented precision. Application of MORPHE revealed subtle but significant differences in the stability of heterochromatic repression between various environmental conditions, revealed that haploid cells experienced higher rates of silencing loss than diploids, and uncovered the unexpected contribution of a sirtuin to heterochromatin dynamics.


Subjects
Saccharomyces cerevisiae/metabolism, Algorithms, Biological Assay, Fungal Gene Expression Regulation, Gene Silencing, Reporter Genes, Green Fluorescent Proteins/biosynthesis, Green Fluorescent Proteins/genetics, Computer-Assisted Image Processing, Phenotype, Saccharomyces cerevisiae/genetics