Results 1 - 20 of 26
1.
Bioinformatics ; 2024 May 25.
Article in English | MEDLINE | ID: mdl-38796683

ABSTRACT

SUMMARY: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. AVAILABILITY AND IMPLEMENTATION: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
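
The idea of simulating a trait on a genealogy, rather than on an exported genotype matrix, can be illustrated with a small standard-library sketch. This is not the tstrait API: the toy tree, the placement of causal mutations, the effect sizes, and all function names are hypothetical, and the model (haploid samples, additive effects, Gaussian environmental noise scaled to a target heritability) is a deliberately simplified assumption.

```python
import random

# Hypothetical toy tree: child -> parent. Samples are nodes 0..3, root is 6.
PARENT = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
SAMPLES = [0, 1, 2, 3]


def genetic_values(parent, samples, mutations):
    """Sum the effect sizes of mutations on each sample's path to the root.

    `mutations` maps a node to the effect size of a causal mutation that
    sits on the branch directly above that node.
    """
    values = {}
    for s in samples:
        total, node = 0.0, s
        while node is not None:
            total += mutations.get(node, 0.0)
            node = parent.get(node)  # None once the root is passed
        values[s] = total
    return values


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def simulate_phenotypes(values, h2, rng):
    """Add Gaussian noise with variance Var(G) * (1 - h2) / h2."""
    var_g = variance(list(values.values()))
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    return {s: g + rng.gauss(0.0, sd_e) for s, g in values.items()}


# Assumed causal mutations: one above node 4, one above node 2.
MUTATIONS = {4: 0.5, 2: -1.0}
G = genetic_values(PARENT, SAMPLES, MUTATIONS)
P = simulate_phenotypes(G, h2=0.5, rng=random.Random(1))
```

Because each mutation is stored once on a branch and inherited by every sample below it, no per-sample genotype table is ever materialized, which is the property the abstract highlights for biobank-scale data.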

2.
bioRxiv ; 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38559118

ABSTRACT

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.

3.
bioRxiv ; 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38559261

ABSTRACT

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.

4.
Bioinform Adv ; 3(1): vbad163, 2023.
Article in English | MEDLINE | ID: mdl-38033661

ABSTRACT

Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.

5.
bioRxiv ; 2023 Nov 04.
Article in English | MEDLINE | ID: mdl-37961279

ABSTRACT

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). New developments have made it possible to infer ARGs at scale, enabling many new applications in population and statistical genetics. This rapid progress, however, has led to a substantial gap opening between theory and practice. Standard mathematical formalisms, based on exhaustively detailing the "events" that occur in the history of a sample, are insufficient to describe the outputs of current methods. Moreover, we argue that the underlying assumption that all events can be known and precisely estimated is fundamentally unsuited to the realities of modern, population-scale datasets. We propose an alternative mathematical formulation that encompasses the outputs of recent methods and can capture the full richness of modern large-scale datasets. By defining this ARG encoding in terms of specific genomes and their intervals of genetic inheritance, we avoid the need to exhaustively list (and estimate) all events. The effects of multiple events can be aggregated in different ways, providing a natural way to express many forms of approximate and partial knowledge about the recombinant ancestry of a sample.

6.
Elife ; 12, 2023 06 21.
Article in English | MEDLINE | ID: mdl-37342968

ABSTRACT

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.


Subjects
Genome , Software , Computer Simulation , Genetics, Population , Genomics
7.
Science ; 380(6647): 849-855, 2023 05 26.
Article in English | MEDLINE | ID: mdl-37228217

ABSTRACT

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlights a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.


Subjects
Datasets as Topic , Pedigree , Population , Humans , Alleles , Canada , Genetics, Population , Genotype , Quebec , France/ethnology , Population/genetics , Whole Genome Sequencing , Models, Genetic , Human Migration , Genetic Variation
8.
Genetics ; 222(3), 2022 11 01.
Article in English | MEDLINE | ID: mdl-36173327

ABSTRACT

Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provides a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.


Subjects
Genetics, Population , Software , Demography
9.
PLoS Comput Biol ; 18(3): e1009960, 2022 03.
Article in English | MEDLINE | ID: mdl-35263345

ABSTRACT

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.


Subjects
Models, Genetic , Software , Algorithms , Bayes Theorem , Markov Chains , Phylogeny , Recombination, Genetic/genetics
10.
Science ; 375(6583): eabi8264, 2022 02 25.
Article in English | MEDLINE | ID: mdl-35201891

ABSTRACT

The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.


Subjects
DNA, Ancient , Genome, Human , Genomics , Pedigree , Africa , Chromosomes, Human, Pair 20/genetics , Computer Simulation , Databases, Nucleic Acid , Datasets as Topic , Evolution, Molecular , Genetic Variation , Genetics, Population , Geography , Haplotypes , Human Migration , Humans , Mutation , Sequence Analysis, DNA , Spatio-Temporal Analysis , Statistics, Nonparametric
11.
Genetics ; 220(3), 2022 03 03.
Article in English | MEDLINE | ID: mdl-34897427

ABSTRACT

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.


Subjects
Algorithms , Models, Genetic , Computer Simulation , Genetics, Population , Mutation , Software
12.
Am J Hum Genet ; 107(4): 583-588, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33007197

ABSTRACT

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.


Subjects
Genetics, Population/trends , Genome, Human , Models, Genetic , Software , Algorithms , Computer Simulation , Demography , Genetic Variation , Genetics, Population/history , History, Ancient , Human Migration/history , Human Migration/statistics & numerical data , Humans
13.
Elife ; 9, 2020 06 23.
Article in English | MEDLINE | ID: mdl-32573438

ABSTRACT

The explosion in population genomic data demands ever more complex modes of analysis, and increasingly, these analyses depend on sophisticated simulations. Recent advances in population genetic simulation have made it possible to simulate large and complex models, but specifying such models for a particular simulation engine remains a difficult and error-prone task. Computational genetics researchers currently re-implement simulation models independently, leading to inconsistency and duplication of effort. This situation presents a major barrier to empirical researchers seeking to use simulations for power analyses of upcoming studies or sanity checks on existing genomic data. Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might be measured. Here, we describe a new resource, stdpopsim, that attempts to rectify this situation. Stdpopsim is a community-driven open source project, which provides easy access to a growing catalog of published simulation models from a range of organisms and supports multiple simulation engine backends. This resource is available as a well-documented Python library with a simple command-line interface. We share some examples demonstrating how stdpopsim can be used to systematically compare demographic inference methods, and we encourage a broader community of developers to contribute to this growing resource.


Subjects
Genetics, Population , Genomic Library , Models, Genetic , Animals , Arabidopsis/genetics , Dogs/genetics , Drosophila melanogaster/genetics , Escherichia coli/genetics , Genetics, Population/methods , Genetics, Population/organization & administration , Genome/genetics , Genome, Human/genetics , Humans , Pongo abelii/genetics
14.
PLoS Genet ; 16(5): e1008619, 2020 05.
Article in English | MEDLINE | ID: mdl-32369493

ABSTRACT

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
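
The discrete-generation ancestry model that this abstract contrasts with the coalescent can be sketched in a few lines. The sketch below is not msprime's implementation: it assumes a haploid population of constant size, ignores recombination, and uses hypothetical function and parameter names. Going back one generation at a time, every surviving lineage picks a parent uniformly at random, and lineages that pick the same parent merge.

```python
import random


def wright_fisher_tmrca(num_samples, pop_size, rng):
    """Generations back until all sampled lineages share one ancestor."""
    lineages = num_samples
    generations = 0
    while lineages > 1:
        # Each lineage picks a parent uniformly at random; lineages that
        # pick the same parent coalesce, so only distinct parents remain.
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        lineages = len(parents)
        generations += 1
    return generations


rng = random.Random(42)
times = [wright_fisher_tmrca(10, 100, rng) for _ in range(200)]
mean_tmrca = sum(times) / len(times)
```

Unlike the standard coalescent, nothing here prevents three or more lineages from merging in the same generation, which becomes common when the sample size is not small relative to the population size.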


Subjects
Algorithms , Base Sequence/physiology , Genetics, Population/methods , Genome-Wide Association Study/methods , Models, Genetic , Cohort Studies , Computer Simulation , Evolution, Molecular , Genome/genetics , Genome-Wide Association Study/statistics & numerical data , Humans , Linkage Disequilibrium , Recombination, Genetic/physiology , Sample Size
15.
Genetics ; 215(3): 779-797, 2020 07.
Article in English | MEDLINE | ID: mdl-32357960

ABSTRACT

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
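
The correspondence between site and branch statistics described above can be made concrete for mean pairwise diversity on a single tree. The sketch below is an illustration under stated assumptions, not the tskit implementation: the tree, the mutation counts, and the function names are hypothetical. Each sample carries weight 1, weights are accumulated up the tree, and a summary function turns the count k of samples below a branch into the fraction of sample pairs that the branch separates. Weighting by branch length gives the branch statistic; weighting by mutation counts gives the site statistic.

```python
# Hypothetical tree: child -> (parent, length of the branch above the child).
# Samples 0..3 are at time 0; internal nodes 4, 5, 6 are at times 1, 2, 3.
TREE = {0: (4, 1.0), 1: (4, 1.0), 2: (5, 2.0), 3: (6, 3.0),
        4: (5, 1.0), 5: (6, 1.0)}
SAMPLES = [0, 1, 2, 3]


def samples_below(tree, samples):
    """Accumulate a weight of 1 per sample at every node above it."""
    counts = {}
    for s in samples:
        node = s
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = tree[node][0] if node in tree else None
    return counts


def pair_fraction(k, n):
    """Fraction of sample pairs separated by a branch with k samples below."""
    return 2.0 * k * (n - k) / (n * (n - 1))


def branch_diversity(tree, samples):
    """Mean pairwise path length between samples, in units of time."""
    n = len(samples)
    counts = samples_below(tree, samples)
    return sum(length * pair_fraction(counts.get(child, 0), n)
               for child, (_, length) in tree.items())


def site_diversity(tree, samples, mutations):
    """Mean pairwise differences; `mutations` maps node -> mutation count
    on the branch above that node."""
    n = len(samples)
    counts = samples_below(tree, samples)
    return sum(m * pair_fraction(counts.get(node, 0), n)
               for node, m in mutations.items())


branch_pi = branch_diversity(TREE, SAMPLES)
site_pi = site_diversity(TREE, SAMPLES, {4: 1, 3: 2})
```

The two functions differ only in what multiplies the summary function, which is the sense in which the branch statistic is the expectation of the site statistic when mutations fall on branches in proportion to their length.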


Subjects
Genome, Human , Models, Genetic , Pedigree , Polymorphism, Genetic , Evolution, Molecular , Genetic Loci , Genetics, Population/methods , Genetics, Population/standards , Humans
16.
Methods Mol Biol ; 2090: 191-230, 2020.
Article in English | MEDLINE | ID: mdl-31975169

ABSTRACT

Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.
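
The general shape of an approximate Bayesian computation (ABC) rejection sampler of the kind mentioned above can be sketched with the standard library alone. This is not the chapter's code and does not call msprime: the simulator is a toy stand-in that draws the number of segregating sites from a Poisson distribution with the Watterson mean, and the flat prior, tolerance, and all names are assumptions made for illustration.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_segregating_sites(theta, num_samples, rng):
    """Toy simulator (assumption): S ~ Poisson(theta * sum_{i<n} 1/i)."""
    harmonic = sum(1.0 / i for i in range(1, num_samples))
    return poisson(theta * harmonic, rng)


def abc_rejection(observed, num_samples, num_draws, tolerance, rng):
    """Keep prior draws whose simulated statistic is close to the data."""
    accepted = []
    for _ in range(num_draws):
        theta = rng.uniform(0.0, 20.0)  # assumed flat prior on theta
        simulated = simulate_segregating_sites(theta, num_samples, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(7)
posterior = abc_rejection(observed=15, num_samples=10, num_draws=5000,
                          tolerance=1, rng=rng)
```

The accepted values approximate a posterior sample for theta; in a realistic analysis the toy simulator would be replaced by a full coalescent simulation and the single statistic by a vector of summaries.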


Subjects
Computational Biology/methods , Genetics, Population/methods , Algorithms , Bayes Theorem , Models, Genetic
17.
Nat Genet ; 51(11): 1660, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31591513

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

18.
Nat Genet ; 51(9): 1330-1338, 2019 09.
Article in English | MEDLINE | ID: mdl-31477934

ABSTRACT

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.


Subjects
Algorithms , Evolution, Molecular , Genetics, Population , Genome, Human , Pedigree , Selection, Genetic , Computer Simulation , Datasets as Topic , Haplotypes , Humans , Models, Genetic , Mutation , Polymorphism, Single Nucleotide , Population Density
19.
Bioinformatics ; 35(1): 119-121, 2019 01 01.
Article in English | MEDLINE | ID: mdl-29931085

ABSTRACT

Summary: Standardized interfaces for efficiently accessing high-throughput sequencing data are a fundamental requirement for large-scale genomic data sharing. We have developed htsget, a protocol for secure, efficient and reliable access to sequencing read and variation data. We demonstrate four independent client and server implementations, and the results of a comprehensive interoperability demonstration. Availability and implementation: http://samtools.github.io/hts-specs/htsget.html. Supplementary information: Supplementary data are available at Bioinformatics online.
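
A client for a protocol of this kind typically works in two steps: it requests a small JSON "ticket" listing where the data blocks live, then fetches the blocks and concatenates them in order. The sketch below covers only the second step and only inline `data:` URIs, so that it runs offline. The JSON layout (an `htsget` object with `format` and `urls`) is an assumption based on the published specification and should be checked against it; the ticket contents and function names are made up, and a real server would usually return HTTPS URLs with request headers instead.

```python
import base64
import json


def decode_data_uri(uri):
    """Decode a base64 `data:` URI into raw bytes."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("only base64 data: URIs are handled in this sketch")
    return base64.b64decode(payload)


def assemble_from_ticket(ticket_json):
    """Concatenate, in order, the blocks listed in an htsget-style ticket."""
    ticket = json.loads(ticket_json)["htsget"]
    blocks = [decode_data_uri(entry["url"]) for entry in ticket["urls"]]
    return ticket["format"], b"".join(blocks)


# Made-up ticket; the payloads are placeholders, not a valid BAM file.
EXAMPLE_TICKET = json.dumps({
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "data:application/octet-stream;base64,"
                    + base64.b64encode(b"header-").decode()},
            {"url": "data:application/octet-stream;base64,"
                    + base64.b64encode(b"body").decode()},
        ],
    }
})
fmt, data = assemble_from_ticket(EXAMPLE_TICKET)
```

Keeping the ticket separate from the data lets a server point clients at whatever storage holds each block, while the client logic stays a simple ordered concatenation.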


Subjects
Computational Biology , Genomics , High-Throughput Nucleotide Sequencing , Software , Genome
20.
Mol Ecol Resour ; 19(2): 552-566, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30565882

ABSTRACT

There is an increasing demand for evolutionary models to incorporate relatively realistic dynamics, ranging from selection at many genomic sites to complex demography, population structure, and ecological interactions. Such models can generally be implemented as individual-based forward simulations, but the large computational overhead of these models often makes simulation of whole chromosome sequences in large populations infeasible. This situation presents an important obstacle to the field that requires conceptual advances to overcome. The recently developed tree-sequence recording method (Kelleher, Thornton, Ashander, & Ralph, 2018), which stores the genealogical history of all genomes in the simulated population, could provide such an advance. This method has several benefits: (1) it allows neutral mutations to be omitted entirely from forward-time simulations and added later, thereby dramatically improving computational efficiency; (2) it allows neutral burn-in to be constructed extremely efficiently after the fact, using "recapitation"; (3) it allows direct examination and analysis of the genealogical trees along the genome; and (4) it provides a compact representation of a population's genealogy that can be analysed in Python using the msprime package. We have implemented the tree-sequence recording method in SLiM 3 (a free, open-source evolutionary simulation software package) and extended it to allow the recording of non-neutral mutations, greatly broadening the utility of this method. To demonstrate the versatility and performance of this approach, we showcase several practical applications that would have been beyond the reach of previously existing methods, opening up new horizons for the modelling and exploration of evolutionary processes.


Subjects
Biological Evolution , Genetics, Population/methods , Computational Biology , Computer Simulation , Software