Results 1 - 20 of 27
1.
Bioinformatics ; 40(6), 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38796683

ABSTRACT

SUMMARY: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. AVAILABILITY AND IMPLEMENTATION: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).


Subject(s)
Genome-Wide Association Study , Recombination, Genetic , Software , Genome-Wide Association Study/methods , Quantitative Trait Loci , Humans , Genetics, Population/methods , Phenotype , Genotype , Computer Simulation
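
The kind of simulation described in the tstrait abstract can be illustrated with a minimal standard-library sketch. It does not use the tstrait API and works on a plain genotype matrix rather than an ARG. It assumes an additive model: each individual's genetic value is the sum of effect sizes over the causal alleles it carries, and Gaussian environmental noise is scaled to reach a target heritability h2. The names simulate_phenotypes, genotypes, effect_sizes and h2 are hypothetical.

```python
import random
import statistics


def simulate_phenotypes(genotypes, effect_sizes, h2, seed=None):
    """Return (genetic_values, phenotypes) under an additive model.

    genotypes[i][j] is the number of causal alleles that individual i
    carries at causal site j; h2 is the target heritability, 0 < h2 <= 1.
    """
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    genetic = [
        sum(count * beta for count, beta in zip(row, effect_sizes))
        for row in genotypes
    ]
    var_g = statistics.pvariance(genetic)
    # Choose Var(E) so that Var(G) / (Var(G) + Var(E)) equals h2.
    sd_e = (var_g * (1 - h2) / h2) ** 0.5
    phenotypes = [value + rng.gauss(0.0, sd_e) for value in genetic]
    return genetic, phenotypes


# Four individuals, three causal sites (illustrative values only).
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0]]
effect_sizes = [0.5, -0.2, 0.1]
genetic, phenotypes = simulate_phenotypes(genotypes, effect_sizes, h2=0.5, seed=1)
```

Setting h2 to 1 removes the noise term, so the phenotypes equal the genetic values; lower values add proportionally more environmental variance.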
2.
Am J Hum Genet ; 107(4): 583-588, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33007197

ABSTRACT

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.


Subject(s)
Genetics, Population/trends , Genome, Human , Models, Genetic , Software , Algorithms , Computer Simulation , Demography , Genetic Variation , Genetics, Population/history , History, Ancient , Human Migration/history , Human Migration/statistics & numerical data , Humans
3.
PLoS Comput Biol ; 18(3): e1009960, 2022 03.
Article in English | MEDLINE | ID: mdl-35263345

ABSTRACT

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.


Subject(s)
Models, Genetic , Software , Algorithms , Bayes Theorem , Markov Chains , Phylogeny , Recombination, Genetic/genetics
4.
PLoS Genet ; 16(5): e1008619, 2020 05.
Article in English | MEDLINE | ID: mdl-32369493

ABSTRACT

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.


Subject(s)
Algorithms , Base Sequence/physiology , Genetics, Population/methods , Genome-Wide Association Study/methods , Models, Genetic , Cohort Studies , Computer Simulation , Evolution, Molecular , Genome/genetics , Genome-Wide Association Study/statistics & numerical data , Humans , Linkage Disequilibrium , Recombination, Genetic/physiology , Sample Size
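
To make the contrast with the coalescent concrete, the following minimal sketch traces sampled lineages backwards through a haploid Wright-Fisher population one generation at a time. It is not msprime's implementation and ignores recombination and diploidy; the function name and parameters are hypothetical. Unlike the standard coalescent, more than two lineages may merge in a single generation, which matters when the sample is large relative to the population.

```python
import random


def wright_fisher_tmrca(sample_size, population_size, seed=None):
    """Generations until the sampled lineages share one common ancestor."""
    if not 1 <= sample_size <= population_size:
        raise ValueError("need 1 <= sample_size <= population_size")
    rng = random.Random(seed)
    lineages = sample_size
    generations = 0
    while lineages > 1:
        # Each lineage picks a parent uniformly at random. The set keeps
        # distinct parents only, so lineages that pick the same parent merge.
        parents = {rng.randrange(population_size) for _ in range(lineages)}
        lineages = len(parents)
        generations += 1
    return generations


tmrca = wright_fisher_tmrca(sample_size=10, population_size=100, seed=42)
```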
5.
Bioinformatics ; 35(1): 119-121, 2019 01 01.
Article in English | MEDLINE | ID: mdl-29931085

ABSTRACT

Summary: Standardized interfaces for efficiently accessing high-throughput sequencing data are a fundamental requirement for large-scale genomic data sharing. We have developed htsget, a protocol for secure, efficient and reliable access to sequencing read and variation data. We demonstrate four independent client and server implementations, and the results of a comprehensive interoperability demonstration. Availability and implementation: http://samtools.github.io/hts-specs/htsget.html. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Genomics , High-Throughput Nucleotide Sequencing , Software , Genome
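
The client side of a protocol of this kind can be sketched as follows. The sketch assumes the ticket-style JSON response used by htsget, in which the server returns a list of URLs (optionally with per-URL headers) whose contents the client concatenates in order. The function names are hypothetical, and the example ticket uses only inline data: URIs so that it runs without a server.

```python
import base64
import json
import urllib.request


def fetch_htsget_blocks(ticket_json):
    """Download and concatenate the data blocks listed in a ticket."""
    ticket = json.loads(ticket_json)["htsget"]
    payload = bytearray()
    for block in ticket["urls"]:
        # Each block may carry its own headers (for example an auth token).
        request = urllib.request.Request(
            block["url"], headers=block.get("headers", {})
        )
        with urllib.request.urlopen(request) as response:
            payload += response.read()
    return ticket.get("format", "BAM"), bytes(payload)


def data_uri(raw):
    """Wrap raw bytes in a base64 data: URI (helper for the example only)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return "data:application/octet-stream;base64," + encoded


example_ticket = json.dumps(
    {
        "htsget": {
            "format": "BAM",
            "urls": [{"url": data_uri(b"header-")}, {"url": data_uri(b"records")}],
        }
    }
)
file_format, content = fetch_htsget_blocks(example_ticket)
```

Because the ticket separates "where the bytes are" from the bytes themselves, a server can point the client at several storage locations while the client logic stays a simple ordered concatenation.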
6.
PLoS Comput Biol ; 14(11): e1006581, 2018 11.
Article in English | MEDLINE | ID: mdl-30383757

ABSTRACT

In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly 'simplify' a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation's entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. 
The software for recording and processing tree sequences is named tskit.


Subject(s)
Computational Biology/methods , Genetic Variation , Genetics, Population , Software , Algorithms , Computer Simulation , Gene Frequency , Genome , Genotype , Humans , Models, Genetic , Pedigree , Polymorphism, Genetic
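
The record-then-simplify idea can be illustrated with a heavily reduced sketch: a forward-time haploid Wright-Fisher simulation without recombination records the parent of every newborn, and a pruning step then discards all history that is not ancestral to the final generation. This is not the tskit simplify algorithm (which also handles recombination and removes redundant intermediate nodes); all names are hypothetical.

```python
import random


def simulate_and_record(population_size, generations, seed=None):
    """Forward simulation that records every birth.

    Returns (parent_of, final_generation), where parent_of maps each
    child node ID to the node ID of its parent.
    """
    rng = random.Random(seed)
    parent_of = {}
    current = list(range(population_size))  # node IDs of the founders
    next_id = population_size
    for _ in range(generations):
        offspring = []
        for _ in range(population_size):
            parent_of[next_id] = rng.choice(current)
            offspring.append(next_id)
            next_id += 1
        current = offspring
    return parent_of, current


def prune_to_ancestors(parent_of, samples):
    """Keep only the edges on a path from some sample back to a founder."""
    kept = {}
    stack = list(samples)
    while stack:
        node = stack.pop()
        if node in parent_of and node not in kept:
            kept[node] = parent_of[node]
            stack.append(parent_of[node])
    return kept


parent_of, final_generation = simulate_and_record(50, 200, seed=3)
pruned = prune_to_ancestors(parent_of, final_generation)
```

The recorded history grows by one edge per birth, while the pruned history retains only lineages that left descendants, which is why periodic pruning keeps storage modest in long simulations.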
7.
PLoS Comput Biol ; 12(5): e1004842, 2016 05.
Article in English | MEDLINE | ID: mdl-27145223

ABSTRACT

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.


Subject(s)
Genetic Variation , Models, Genetic , Pedigree , Algorithms , Computational Biology , Computer Simulation , Evolution, Molecular , Genetics, Population , Humans , Recombination, Genetic , Sample Size
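
A coalescence record can be thought of as a statement that a set of child nodes share a parent node over a genomic interval. The sketch below assumes a record layout of (left, right, node, children, time) and shows how the marginal tree at a position is recovered from the records that cover it. The class and function names are hypothetical and this is not msprime's storage format.

```python
from typing import NamedTuple


class CoalescenceRecord(NamedTuple):
    left: float      # start of the genomic interval (inclusive)
    right: float     # end of the genomic interval (exclusive)
    node: int        # ancestral node in which the children coalesce
    children: tuple  # child nodes over this interval
    time: float      # time of the coalescence event


def tree_at(records, position):
    """Return {child: parent} for the marginal tree covering position."""
    parent = {}
    for record in records:
        if record.left <= position < record.right:
            for child in record.children:
                parent[child] = record.node
    return parent


# Three samples (0, 1, 2) on a genome of length 10 with one breakpoint at 4.
records = [
    CoalescenceRecord(0, 10, 3, (0, 1), 0.2),  # shared by both trees
    CoalescenceRecord(0, 4, 4, (2, 3), 0.7),   # root of the left tree
    CoalescenceRecord(4, 10, 5, (2, 3), 1.1),  # root of the right tree
]
left_tree = tree_at(records, 2)
right_tree = tree_at(records, 7)
```

The record for node 3 is stored once even though it appears in both trees, which is the shared structure between correlated trees that the abstract refers to.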
8.
Bioinformatics ; 29(7): 955-6, 2013 Apr 01.
Article in English | MEDLINE | ID: mdl-23391497

ABSTRACT

Coalescent simulation has become an indispensable tool in population genetics, and many complex evolutionary scenarios have been incorporated into the basic algorithm. Despite many years of intense interest in spatial structure, however, there are no available methods to simulate the ancestry of a sample of genes that occupy a spatial continuum. This is mainly due to the severe technical problems encountered by the classical model of isolation by distance. A recently introduced model solves these technical problems and provides a solid theoretical basis for the study of populations evolving in continuous space. We present a detailed algorithm to simulate the coalescent process in this model, and provide an efficient implementation of a generalized version of this algorithm as a freely available Python module. AVAILABILITY: Package available at http://pypi.python.org/pypi/ercs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Models, Genetic , Computer Simulation , Evolution, Molecular , Genes , Genetics, Population/methods , Pedigree , Software
9.
bioRxiv ; 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38559118

ABSTRACT

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.

10.
bioRxiv ; 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38559261

ABSTRACT

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.

11.
bioRxiv ; 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915693

ABSTRACT

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
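
The difference between row-wise and column-wise access can be shown with a small standard-library sketch. It is not Zarr and does not follow the VCF Zarr specification; it only assumes the general idea of storing each field as a sequence of independently compressed chunks, so that reading one field never touches the bytes of another. All names and the integer-only fields are hypothetical.

```python
import array
import zlib


def write_columns(records, fields, chunk_size):
    """Store each field as a list of independently compressed chunks."""
    store = {}
    for field in fields:
        values = array.array("i", (record[field] for record in records))
        store[field] = [
            zlib.compress(values[i:i + chunk_size].tobytes())
            for i in range(0, len(values), chunk_size)
        ]
    return store


def read_column(store, field):
    """Decompress only the chunks that belong to one field."""
    values = array.array("i")
    for chunk in store[field]:
        values.frombytes(zlib.decompress(chunk))
    return list(values)


records = [{"pos": 100 + i, "depth": i % 7} for i in range(10)]
store = write_columns(records, ["pos", "depth"], chunk_size=4)
depths = read_column(store, "depth")
```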

12.
BMC Bioinformatics ; 14: 356, 2013 Dec 05.
Article in English | MEDLINE | ID: mdl-24308302

ABSTRACT

BACKGROUND: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found. RESULTS: We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats. CONCLUSIONS: Wormtable's simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.


Subject(s)
Computational Biology/methods , Databases, Factual , Electronic Data Processing/methods , Genome, Human , Genomics/methods , Software/trends , Animals , Computer Simulation , Drosophila Proteins/genetics , Genome, Insect , Genomics/instrumentation , Humans , Libraries, Digital/trends , Random Allocation , Search Engine
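
The three properties named in the abstract (compact binary rows, random access, and column indexes) can be sketched with the standard library alone. This is not wormtable's file format or API; it assumes a hypothetical fixed-width schema of two 32-bit integers and one 64-bit float per row, and the function names are illustrative.

```python
import bisect
import struct

# Hypothetical schema: chrom (int32), pos (int32), qual (float64).
ROW = struct.Struct("<iid")


def pack_table(rows):
    """Encode rows as one contiguous buffer of fixed-width binary records."""
    return b"".join(ROW.pack(*row) for row in rows)


def get_row(table, index):
    """Random access: jump to row `index` without parsing earlier rows."""
    return ROW.unpack_from(table, index * ROW.size)


def build_index(table, column):
    """Sorted (value, row_number) pairs for one column."""
    row_count = len(table) // ROW.size
    return sorted((get_row(table, i)[column], i) for i in range(row_count))


def rows_in_range(table, index, low, high):
    """Rows whose indexed value v satisfies low <= v < high."""
    start = bisect.bisect_left(index, (low, -1))
    stop = bisect.bisect_left(index, (high, -1))
    return [get_row(table, i) for _, i in index[start:stop]]


rows = [(1, 500, 30.0), (1, 100, 50.5), (2, 300, 10.0), (1, 200, 99.0)]
table = pack_table(rows)
pos_index = build_index(table, column=1)
hits = rows_in_range(table, pos_index, 150, 400)
```

The range query uses binary search on the index and then reads only the matching rows, in contrast to a line-by-line scan of a text file.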
13.
Bioinform Adv ; 3(1): vbad163, 2023.
Article in English | MEDLINE | ID: mdl-38033661

ABSTRACT

Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.

14.
bioRxiv ; 2023 Nov 04.
Article in English | MEDLINE | ID: mdl-37961279

ABSTRACT

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). New developments have made it possible to infer ARGs at scale, enabling many new applications in population and statistical genetics. This rapid progress, however, has led to a substantial gap opening between theory and practice. Standard mathematical formalisms, based on exhaustively detailing the "events" that occur in the history of a sample, are insufficient to describe the outputs of current methods. Moreover, we argue that the underlying assumption that all events can be known and precisely estimated is fundamentally unsuited to the realities of modern, population-scale datasets. We propose an alternative mathematical formulation that encompasses the outputs of recent methods and can capture the full richness of modern large-scale datasets. By defining this ARG encoding in terms of specific genomes and their intervals of genetic inheritance, we avoid the need to exhaustively list (and estimate) all events. The effects of multiple events can be aggregated in different ways, providing a natural way to express many forms of approximate and partial knowledge about the recombinant ancestry of a sample.

15.
Science ; 380(6647): 849-855, 2023 05 26.
Article in English | MEDLINE | ID: mdl-37228217

ABSTRACT

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlight a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.


Subject(s)
Datasets as Topic , Pedigree , Population , Humans , Alleles , Canada , Genetics, Population , Genotype , Quebec , France/ethnology , Population/genetics , Whole Genome Sequencing , Models, Genetic , Human Migration , Genetic Variation
16.
Elife ; 12, 2023 06 21.
Article in English | MEDLINE | ID: mdl-37342968

ABSTRACT

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.


Subject(s)
Genome , Software , Computer Simulation , Genetics, Population , Genomics
17.
Science ; 375(6583): eabi8264, 2022 02 25.
Article in English | MEDLINE | ID: mdl-35201891

ABSTRACT

The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.


Subject(s)
DNA, Ancient , Genome, Human , Genomics , Pedigree , Africa , Chromosomes, Human, Pair 20/genetics , Computer Simulation , Databases, Nucleic Acid , Datasets as Topic , Evolution, Molecular , Genetic Variation , Genetics, Population , Geography , Haplotypes , Human Migration , Humans , Mutation , Sequence Analysis, DNA , Spatio-Temporal Analysis , Statistics, Nonparametric
18.
Genetics ; 222(3), 2022 11 01.
Article in English | MEDLINE | ID: mdl-36173327

ABSTRACT

Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provides a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.


Subject(s)
Genetics, Population , Software , Demography
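
A population described by a sequence of epochs can be sketched as a small data structure plus a lookup function. The dictionary below loosely follows the Demes data model (time measured backwards from the present, epochs listed from oldest to most recent, each with a start size, an end size and an end time) but is not validated against the specification and does not use any Demes parser. The function size_at and the example values are hypothetical, and size changes within an epoch are assumed to be exponential.

```python
import math

deme = {
    "name": "pop_a",
    "start_time": math.inf,  # the deme extends indefinitely into the past
    "epochs": [
        {"end_time": 1000, "start_size": 5000, "end_size": 5000},
        {"end_time": 0, "start_size": 500, "end_size": 20000},
    ],
}


def size_at(deme, time):
    """Deme size `time` units ago.

    An epoch covers end_time <= time < start_time, where its start_time is
    the end_time of the previous epoch (or the deme's own start_time).
    An epoch with an infinite start_time must have a constant size.
    """
    start_time = deme["start_time"]
    for epoch in deme["epochs"]:
        end_time = epoch["end_time"]
        if end_time <= time < start_time:
            n0, n1 = epoch["start_size"], epoch["end_size"]
            if n0 == n1:
                return float(n0)
            rate = math.log(n1 / n0) / (start_time - end_time)
            return n0 * math.exp(rate * (start_time - time))
        start_time = end_time
    raise ValueError("deme does not exist at this time")


ancient_size = size_at(deme, 2000)
present_size = size_at(deme, 0)
```

Here the deme holds 5000 individuals until 1000 time units ago, drops to 500, and then grows exponentially to 20000 at the present.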
19.
Genetics ; 220(3), 2022 03 03.
Article in English | MEDLINE | ID: mdl-34897427

ABSTRACT

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.


Subject(s)
Algorithms , Models, Genetic , Computer Simulation , Genetics, Population , Mutation , Software
20.
Methods Mol Biol ; 2090: 191-230, 2020.
Article in English | MEDLINE | ID: mdl-31975169

ABSTRACT

Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.


Subject(s)
Computational Biology/methods , Genetics, Population/methods , Algorithms , Bayes Theorem , Models, Genetic
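
The shape of an approximate Bayesian computation (ABC) rejection sampler can be sketched in a few tens of lines. The chapter builds its example on msprime; the sketch below instead uses a small standard-library stand-in simulator for a single non-recombining locus, with the number of segregating sites as the summary statistic and a uniform prior on the scaled mutation rate theta. All function names, the prior and the tolerance are assumptions made for illustration.

```python
import random


def simulate_segregating_sites(sample_size, theta, rng):
    """Number of segregating sites for one non-recombining locus.

    Time is in coalescent units, so k lineages coalesce at rate k(k-1)/2
    and mutations fall on the tree at rate theta/2 per unit branch length.
    """
    total_branch_length = 0.0
    for k in range(sample_size, 1, -1):
        total_branch_length += k * rng.expovariate(k * (k - 1) / 2)
    # Poisson draw: count unit-rate arrivals before the expected count.
    expected = theta / 2 * total_branch_length
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < expected:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def abc_rejection(observed, sample_size, prior_max, tolerance, draws, seed=None):
    """Keep prior draws of theta whose simulated summary is near the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.0, prior_max)
        simulated = simulate_segregating_sites(sample_size, theta, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


posterior_sample = abc_rejection(
    observed=12, sample_size=10, prior_max=20.0, tolerance=1, draws=5000, seed=7
)
```

The accepted values approximate the posterior distribution of theta; a smaller tolerance gives a closer approximation at the cost of fewer accepted draws.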