Results 1 - 20 of 28
1.
Genetics ; 2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39013109

ABSTRACT

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
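
The idea of defining an ARG through genomes and their intervals of inheritance can be made concrete with a small sketch. The one below is only an illustration under stated assumptions: the `Edge` record, its field names, and the toy data are hypothetical and are not taken from the paper. Each edge states that one genome inherits from another over a half-open genomic interval, and the local tree at a position is recovered by selecting the edges that cover it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One interval of genetic inheritance (hypothetical field names)."""
    left: float   # start of the inherited interval (inclusive)
    right: float  # end of the inherited interval (exclusive)
    parent: int   # ID of the ancestral genome
    child: int    # ID of the genome that inherits from it


def local_tree(edges, position):
    """Return {child: parent} for the tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def breakpoints(edges):
    """Return the sorted coordinates at which the local tree may change."""
    return sorted({e.left for e in edges} | {e.right for e in edges})


# Toy data (an assumption for illustration): genomes 0-2 are samples,
# genomes 3-5 are ancestors, and the genome spans [0, 10).
edges = [
    Edge(0, 5, 3, 0), Edge(0, 5, 3, 1), Edge(0, 5, 5, 3), Edge(0, 5, 5, 2),
    Edge(5, 10, 4, 1), Edge(5, 10, 4, 2), Edge(5, 10, 5, 4), Edge(5, 10, 5, 0),
]

print(local_tree(edges, 2))  # tree to the left of the breakpoint
print(local_tree(edges, 7))  # tree to the right of the breakpoint
```

In this toy data, genome 1 inherits from genome 3 on [0, 5) and from genome 4 on [5, 10). The recombination is therefore implied by the intervals and is never recorded as an explicit event.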

2.
bioRxiv ; 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915693

ABSTRACT

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
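
The contrast between row-wise and field-wise storage can be shown with plain Python lists. This is a rough sketch only: it does not use Zarr or the VCF Zarr specification, and the field names (`pos`, `qual`, `genotypes`) and the data are hypothetical. It shows why a query on a single field or a single sample is cheaper when each field is stored as its own array.

```python
# Row-wise layout: one record per variant with all fields together,
# analogous to a line of VCF text (field names are hypothetical).
rows = [
    {"pos": 101, "qual": 50.0, "genotypes": [0, 1, 1]},
    {"pos": 250, "qual": 31.5, "genotypes": [1, 1, 0]},
    {"pos": 377, "qual": 12.0, "genotypes": [0, 0, 1]},
]


def to_columns(rows):
    """Transpose row records into one array per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def mean_qual_rowwise(rows):
    """Every full record must be visited to read a single field."""
    return sum(row["qual"] for row in rows) / len(rows)


def mean_qual_columnar(columns):
    """Only the `qual` array is read; other fields are never touched."""
    qual = columns["qual"]
    return sum(qual) / len(qual)


def sample_genotypes(columns, sample_index):
    """Extract one sample's genotypes across all variants."""
    return [calls[sample_index] for calls in columns["genotypes"]]


columns = to_columns(rows)
print(mean_qual_columnar(columns))
print(sample_genotypes(columns, 1))
```

In a chunked array store the per-field arrays would also be compressed and split into chunks, so a query could read only the chunks it needs. That part is not modelled here.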

3.
Bioinformatics ; 40(6), 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38796683

ABSTRACT

SUMMARY: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. AVAILABILITY AND IMPLEMENTATION: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).


Subject(s)
Genome-Wide Association Study , Genetic Recombination , Software , Genome-Wide Association Study/methods , Quantitative Trait Loci , Humans , Population Genetics/methods , Phenotype , Genotype , Computer Simulation
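
For readers unfamiliar with quantitative trait simulation (entry 3), the following sketch shows a standard additive model: a genetic value is the sum of effect sizes weighted by allele counts, and environmental noise is scaled to reach a target heritability. It is an assumption-laden illustration, not tstrait's API or algorithm. In particular it works on an explicit genotype matrix, which is exactly the export step the abstract says tstrait avoids, and the function name and parameters are hypothetical.

```python
import random


def simulate_phenotypes(genotypes, effect_sizes, h2, seed=1):
    """Simulate phenotypes under an additive model (hypothetical helper).

    genotypes[i][j] is the allele count of individual i at causal site j,
    effect_sizes[j] is the effect of site j, and h2 is the narrow-sense
    heritability, with 0 < h2 <= 1.
    Returns (phenotypes, genetic_values).
    """
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    genetic = [
        sum(count * beta for count, beta in zip(row, effect_sizes))
        for row in genotypes
    ]
    mean = sum(genetic) / len(genetic)
    var_g = sum((g - mean) ** 2 for g in genetic) / len(genetic)
    # Environmental variance chosen so that var_g / (var_g + var_e) == h2.
    var_e = var_g * (1 - h2) / h2
    sd_e = var_e ** 0.5
    phenotypes = [g + rng.gauss(0, sd_e) for g in genetic]
    return phenotypes, genetic


genotypes = [[0, 1], [2, 0], [1, 1]]
phenotypes, genetic = simulate_phenotypes(genotypes, [0.5, -1.0], h2=0.3)
print(genetic)
```

Setting `h2=1.0` removes the noise term, so the phenotypes equal the genetic values. This is a convenient sanity check for the model.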
4.
bioRxiv ; 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38559118

ABSTRACT

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.

5.
bioRxiv ; 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38559261

ABSTRACT

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4,000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.

6.
bioRxiv ; 2023 Nov 04.
Article in English | MEDLINE | ID: mdl-37961279

ABSTRACT

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). New developments have made it possible to infer ARGs at scale, enabling many new applications in population and statistical genetics. This rapid progress, however, has led to a substantial gap opening between theory and practice. Standard mathematical formalisms, based on exhaustively detailing the "events" that occur in the history of a sample, are insufficient to describe the outputs of current methods. Moreover, we argue that the underlying assumption that all events can be known and precisely estimated is fundamentally unsuited to the realities of modern, population-scale datasets. We propose an alternative mathematical formulation that encompasses the outputs of recent methods and can capture the full richness of modern large-scale datasets. By defining this ARG encoding in terms of specific genomes and their intervals of genetic inheritance, we avoid the need to exhaustively list (and estimate) all events. The effects of multiple events can be aggregated in different ways, providing a natural way to express many forms of approximate and partial knowledge about the recombinant ancestry of a sample.

7.
Bioinform Adv ; 3(1): vbad163, 2023.
Article in English | MEDLINE | ID: mdl-38033661

ABSTRACT

Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.

8.
Elife ; 12, 2023 06 21.
Article in English | MEDLINE | ID: mdl-37342968

ABSTRACT

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.


Subject(s)
Genome , Software , Computer Simulation , Population Genetics , Genomics
9.
Science ; 380(6647): 849-855, 2023 05 26.
Article in English | MEDLINE | ID: mdl-37228217

ABSTRACT

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlights a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.


Subject(s)
Datasets as Topic , Pedigree , Population , Humans , Alleles , Canada , Population Genetics , Genotype , Quebec , France/ethnology , Population/genetics , Whole Genome Sequencing , Genetic Models , Human Migration , Genetic Variation
10.
Genetics ; 222(3), 2022 11 01.
Article in English | MEDLINE | ID: mdl-36173327

ABSTRACT

Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provides a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.


Subject(s)
Population Genetics , Software , Demography
11.
PLoS Comput Biol ; 18(3): e1009960, 2022 03.
Article in English | MEDLINE | ID: mdl-35263345

ABSTRACT

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.


Subject(s)
Genetic Models , Software , Algorithms , Bayes Theorem , Markov Chains , Phylogeny , Genetic Recombination/genetics
12.
Science ; 375(6583): eabi8264, 2022 02 25.
Article in English | MEDLINE | ID: mdl-35201891

ABSTRACT

The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.


Subject(s)
Ancient DNA , Human Genome , Genomics , Pedigree , Africa , Human Chromosome Pair 20/genetics , Computer Simulation , Nucleic Acid Databases , Datasets as Topic , Molecular Evolution , Genetic Variation , Population Genetics , Geography , Haplotypes , Human Migration , Humans , Mutation , DNA Sequence Analysis , Spatio-Temporal Analysis , Nonparametric Statistics
13.
Genetics ; 220(3), 2022 03 03.
Article in English | MEDLINE | ID: mdl-34897427

ABSTRACT

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.


Subject(s)
Algorithms , Genetic Models , Computer Simulation , Population Genetics , Mutation , Software
14.
Am J Hum Genet ; 107(4): 583-588, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33007197

ABSTRACT

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.


Subject(s)
Population Genetics/trends , Human Genome , Genetic Models , Software , Algorithms , Computer Simulation , Demography , Genetic Variation , Population Genetics/history , Ancient History , Human Migration/history , Human Migration/statistics & numerical data , Humans
15.
Elife ; 9, 2020 06 23.
Article in English | MEDLINE | ID: mdl-32573438

ABSTRACT

The explosion in population genomic data demands ever more complex modes of analysis, and increasingly, these analyses depend on sophisticated simulations. Recent advances in population genetic simulation have made it possible to simulate large and complex models, but specifying such models for a particular simulation engine remains a difficult and error-prone task. Computational genetics researchers currently re-implement simulation models independently, leading to inconsistency and duplication of effort. This situation presents a major barrier to empirical researchers seeking to use simulations for power analyses of upcoming studies or sanity checks on existing genomic data. Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might be measured. Here, we describe a new resource, stdpopsim, that attempts to rectify this situation. Stdpopsim is a community-driven open source project, which provides easy access to a growing catalog of published simulation models from a range of organisms and supports multiple simulation engine backends. This resource is available as a well-documented Python library with a simple command-line interface. We share some examples demonstrating how stdpopsim can be used to systematically compare demographic inference methods, and we encourage a broader community of developers to contribute to this growing resource.


Subject(s)
Population Genetics , Genomic Library , Genetic Models , Animals , Arabidopsis/genetics , Dogs/genetics , Drosophila melanogaster/genetics , Escherichia coli/genetics , Population Genetics/methods , Population Genetics/organization & administration , Genome/genetics , Human Genome/genetics , Humans , Pongo abelii/genetics
16.
Genetics ; 215(3): 779-797, 2020 07.
Article in English | MEDLINE | ID: mdl-32357960

ABSTRACT

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.


Subject(s)
Human Genome , Genetic Models , Pedigree , Genetic Polymorphism , Molecular Evolution , Genetic Loci , Population Genetics/methods , Population Genetics/standards , Humans
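
The weight-and-summary-function scheme described in entry 16 can be sketched for a single tree. The code below is a minimal illustration, assuming a tree given as a child-to-parent map. The function names and the toy tree are hypothetical, and the sketch does not reproduce the efficient tree-to-tree updates on which the paper's performance results rely. Sample weights are summed below each node, and a summary function turns the weight below a mutation into a site statistic, here nucleotide diversity.

```python
def node_weights(parent, sample_weights):
    """Sum sample weights over the subtree below every node."""
    total = {}
    for sample, weight in sample_weights.items():
        node = sample
        while node is not None:
            total[node] = total.get(node, 0) + weight
            node = parent.get(node)  # None once the root has been passed
    return total


def site_statistic(parent, sample_weights, mutation_node, summary):
    """Apply `summary` to the total weight below the mutated node."""
    return summary(node_weights(parent, sample_weights)[mutation_node])


def diversity_summary(n):
    """Fraction of the n*(n-1)/2 sample pairs that differ at a site
    where k samples carry the derived allele."""
    return lambda k: 2 * k * (n - k) / (n * (n - 1))


# Toy tree (an assumption): samples 0-3, internal nodes 4 and 5, root 6.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
weights = {0: 1, 1: 1, 2: 1, 3: 1}

# A mutation on the branch above node 4 is carried by samples 0 and 1.
print(site_statistic(parent, weights, 4, diversity_summary(4)))
```

Swapping in different weights (for example, membership of a subpopulation) and a different summary function gives other statistics from the same traversal. This is the generality the abstract refers to.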
17.
PLoS Genet ; 16(5): e1008619, 2020 05.
Article in English | MEDLINE | ID: mdl-32369493

ABSTRACT

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.


Subject(s)
Algorithms , Base Sequence/physiology , Population Genetics/methods , Genome-Wide Association Study/methods , Genetic Models , Cohort Studies , Computer Simulation , Molecular Evolution , Genome/genetics , Genome-Wide Association Study/statistics & numerical data , Humans , Linkage Disequilibrium , Genetic Recombination/physiology , Sample Size
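
To make the Wright-Fisher model in entry 17 concrete, here is a deliberately simplified backward-in-time sketch. It assumes a haploid population of constant size with no recombination, so it is not the msprime extension the abstract describes, and the function name is hypothetical. In each generation every remaining lineage picks a parent uniformly at random, and lineages that pick the same parent merge. Unlike the standard coalescent, any number of lineages may merge in one generation.

```python
import random


def wright_fisher_tmrca(num_samples, pop_size, seed=42):
    """Generations back to the most recent common ancestor of the samples
    under a haploid, constant-size Wright-Fisher model (toy sketch)."""
    rng = random.Random(seed)
    lineages = num_samples
    generations = 0
    while lineages > 1:
        generations += 1
        # Each lineage chooses its parent independently; lineages sharing
        # a parent coalesce, so only distinct parents remain.
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        lineages = len(parents)
    return generations


print(wright_fisher_tmrca(num_samples=10, pop_size=100))
```

When the sample size approaches the population size, many lineages merge in the same generation. This is the regime in which the abstract reports that classic coalescent assumptions break down.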
18.
Methods Mol Biol ; 2090: 191-230, 2020.
Article in English | MEDLINE | ID: mdl-31975169

ABSTRACT

Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.


Subject(s)
Computational Biology/methods , Population Genetics/methods , Algorithms , Bayes Theorem , Genetic Models
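
Entry 18 mentions an approximate Bayesian computation (ABC) method written in a few tens of lines. The sketch below shows the simplest rejection form of ABC. It is not the chapter's code: the `simulate` function is a toy stand-in (the mean of exponential draws) used in place of an msprime simulation, and all names and parameter values are assumptions for illustration.

```python
import random


def simulate(theta, rng, n=50):
    """Toy stand-in for a coalescent simulation: the mean of n exponential
    draws whose expected value is theta."""
    return sum(rng.expovariate(1 / theta) for _ in range(n)) / n


def abc_rejection(observed, prior_low, prior_high, tolerance, num_draws, seed=0):
    """Rejection ABC: draw theta from a uniform prior, simulate a summary
    statistic, and keep theta when it lies within `tolerance` of the
    observed value. The accepted draws approximate the posterior."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_draws):
        theta = rng.uniform(prior_low, prior_high)
        if abs(simulate(theta, rng) - observed) <= tolerance:
            accepted.append(theta)
    return accepted


posterior = abc_rejection(observed=2.0, prior_low=0.5, prior_high=5.0,
                          tolerance=0.2, num_draws=2000)
print(len(posterior))
```

A smaller tolerance gives a better approximation to the posterior but accepts fewer draws. That trade-off is the main tuning decision in this form of ABC.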
19.
Nat Genet ; 51(11): 1660, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31591513

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

20.
Nat Genet ; 51(9): 1330-1338, 2019 09.
Article in English | MEDLINE | ID: mdl-31477934

ABSTRACT

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.


Subject(s)
Algorithms , Molecular Evolution , Population Genetics , Human Genome , Pedigree , Genetic Selection , Computer Simulation , Datasets as Topic , Haplotypes , Humans , Genetic Models , Mutation , Single Nucleotide Polymorphism , Population Density