Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 27
Filtrar
Más filtros

Banco de datos
Tipo del documento
Intervalo de año de publicación
1.
Bioinformatics ; 40(6)2024 Jun 03.
Artículo en Inglés | MEDLINE | ID: mdl-38796683

RESUMEN

SUMMARY: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. AVAILABILITY AND IMPLEMENTATION: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).


Asunto(s)
Estudio de Asociación del Genoma Completo , Recombinación Genética , Programas Informáticos , Estudio de Asociación del Genoma Completo/métodos , Sitios de Carácter Cuantitativo , Humanos , Genética de Población/métodos , Fenotipo , Genotipo , Simulación por Computador
2.
Am J Hum Genet ; 107(4): 583-588, 2020 10 01.
Artículo en Inglés | MEDLINE | ID: mdl-33007197

RESUMEN

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.


Asunto(s)
Genética de Población/tendencias , Genoma Humano , Modelos Genéticos , Programas Informáticos , Algoritmos , Simulación por Computador , Demografía , Variación Genética , Genética de Población/historia , Historia Antigua , Migración Humana/historia , Migración Humana/estadística & datos numéricos , Humanos
3.
PLoS Comput Biol ; 18(3): e1009960, 2022 03.
Artículo en Inglés | MEDLINE | ID: mdl-35263345

RESUMEN

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.


Asunto(s)
Modelos Genéticos , Programas Informáticos , Algoritmos , Teorema de Bayes , Cadenas de Markov , Filogenia , Recombinación Genética/genética
4.
PLoS Genet ; 16(5): e1008619, 2020 05.
Artículo en Inglés | MEDLINE | ID: mdl-32369493

RESUMEN

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.


Asunto(s)
Algoritmos , Secuencia de Bases/fisiología , Genética de Población/métodos , Estudio de Asociación del Genoma Completo/métodos , Modelos Genéticos , Estudios de Cohortes , Simulación por Computador , Evolución Molecular , Genoma/genética , Estudio de Asociación del Genoma Completo/estadística & datos numéricos , Humanos , Desequilibrio de Ligamiento , Recombinación Genética/fisiología , Tamaño de la Muestra
5.
Bioinformatics ; 35(1): 119-121, 2019 01 01.
Artículo en Inglés | MEDLINE | ID: mdl-29931085

RESUMEN

Summary: Standardized interfaces for efficiently accessing high-throughput sequencing data are a fundamental requirement for large-scale genomic data sharing. We have developed htsget, a protocol for secure, efficient and reliable access to sequencing read and variation data. We demonstrate four independent client and server implementations, and the results of a comprehensive interoperability demonstration. Availability and implementation: http://samtools.github.io/hts-specs/htsget.html. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Biología Computacional , Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Programas Informáticos , Genoma
6.
PLoS Comput Biol ; 14(11): e1006581, 2018 11.
Artículo en Inglés | MEDLINE | ID: mdl-30383757

RESUMEN

In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly 'simplify' a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation's entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. The software for recording and processing tree sequences is named tskit.


Asunto(s)
Biología Computacional/métodos , Variación Genética , Genética de Población , Programas Informáticos , Algoritmos , Simulación por Computador , Frecuencia de los Genes , Genoma , Genotipo , Humanos , Modelos Genéticos , Linaje , Polimorfismo Genético
7.
PLoS Comput Biol ; 12(5): e1004842, 2016 05.
Artículo en Inglés | MEDLINE | ID: mdl-27145223

RESUMEN

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.


Asunto(s)
Variación Genética , Modelos Genéticos , Linaje , Algoritmos , Biología Computacional , Simulación por Computador , Evolución Molecular , Genética de Población , Humanos , Recombinación Genética , Tamaño de la Muestra
8.
Bioinformatics ; 29(7): 955-6, 2013 Apr 01.
Artículo en Inglés | MEDLINE | ID: mdl-23391497

RESUMEN

UNLABELLED: Coalescent simulation has become an indispensable tool in population genetics, and many complex evolutionary scenarios have been incorporated into the basic algorithm. Despite many years of intense interest in spatial structure, however, there are no available methods to simulate the ancestry of a sample of genes that occupy a spatial continuum. This is mainly due to the severe technical problems encountered by the classical model of isolation by distance. A recently introduced model solves these technical problems and provides a solid theoretical basis for the study of populations evolving in continuous space. We present a detailed algorithm to simulate the coalescent process in this model, and provide an efficient implementation of a generalized version of this algorithm as a freely available Python module. AVAILABILITY: Package available at http://pypi.python.org/pypi/ercs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Modelos Genéticos , Simulación por Computador , Evolución Molecular , Genes , Genética de Población/métodos , Linaje , Programas Informáticos
9.
bioRxiv ; 2024 Mar 14.
Artículo en Inglés | MEDLINE | ID: mdl-38559118

RESUMEN

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.

10.
bioRxiv ; 2024 Mar 13.
Artículo en Inglés | MEDLINE | ID: mdl-38559261

RESUMEN

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.

11.
bioRxiv ; 2024 Jun 12.
Artículo en Inglés | MEDLINE | ID: mdl-38915693

RESUMEN

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.

12.
BMC Bioinformatics ; 14: 356, 2013 Dec 05.
Artículo en Inglés | MEDLINE | ID: mdl-24308302

RESUMEN

BACKGROUND: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found. RESULTS: We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats. CONCLUSIONS: Wormtable's simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.


Asunto(s)
Biología Computacional/métodos , Bases de Datos Factuales , Procesamiento Automatizado de Datos/métodos , Genoma Humano , Genómica/métodos , Programas Informáticos/tendencias , Animales , Simulación por Computador , Proteínas de Drosophila/genética , Genoma de los Insectos , Genómica/instrumentación , Humanos , Bibliotecas Digitales/tendencias , Distribución Aleatoria , Motor de Búsqueda
13.
Bioinform Adv ; 3(1): vbad163, 2023.
Artículo en Inglés | MEDLINE | ID: mdl-38033661

RESUMEN

Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.

14.
bioRxiv ; 2023 Nov 04.
Artículo en Inglés | MEDLINE | ID: mdl-37961279

RESUMEN

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). New developments have made it possible to infer ARGs at scale, enabling many new applications in population and statistical genetics. This rapid progress, however, has led to a substantial gap opening between theory and practice. Standard mathematical formalisms, based on exhaustively detailing the "events" that occur in the history of a sample, are insufficient to describe the outputs of current methods. Moreover, we argue that the underlying assumption that all events can be known and precisely estimated is fundamentally unsuited to the realities of modern, population-scale datasets. We propose an alternative mathematical formulation that encompasses the outputs of recent methods and can capture the full richness of modern large-scale datasets. By defining this ARG encoding in terms of specific genomes and their intervals of genetic inheritance, we avoid the need to exhaustively list (and estimate) all events. The effects of multiple events can be aggregated in different ways, providing a natural way to express many forms of approximate and partial knowledge about the recombinant ancestry of a sample.

15.
Science ; 380(6647): 849-855, 2023 05 26.
Artículo en Inglés | MEDLINE | ID: mdl-37228217

RESUMEN

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlights a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.


Asunto(s)
Conjuntos de Datos como Asunto , Linaje , Población , Humanos , Alelos , Canadá , Genética de Población , Genotipo , Quebec , Francia/etnología , Población/genética , Secuenciación Completa del Genoma , Modelos Genéticos , Migración Humana , Variación Genética
16.
Elife ; 122023 06 21.
Artículo en Inglés | MEDLINE | ID: mdl-37342968

RESUMEN

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.


Asunto(s)
Genoma , Programas Informáticos , Simulación por Computador , Genética de Población , Genómica
17.
Science ; 375(6583): eabi8264, 2022 02 25.
Artículo en Inglés | MEDLINE | ID: mdl-35201891

RESUMEN

The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.


Asunto(s)
ADN Antiguo , Genoma Humano , Genómica , Linaje , África , Cromosomas Humanos Par 20/genética , Simulación por Computador , Bases de Datos de Ácidos Nucleicos , Conjuntos de Datos como Asunto , Evolución Molecular , Variación Genética , Genética de Población , Geografía , Haplotipos , Migración Humana , Humanos , Mutación , Análisis de Secuencia de ADN , Análisis Espacio-Temporal , Estadísticas no Paramétricas
18.
Genetics ; 222(3)2022 11 01.
Artículo en Inglés | MEDLINE | ID: mdl-36173327

RESUMEN

Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provide a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.


Asunto(s)
Genética de Población , Programas Informáticos , Demografía
19.
Genetics ; 220(3)2022 03 03.
Artículo en Inglés | MEDLINE | ID: mdl-34897427

RESUMEN

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.


Asunto(s)
Algoritmos , Modelos Genéticos , Simulación por Computador , Genética de Población , Mutación , Programas Informáticos
20.
Methods Mol Biol ; 2090: 191-230, 2020.
Artículo en Inglés | MEDLINE | ID: mdl-31975169

RESUMEN

Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.


Asunto(s)
Biología Computacional/métodos , Genética de Población/métodos , Algoritmos , Teorema de Bayes , Modelos Genéticos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA