Results 1 - 20 of 29
1.
Nature ; 625(7994): 321-328, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38200296

ABSTRACT

Multiple sclerosis (MS) is a neuro-inflammatory and neurodegenerative disease that is most prevalent in Northern Europe. Although it is known that inherited risk for MS is located within or in close proximity to immune-related genes, it is unknown when, where and how this genetic risk originated [1]. Here, by using a large ancient genome dataset from the Mesolithic period to the Bronze Age [2], along with new Medieval and post-Medieval genomes, we show that the genetic risk for MS rose among pastoralists from the Pontic steppe and was brought into Europe by the Yamnaya-related migration approximately 5,000 years ago. We further show that these MS-associated immunogenetic variants underwent positive selection both within the steppe population and later in Europe, probably driven by pathogenic challenges coinciding with changes in diet, lifestyle and population density. This study highlights the critical importance of the Neolithic period and Bronze Age as determinants of modern immune responses and their subsequent effect on the risk of developing MS in a changing environment.


Subjects
Genetic Predisposition to Disease, Human Genome, Grassland, Multiple Sclerosis, Humans, Datasets as Topic, Diet/ethnology, Diet/history, Europe/ethnology, Genetic Predisposition to Disease/history, Medical Genetics, 15th Century History, Ancient History, Medieval History, Human Migration/history, Life Style/ethnology, Life Style/history, Multiple Sclerosis/genetics, Multiple Sclerosis/history, Multiple Sclerosis/immunology, Neurodegenerative Diseases/genetics, Neurodegenerative Diseases/history, Neurodegenerative Diseases/immunology, Population Density
2.
Cell ; 157(4): 785-94, 2014 May 08.
Article in English | MEDLINE | ID: mdl-24813606

ABSTRACT

Polar bears are uniquely adapted to life in the High Arctic and have undergone drastic physiological changes in response to Arctic climates and a hyper-lipid diet of primarily marine mammal prey. We analyzed 89 complete genomes of polar bear and brown bear using population genomic modeling and show that the species diverged only 479-343 thousand years BP. We find that genes on the polar bear lineage have been under stronger positive selection than in brown bears; nine of the top 16 genes under strong positive selection are associated with cardiomyopathy and vascular disease, implying important reorganization of the cardiovascular system. One of the genes showing the strongest evidence of selection, APOB, encodes the primary lipoprotein component of low-density lipoprotein (LDL); functional mutations in APOB may explain how polar bears are able to cope with life-long elevated LDL levels that are associated with high risk of heart disease in humans.


Subjects
Biological Evolution, Ursidae/classification, Ursidae/genetics, Physiological Adaptation, Adipose Tissue/metabolism, Animals, Apolipoproteins B/chemistry, Apolipoproteins B/metabolism, Arctic Regions, Fatty Acids/metabolism, Gene Flow, Population Genetics, Genome, Ursidae/physiology
3.
Nature ; 600(7887): 86-92, 2021 12.
Article in English | MEDLINE | ID: mdl-34671161

ABSTRACT

During the last glacial-interglacial cycle, Arctic biotas experienced substantial climatic changes, yet the nature, extent and rate of their responses are not fully understood [1-8]. Here we report a large-scale environmental DNA metagenomic study of ancient plant and mammal communities, analysing 535 permafrost and lake sediment samples from across the Arctic spanning the past 50,000 years. Furthermore, we present 1,541 contemporary plant genome assemblies that were generated as reference sequences. Our study provides several insights into the long-term dynamics of the Arctic biota at the circumpolar and regional scales. Our key findings include: (1) a relatively homogeneous steppe-tundra flora dominated the Arctic during the Last Glacial Maximum, followed by regional divergence of vegetation during the Holocene epoch; (2) certain grazing animals consistently co-occurred in space and time; (3) humans appear to have been a minor factor in driving animal distributions; (4) higher effective precipitation, as well as an increase in the proportion of wetland plants, show negative effects on animal diversity; (5) the persistence of the steppe-tundra vegetation in northern Siberia enabled the late survival of several now-extinct megafauna species, including the woolly mammoth until 3.9 ± 0.2 thousand years ago (ka) and the woolly rhinoceros until 9.8 ± 0.2 ka; and (6) phylogenetic analysis of mammoth environmental DNA reveals a previously unsampled mitochondrial lineage. Our findings highlight the power of ancient environmental metagenomics analyses to advance understanding of population histories and long-term ecological dynamics.


Subjects
Biota, Ancient DNA/analysis, Environmental DNA/analysis, Metagenomics, Animals, Arctic Regions, Climate Change/history, Genetic Databases, Datasets as Topic, Biological Extinction, Geologic Sediments, Grassland, Greenland, Haplotypes/genetics, Herbivory/genetics, Ancient History, Humans, Lakes, Mammoths, Mitochondria/genetics, Perissodactyla, Permafrost, Phylogeny, Plants/genetics, Population Dynamics, Rain, Siberia, Spatio-Temporal Analysis, Wetlands
4.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36661298

ABSTRACT

SUMMARY: With the rapid expansion of the capabilities of the DNA sequencers throughout the different sequencing generations, the quantity of generated data has likewise increased. This evolution has also led to new bioinformatical methods, for which in silico data have become crucial when verifying the accuracy of a model or the robustness of a genomic analysis pipeline. Here, we present a multithreaded next-generation simulator for next-generation sequencing data (NGSNGS), which simulates reads faster than currently available methods and programs. NGSNGS can simulate reads with platform-specific characteristics based on nucleotide quality score profiles as well as including a post-mortem damage model which is relevant for simulating ancient DNA. The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies or even population haplotypes and allows the user to simulate known variable sites directly. The program is implemented in a multithreading framework and is factors faster than currently available tools while extending their feature set and possible output formats. AVAILABILITY AND IMPLEMENTATION: The method and associated programs are released as open-source software, code and user manual are available at https://github.com/RAHenriksen/NGSNGS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome, Software, Genomics, High-Throughput Nucleotide Sequencing/methods, Ancient DNA, DNA Sequence Analysis/methods
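
The sampling scheme described for NGSNGS in this record (reads drawn with replacement from a reference, plus a post-mortem damage model) can be illustrated with a toy sketch. This is not NGSNGS code and does not reflect its command-line interface; the function name `simulate_reads`, its parameters, and the geometric decay of C-to-T damage away from the 5' end are assumptions made for illustration only.

```python
import random

def simulate_reads(reference, n_reads, read_len, damage_rate=0.3, decay=0.5, seed=0):
    """Toy simulator (hypothetical helper, not the NGSNGS API)."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Sampling with replacement: every read picks its start independently.
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, base in enumerate(read):
            # Assumed damage model: C->T probability decays geometrically
            # with distance i from the 5' end of the read.
            if base == "C" and rng.random() < damage_rate * (decay ** i):
                read[i] = "T"
        reads.append("".join(read))
    return reads

reference = "ACGTCCGATCGATCCGTAGCTAGCCTAGGCTA"
reads = simulate_reads(reference, n_reads=5, read_len=10)
for r in reads:
    print(r)
```

Setting `damage_rate=0.0` returns unmodified substrings of the reference, which gives a simple way to check a downstream pipeline with and without simulated damage.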
5.
Mol Biol Evol ; 39(6), 2022 06 02.
Article in English | MEDLINE | ID: mdl-35647675

ABSTRACT

Commonly used methods for inferring phylogenies were designed before the emergence of high-throughput sequencing and can generally not accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters, while the uncertainty in genotype calling, which arises as a consequence of the sequencing technology, is ignored. In order to address this problem, we describe two new probabilistic approaches for estimating genetic distances: distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next-generation sequencing data, utilize the full information from the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behaviors than other currently available methods for estimating genetic distances, even for very low depth data with high error rates.


Subjects
Genome, High-Throughput Nucleotide Sequencing, Algorithms, Diploidy, Genotype, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism, DNA Sequence Analysis/methods, Software
6.
Bioinformatics ; 38(4): 1159-1161, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34718411

ABSTRACT

MOTIVATION: Inference of identity-by-descent (IBD) sharing along the genome between pairs of individuals has important uses. But all existing inference methods are based on genotypes, which is not ideal for low-depth Next Generation Sequencing (NGS) data from which genotypes can only be called with high uncertainty. RESULTS: We present a new probabilistic software tool, LocalNgsRelate, for inferring IBD sharing along the genome between pairs of individuals from low-depth NGS data. Its inference is based on genotype likelihoods instead of genotypes, and thereby it takes the uncertainty of the genotype calling into account. Using real data from the 1000 Genomes project, we show that LocalNgsRelate provides more accurate IBD inference for low-depth NGS data than two state-of-the-art genotype-based methods, Albrechtsen et al. (2009) and hap-IBD. We also show that the method works well for NGS data down to a depth of 2×. AVAILABILITY AND IMPLEMENTATION: LocalNgsRelate is freely available at https://github.com/idamoltke/LocalNgsRelate. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome, Software, Humans, Genotype, Probability, High-Throughput Nucleotide Sequencing, Single Nucleotide Polymorphism
10.
Mol Biol Evol ; 38(7): 2750-2766, 2021 06 25.
Article in English | MEDLINE | ID: mdl-33681996

ABSTRACT

The relative importance of introgression for diversification has long been a highly disputed topic in speciation research and remains an open question despite the great attention it has received over the past decade. Gene flow leaves traces in the genome similar to those created by incomplete lineage sorting (ILS), and identification and quantification of gene flow in the presence of ILS is challenging and requires knowledge about the true phylogenetic relationship among the species. We use whole nuclear, plastid, and organellar genomes from 12 species in the rapidly radiated, ecologically diverse, actively hybridizing genus of peatmoss (Sphagnum) to reconstruct the species phylogeny and quantify introgression using a suite of phylogenomic methods. We found extensive phylogenetic discordance among nuclear and organellar phylogenies, as well as across the nuclear genome and the nodes in the species tree, best explained by extensive ILS following the rapid radiation of the genus rather than by postspeciation introgression. Our analyses support the idea of ancient introgression among the ancestral lineages followed by ILS, whereas recent gene flow among the species is highly restricted despite widespread interspecific hybridization known in the group. Our results contribute to phylogenomic understanding of how speciation proceeds in rapidly radiated, actively hybridizing species groups, and demonstrate that employing a combination of diverse phylogenomic methods can facilitate untangling complex phylogenetic patterns created by ILS and introgression.


Subjects
Gene Flow, Genetic Introgression, Genetic Speciation, Phylogeny, Sphagnopsida/genetics, Plant Genome, Phylogeography
11.
Bioinformatics ; 36(3): 828-841, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31504166

ABSTRACT

MOTIVATION: The presence of present-day human contaminating DNA fragments is one of the challenges defining ancient DNA (aDNA) research. This is especially relevant to the ancient human DNA field where it is difficult to distinguish endogenous molecules from human contaminants due to their genetic similarity. Recently, with the advent of high-throughput sequencing and new aDNA protocols, hundreds of ancient human genomes have become available. Contamination in those genomes has been measured with computational methods often developed specifically for these empirical studies. Consequently, some of these methods have not been implemented and tested for general use while few are aimed at low-depth nuclear data, a common feature in aDNA datasets. RESULTS: We develop a new X-chromosome-based maximum likelihood method for estimating present-day human contamination in low-depth sequencing data from male individuals. We implement our method for general use, assess its performance under conditions typical of ancient human DNA research, and compare it to previous nuclear data-based methods through extensive simulations. For low-depth data, we show that existing methods can produce unusable estimates or substantially underestimate contamination. In contrast, our method provides accurate estimates for a depth of coverage as low as 0.5× on the X-chromosome when contamination is below 25%. Moreover, our method still yields meaningful estimates in very challenging situations, i.e. when the contaminant and the target come from closely related populations or with increased error rates. With a running time below 5 min, our method is applicable to large scale aDNA genomic studies. AVAILABILITY AND IMPLEMENTATION: The method is implemented in C++ and R and is available in github.com/sapfo/contaminationX and popgen.dk/angsd.


Subjects
Ancient DNA, High-Throughput Nucleotide Sequencing, Chromosomes, Humans, Likelihood Functions, Male, DNA Sequence Analysis
12.
Nature ; 523(7561): 455-458, 2015 Jul 23.
Article in English | MEDLINE | ID: mdl-26087396

ABSTRACT

Kennewick Man, referred to as the Ancient One by Native Americans, is a male human skeleton discovered in Washington state (USA) in 1996 and initially radiocarbon dated to 8,340-9,200 calibrated years before present (BP). His population affinities have been the subject of scientific debate and legal controversy. Based on an initial study of cranial morphology it was asserted that Kennewick Man was neither Native American nor closely related to the claimant Plateau tribes of the Pacific Northwest, who claimed ancestral relationship and requested repatriation under the Native American Graves Protection and Repatriation Act (NAGPRA). The morphological analysis was important to judicial decisions that Kennewick Man was not Native American and that therefore NAGPRA did not apply. Instead of repatriation, additional studies of the remains were permitted. Subsequent craniometric analysis affirmed Kennewick Man to be more closely related to circumpacific groups such as the Ainu and Polynesians than he is to modern Native Americans. In order to resolve Kennewick Man's ancestry and affiliations, we have sequenced his genome to ∼1× coverage and compared it to worldwide genomic data including for the Ainu and Polynesians. We find that Kennewick Man is closer to modern Native Americans than to any other population worldwide. Among the Native American groups for whom genome-wide data are available for comparison, several seem to be descended from a population closely related to that of Kennewick Man, including the Confederated Tribes of the Colville Reservation (Colville), one of the five tribes claiming Kennewick Man. We revisit the cranial analyses and find that, as opposed to genome-wide comparisons, it is not possible on that basis to affiliate Kennewick Man to specific contemporary groups. We therefore conclude based on genetic comparisons that Kennewick Man shows continuity with Native North Americans over at least the last eight millennia.


Subjects
North American Indians/genetics, Phylogeny, Skeleton, Americas, Human Genome/genetics, Genomics, Humans, Male, Skull/anatomy & histology, Washington
13.
Heredity (Edinb) ; 125(1-2): 15-27, 2020 08.
Article in English | MEDLINE | ID: mdl-32346130

ABSTRACT

Populations of the common chimpanzee (Pan troglodytes) are at imminent risk of extinction in the wild as a consequence of damaging anthropogenic impact on their natural habitat and the illegal pet and bushmeat trade. Conservation management programmes for the chimpanzee have been established outside their natural range (ex situ), and chimpanzees from these programmes could potentially be used to supplement future conservation initiatives in the wild (in situ). However, these programmes have often suffered from inadequate information about the geographical origin and subspecies ancestry of the founders. Here, we present a newly designed capture array with ~60,000 ancestry informative markers used to infer ancestry of individual chimpanzees in ex situ populations and determine geographical origin of confiscated sanctuary individuals. From a test panel of 167 chimpanzees with unknown origins or subspecies labels, we identify 90 suitable non-admixed individuals in the European Association of Zoos and Aquaria (EAZA) Ex situ Programme (EEP). Equally important, another 46 individuals have been identified with admixed subspecies ancestries, which should therefore, over time, be naturally phased out of the breeding populations. With potential for future re-introduction to the wild, we determine the geographical origin of 31 individuals that were confiscated from the illegal trade and demonstrate the promises of using non-invasive sampling in future conservation action plans. Collectively, our genomic approach provides an exemplar for ex situ management of endangered species and offers an efficient tool in future in situ efforts to combat the illegal wildlife trade.


Subjects
Conservation of Natural Resources, Endangered Species, Pan troglodytes, Animals, Ecosystem, Pan troglodytes/genetics
14.
Nature ; 506(7487): 225-9, 2014 Feb 13.
Article in English | MEDLINE | ID: mdl-24522598

ABSTRACT

Clovis, with its distinctive biface, blade and osseous technologies, is the oldest widespread archaeological complex defined in North America, dating from 11,100 to 10,700 ¹⁴C years before present (bp) (13,000 to 12,600 calendar years bp). Nearly 50 years of archaeological research point to the Clovis complex as having developed south of the North American ice sheets from an ancestral technology. However, both the origins and the genetic legacy of the people who manufactured Clovis tools remain under debate. It is generally believed that these people ultimately derived from Asia and were directly related to contemporary Native Americans. An alternative, Solutrean, hypothesis posits that the Clovis predecessors emigrated from southwestern Europe during the Last Glacial Maximum. Here we report the genome sequence of a male infant (Anzick-1) recovered from the Anzick burial site in western Montana. The human bones date to 10,705 ± 35 ¹⁴C years bp (approximately 12,707-12,556 calendar years bp) and were directly associated with Clovis tools. We sequenced the genome to an average depth of 14.4× and show that the gene flow from the Siberian Upper Palaeolithic Mal'ta population into Native American ancestors is also shared by the Anzick-1 individual and thus happened before 12,600 years bp. We also show that the Anzick-1 individual is more closely related to all indigenous American populations than to any other group. Our data are compatible with the hypothesis that Anzick-1 belonged to a population directly ancestral to many contemporary Native Americans. Finally, we find evidence of a deep divergence in Native American populations that predates the Anzick-1 individual.


Subjects
Human Genome/genetics, North American Indians/genetics, Phylogeny, Archaeology, Asia/ethnology, Bone and Bones, Burial, Human Y Chromosomes/genetics, Mitochondrial DNA/genetics, Emigration and Immigration/history, Europe/ethnology, Gene Flow/genetics, Haplotypes/genetics, Ancient History, Humans, Infant, Male, Genetic Models, Molecular Sequence Data, Montana, Population Dynamics, Radiometric Dating
15.
Genome Res ; 25(4): 459-66, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25770088

ABSTRACT

It is commonly thought that human genetic diversity in non-African populations was shaped primarily by an out-of-Africa dispersal 50-100 thousand yr ago (kya). Here, we present a study of 456 geographically diverse high-coverage Y chromosome sequences, including 299 newly reported samples. Applying ancient DNA calibration, we date the Y-chromosomal most recent common ancestor (MRCA) in Africa at 254 (95% CI 192-307) kya and detect a cluster of major non-African founder haplogroups in a narrow time interval at 47-52 kya, consistent with a rapid initial colonization model of Eurasia and Oceania after the out-of-Africa bottleneck. In contrast to demographic reconstructions based on mtDNA, we infer a second strong bottleneck in Y-chromosome lineages dating to the last 10 ky. We hypothesize that this bottleneck is caused by cultural changes affecting variance of reproductive success among males.


Subjects
Human Y Chromosomes/genetics, Molecular Evolution, Racial Groups/genetics, Base Sequence, Mitochondrial DNA/genetics, Genetic Variation/genetics, Population Genetics, Haplotypes/genetics, Humans, Male, Genetic Models, Phylogeny, DNA Sequence Analysis
16.
Bioinformatics ; 31(24): 4009-11, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26323718

ABSTRACT

MOTIVATION: Pairwise relatedness estimation is important in many contexts such as disease mapping and population genetics. However, all existing estimation methods are based on called genotypes, which is not ideal for next-generation sequencing (NGS) data of low depth from which genotypes cannot be called with high certainty. RESULTS: We present a software tool, NgsRelate, for estimating pairwise relatedness from NGS data. It provides maximum likelihood estimates that are based on genotype likelihoods instead of genotypes and thereby takes the inherent uncertainty of the genotypes into account. Using both simulated and real data, we show that NgsRelate provides markedly better estimates for low-depth NGS data than two state-of-the-art genotype-based methods. AVAILABILITY: NgsRelate is implemented in C++ and is available under the GNU license at www.popgen.dk/software.


Subjects
High-Throughput Nucleotide Sequencing/methods, Software, Genotyping Techniques, Humans, Likelihood Functions
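
The idea in this record, maximum likelihood estimation of pairwise relatedness from genotype likelihoods rather than called genotypes, can be sketched as an EM over the three IBD-sharing proportions (k0, k1, k2). This is a minimal illustration and not NgsRelate's implementation: it assumes biallelic sites, known allele frequencies, independent sites and no inbreeding, and the names `joint_genotype_prob` and `estimate_k` are hypothetical.

```python
from itertools import product

def joint_genotype_prob(f, ibd):
    """P(g1, g2 | IBD state) at a biallelic site with alternate-allele frequency f."""
    p = (1 - f, f)
    table = [[0.0] * 3 for _ in range(3)]
    if ibd == 0:    # no shared alleles: four independent allele draws
        for a, b, c, d in product((0, 1), repeat=4):
            table[a + b][c + d] += p[a] * p[b] * p[c] * p[d]
    elif ibd == 1:  # allele a is shared; b and c are drawn independently
        for a, b, c in product((0, 1), repeat=3):
            table[a + b][a + c] += p[a] * p[b] * p[c]
    else:           # both alleles shared: identical genotypes
        for a, b in product((0, 1), repeat=2):
            table[a + b][a + b] += p[a] * p[b]
    return table

def estimate_k(gl1, gl2, freqs, n_iter=200):
    """EM for (k0, k1, k2); gl*[s][g] = P(reads at site s | genotype g)."""
    site_lik = []
    for l1, l2, f in zip(gl1, gl2, freqs):
        row = []
        for z in range(3):
            t = joint_genotype_prob(f, z)
            # Sum over all genotype pairs instead of calling genotypes.
            row.append(sum(l1[i] * l2[j] * t[i][j]
                           for i in range(3) for j in range(3)))
        site_lik.append(row)
    k = [1 / 3] * 3
    for _ in range(n_iter):
        post_sum = [0.0] * 3
        for row in site_lik:
            w = [k[z] * row[z] for z in range(3)]
            total = sum(w)
            for z in range(3):
                post_sum[z] += w[z] / total
        k = [s / len(site_lik) for s in post_sum]
    return k

# Toy input: two samples with identical, fully certain genotypes.
certain = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]}
genos = [0, 1, 2, 1, 0, 2, 1, 1]
gl = [certain[g] for g in genos]
k = estimate_k(gl, gl, [0.5] * len(genos))
print(k)
```

With real low-depth data the genotype likelihood vectors would be diffuse rather than one-hot, which is exactly the case where summing over genotypes matters.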
17.
BMC Bioinformatics ; 15: 356, 2014 Nov 25.
Article in English | MEDLINE | ID: mdl-25420514

ABSTRACT

BACKGROUND: High-throughput DNA sequencing technologies are generating vast amounts of data. Fast, flexible and memory efficient implementations are needed in order to facilitate analyses of thousands of samples simultaneously. RESULTS: We present a multithreaded program suite called ANGSD. This program can calculate various summary statistics, and perform association mapping and population genetic analyses utilizing the full information in next generation sequencing data by working directly on the raw sequencing data or by using genotype likelihoods. CONCLUSIONS: The open source C/C++ program ANGSD is available at http://www.popgen.dk/angsd. The program is tested and validated on GNU/Linux systems. The program facilitates multiple input formats including BAM and imputed beagle genotype probability files. The program allows the user to choose between combinations of existing methods and can perform analyses that are not implemented elsewhere.


Subjects
High-Throughput Nucleotide Sequencing/methods, Software, Gene Frequency, Population Genetics/methods, Genotype, Likelihood Functions, Single Nucleotide Polymorphism
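
Several records in this list rely on genotype likelihoods, so a small worked sketch of what one looks like may help. The code below uses one common, simple error model (each read samples one of the two alleles with probability 1/2, and a base with Phred quality Q is wrong with probability 10^(-Q/10), spread evenly over the three other bases). It is not taken from ANGSD; the function name `genotype_log_likelihoods` and the model choice are assumptions for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"

def genotype_log_likelihoods(bases, quals):
    """Log P(observed bases | genotype) for all 10 diploid genotypes."""
    gl = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        ll = 0.0
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10)                   # Phred score -> error probability
            p1 = 1 - e if b == a1 else e / 3      # read came from allele a1
            p2 = 1 - e if b == a2 else e / 3      # read came from allele a2
            ll += math.log(0.5 * p1 + 0.5 * p2)
        gl[a1 + a2] = ll
    return gl

# Four reads covering one site: three A's and one G, all at Q30.
gl = genotype_log_likelihoods("AAAG", [30, 30, 30, 30])
best = max(gl, key=gl.get)
print(best, round(gl[best], 3))
```

Downstream methods keep the whole vector `gl` rather than only `best`, which is how the uncertainty at low depth is carried forward.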
18.
BMC Bioinformatics ; 14: 289, 2013 Oct 02.
Article in English | MEDLINE | ID: mdl-24088262

ABSTRACT

BACKGROUND: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. RESULTS: We have developed an approach that accommodates the uncertainty of the data when calculating site-frequency-based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data, and avoids the need to infer variable sites for the analysis, thereby avoiding ascertainment problems introduced by a SNP discovery process. CONCLUSION: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 genomes project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.


Subjects
Population Genetics/statistics & numerical data, DNA Sequence Analysis/methods, Base Sequence, Bayes Theorem, Bias, Feasibility Studies, Gene Frequency/genetics, Genetic Variation, Human Genome, Genotype, High-Throughput Nucleotide Sequencing/statistics & numerical data, Humans, Likelihood Functions, Single Nucleotide Polymorphism, Genetic Selection/genetics
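
For readers unfamiliar with the statistic named in this record, the sketch below computes Tajima's D from an unfolded site frequency spectrum using the standard constants from Tajima (1989). It assumes the spectrum is already known; it does not implement the record's empirical Bayes handling of genotype uncertainty, and the function name `tajimas_d` is hypothetical.

```python
import math

def tajimas_d(sfs, n):
    """Tajima's D; sfs[i-1] = number of sites with i derived alleles (i = 1..n-1)."""
    S = sum(sfs)  # number of segregating sites
    # Mean pairwise differences computed from the spectrum.
    pi = sum(x * 2 * i * (n - i) / (n * (n - 1))
             for i, x in enumerate(sfs, start=1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two theta estimators, scaled by its standard deviation.
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# An excess of singletons (e.g. after population growth) pushes D below zero.
print(tajimas_d([20, 0, 0, 0, 0, 0, 0, 0, 0], n=10))
```

A spectrum proportional to 1/i, the neutral expectation, makes the two estimators agree and gives D of zero up to rounding.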
19.
Genet Epidemiol ; 36(5): 430-7, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22570057

ABSTRACT

The advances in sequencing technology have made large-scale sequencing studies for large cohorts feasible. Often, the primary goal for large-scale studies is to identify genetic variants associated with a disease or other phenotypes. Even when deep sequencing is performed, there will be many sites where there is not enough data to call genotypes accurately. Ignoring the genotype classification uncertainty by basing subsequent analyses on called genotypes leads to a loss in power. Additionally, using called genotypes can lead to spurious association signals. Some methods taking the uncertainty of genotype calls into account have been proposed; most require numerical optimization which for large-scale data is not always computationally feasible. We show that using a score statistic for the joint likelihood of observed phenotypes and observed sequencing data provides an attractive approach to association testing for next-generation sequencing data. The joint model accounts for the genotype classification uncertainty via the posterior probabilities of the genotypes given the observed sequencing data, which gives the approach higher power than methods based on called genotypes. This strategy remains computationally feasible due to the use of score statistics. As part of the joint likelihood, we model the distribution of the phenotypes using a generalized linear model framework, which works for both quantitative and discrete phenotypes. Thus, the method presented here is applicable to case-control studies as well as mapping of quantitative traits. The model allows additional covariates that enable correction for confounding factors such as population stratification or cohort effects.


Subjects
Genetic Models, Single Nucleotide Polymorphism, Algorithms, Alleles, Case-Control Studies, False Positive Reactions, Genotype, Humans, Likelihood Functions, Statistical Models, Molecular Epidemiology/methods, Phenotype, Probability, Reproducibility of Results
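
A heavily simplified sketch of the idea in this record: carry genotype uncertainty into an association test through posterior genotype probabilities, and use a score statistic so that no numerical optimization is needed. The version below replaces each genotype with its posterior expected dosage and tests a quantitative trait without covariates. It is not the statistic derived in the paper, which comes from the joint likelihood of phenotypes and sequencing data; `expected_dosages` and `score_statistic` are hypothetical names.

```python
def expected_dosages(posteriors):
    """E[G | data] from per-sample posteriors (P(G=0), P(G=1), P(G=2))."""
    return [p1 + 2 * p2 for _, p1, p2 in posteriors]

def score_statistic(y, posteriors):
    """Dosage-based score statistic; roughly chi-square with 1 df under the null."""
    d = expected_dosages(posteriors)
    n = len(y)
    y_bar = sum(y) / n
    d_bar = sum(d) / n
    # Residual variance estimated under the null model (intercept only).
    sigma2 = sum((yi - y_bar) ** 2 for yi in y) / n
    # Score: derivative of the log-likelihood with respect to the genetic effect at 0.
    u = sum((yi - y_bar) * di for yi, di in zip(y, d))
    var_u = sigma2 * sum((di - d_bar) ** 2 for di in d)
    return u * u / var_u

posteriors = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.2, 0.8), (0.2, 0.7, 0.1)]
phenotype = [0.1, 1.2, 2.3, 0.9]
print(score_statistic(phenotype, posteriors))
```

Because only the null model is fitted, the same residuals can be reused across every site, which is what keeps this kind of test cheap at genome scale.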
20.
Gigascience ; 11, 2022 05 17.
Article in English | MEDLINE | ID: mdl-35579549

ABSTRACT

BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.


Subjects
Population Genetics, High-Throughput Nucleotide Sequencing, Gene Frequency, Genotype, High-Throughput Nucleotide Sequencing/methods, Humans, Likelihood Functions, Single Nucleotide Polymorphism, DNA Sequence Analysis/methods
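
The likelihood maximization described in this record can be illustrated in its simplest form: a one-population spectrum estimated by plain EM from per-site sample allele frequency likelihoods. The multidimensional spectra, binning heuristic and accelerated EM mentioned in the abstract are not reproduced here, and the name `em_sfs` and the input layout are assumptions for illustration.

```python
def em_sfs(site_likelihoods, n_iter=100):
    """EM estimate of the SFS.

    site_likelihoods[s][j] = P(data at site s | j derived alleles in the sample),
    assumed to have been computed beforehand from genotype likelihoods.
    """
    n_cat = len(site_likelihoods[0])
    sfs = [1 / n_cat] * n_cat          # start from a flat spectrum
    for _ in range(n_iter):
        post_sum = [0.0] * n_cat
        for h in site_likelihoods:
            # E-step: posterior over the derived-allele count at this site.
            w = [h[j] * sfs[j] for j in range(n_cat)]
            total = sum(w)
            for j in range(n_cat):
                post_sum[j] += w[j] / total
        # M-step: the new spectrum is the average posterior across sites.
        sfs = [s / len(site_likelihoods) for s in post_sum]
    return sfs

# Toy input with uncertain sites (rows need not sum to one).
sites = [
    [0.90, 0.09, 0.01],
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]
print(em_sfs(sites))
```

When every site's likelihood is concentrated on a single category, the estimate reduces to simple counting, which is a convenient sanity check.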