1.
Nature ; 625(7994): 321-328, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38200296

ABSTRACT

Multiple sclerosis (MS) is a neuro-inflammatory and neurodegenerative disease that is most prevalent in Northern Europe. Although it is known that inherited risk for MS is located within or in close proximity to immune-related genes, it is unknown when, where and how this genetic risk originated [1]. Here, by using a large ancient genome dataset from the Mesolithic period to the Bronze Age [2], along with new Medieval and post-Medieval genomes, we show that the genetic risk for MS rose among pastoralists from the Pontic steppe and was brought into Europe by the Yamnaya-related migration approximately 5,000 years ago. We further show that these MS-associated immunogenetic variants underwent positive selection both within the steppe population and later in Europe, probably driven by pathogenic challenges coinciding with changes in diet, lifestyle and population density. This study highlights the critical importance of the Neolithic period and Bronze Age as determinants of modern immune responses and their subsequent effect on the risk of developing MS in a changing environment.


Subject(s)
Genetic Predisposition to Disease , Genome, Human , Grassland , Multiple Sclerosis , Humans , Datasets as Topic , Diet/ethnology , Diet/history , Europe/ethnology , Genetic Predisposition to Disease/history , Genetics, Medical , History, 15th Century , History, Ancient , History, Medieval , Human Migration/history , Life Style/ethnology , Life Style/history , Multiple Sclerosis/genetics , Multiple Sclerosis/history , Multiple Sclerosis/immunology , Neurodegenerative Diseases/genetics , Neurodegenerative Diseases/history , Neurodegenerative Diseases/immunology , Population Density
2.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36661298

ABSTRACT

SUMMARY: With the rapid expansion of the capabilities of the DNA sequencers throughout the different sequencing generations, the quantity of generated data has likewise increased. This evolution has also led to new bioinformatical methods, for which in silico data have become crucial when verifying the accuracy of a model or the robustness of a genomic analysis pipeline. Here, we present a multithreaded next-generation simulator for next-generation sequencing data (NGSNGS), which simulates reads faster than currently available methods and programs. NGSNGS can simulate reads with platform-specific characteristics based on nucleotide quality score profiles as well as including a post-mortem damage model which is relevant for simulating ancient DNA. The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies or even population haplotypes and allows the user to simulate known variable sites directly. The program is implemented in a multithreading framework and is factors faster than currently available tools while extending their feature set and possible output formats. AVAILABILITY AND IMPLEMENTATION: The method and associated programs are released as open-source software, code and user manual are available at https://github.com/RAHenriksen/NGSNGS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Software , Genomics , High-Throughput Nucleotide Sequencing/methods , DNA, Ancient , Sequence Analysis, DNA/methods
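
As a rough illustration of the sampling scheme described in the NGSNGS abstract above (reads drawn with replacement from a reference, with optional sequencing error and post-mortem damage), a minimal Python sketch might look like the following. It is not the NGSNGS implementation: the function name `simulate_reads`, its parameters, the uniform per-base error model and the single-position C-to-T damage model are all simplifying assumptions made here for illustration.

```python
import random


def simulate_reads(reference, n_reads, read_len,
                   error_rate=0.0, damage_rate=0.0, seed=0):
    """Sample reads uniformly, with replacement, from `reference`.

    error_rate  -- per-base probability of substituting a different base
                   (assumed uniform; real simulators use quality profiles).
    damage_rate -- probability of a C->T change at the first base of a
                   read, a crude stand-in for post-mortem deamination.
    Returns a list of (start_position, read_sequence) tuples.
    """
    if read_len > len(reference):
        raise ValueError("read_len exceeds reference length")
    rng = random.Random(seed)  # seeded for reproducible output
    reads = []
    for _ in range(n_reads):
        # Every start position is equally likely on every draw,
        # which is what "with replacement" means here.
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        # Toy damage model: deaminate a leading cytosine.
        if bases[0] == "C" and rng.random() < damage_rate:
            bases[0] = "T"
        # Toy error model: replace a base with one of the other three.
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([x for x in "ACGT" if x != base])
        reads.append((start, "".join(bases)))
    return reads
```

With both rates left at zero, every simulated read is an exact substring of the reference, which gives a simple sanity check before any error or damage is switched on.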
5.
Genetics ; 222(4), 2022 11 30.
Article in English | MEDLINE | ID: mdl-36173322

ABSTRACT

The site frequency spectrum is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the site frequency spectrum from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue but sometimes suffer from two problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multidimensional site frequency spectrum estimation. In this article, we present a stochastic expectation-maximization algorithm for inferring the site frequency spectrum from NGS data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Furthermore, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.


Subject(s)
Algorithms , Genetics, Population , Genotype , Genome , Bias , High-Throughput Nucleotide Sequencing/methods
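
For readers unfamiliar with expectation-maximization estimation of the site frequency spectrum, the plain (full-data) EM update that stochastic variants build on can be sketched as below. This is a generic textbook-style sketch, not the winsfs algorithm: the abstract does not spell out how its stochastic updates are organised, so nothing of that is reproduced here. The function name `em_sfs` and the input layout (one list of per-category likelihoods per site) are assumptions made for illustration.

```python
def em_sfs(site_likelihoods, n_iter=100):
    """Estimate a 1D site frequency spectrum by standard EM.

    site_likelihoods[s][k] is assumed to hold P(data at site s | the
    sample carries k derived alleles). Returns a list of proportions
    that sums to 1.
    """
    if not site_likelihoods:
        raise ValueError("need at least one site")
    n_cat = len(site_likelihoods[0])
    sfs = [1.0 / n_cat] * n_cat  # start from a flat spectrum
    for _ in range(n_iter):
        counts = [0.0] * n_cat
        for lik in site_likelihoods:
            # E-step: posterior over allele-count categories at this site.
            post = [p * l for p, l in zip(sfs, lik)]
            total = sum(post)
            if total == 0.0:
                continue  # site is uninformative under the current estimate
            for k in range(n_cat):
                counts[k] += post[k] / total
        norm = sum(counts)
        if norm == 0.0:
            break  # nothing to update from
        # M-step: the new spectrum is the normalised expected counts.
        sfs = [c / norm for c in counts]
    return sfs
```

Each iteration of this full-data version visits every site, so its cost grows with the number of sites times the number of iterations. The abstract reports that its stochastic algorithm greatly reduces runtime and keeps RAM usage constant.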
6.
Mol Biol Evol ; 39(6), 2022 06 02.
Article in English | MEDLINE | ID: mdl-35647675

ABSTRACT

Commonly used methods for inferring phylogenies were designed before the emergence of high-throughput sequencing and can generally not accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters, while the uncertainty in genotype calling, arising as a consequence of the sequencing technology, is ignored. In order to address this problem, we describe two new probabilistic approaches for estimating genetic distances: distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next-generation sequencing data, utilize the full information from the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behaviors than other currently available methods for estimating genetic distances, even for very low-depth data with high error rates.


Subject(s)
Genome , High-Throughput Nucleotide Sequencing , Algorithms , Diploidy , Genotype , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software
7.
Gigascience ; 11, 2022 05 17.
Article in English | MEDLINE | ID: mdl-35579549

ABSTRACT

BACKGROUND: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. RESULTS: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. CONCLUSION: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.


Subject(s)
Genetics, Population , High-Throughput Nucleotide Sequencing , Gene Frequency , Genotype , High-Throughput Nucleotide Sequencing/methods , Humans , Likelihood Functions , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods
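
The abstract above distinguishes unfolded from folded spectra. As a small illustration of that distinction only (not of the published method, whose binning heuristic and accelerated EM are not reproduced here), folding a one-dimensional unfolded spectrum can be sketched as follows. The function name `fold_sfs` is a hypothetical helper introduced for this example.

```python
def fold_sfs(unfolded):
    """Fold a 1D unfolded SFS into a minor-allele-count spectrum.

    unfolded[k] is the count (or proportion) of sites with k derived
    alleles among n = len(unfolded) - 1 sampled chromosomes. When the
    ancestral state is unknown, k and n - k cannot be told apart, so
    the two categories are pooled.
    """
    n = len(unfolded) - 1
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # The middle category (i == n - i) has no partner to pool with.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


print(fold_sfs([10, 5, 3, 2, 1]))  # [11, 7, 3]
```

Folding preserves the total number of sites, so comparing the sums of the two spectra is a quick consistency check.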
9.
Bioinformatics ; 38(4): 1159-1161, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34718411

ABSTRACT

MOTIVATION: Inference of identity-by-descent (IBD) sharing along the genome between pairs of individuals has important uses. But all existing inference methods are based on genotypes, which is not ideal for low-depth Next Generation Sequencing (NGS) data from which genotypes can only be called with high uncertainty. RESULTS: We present a new probabilistic software tool, LocalNgsRelate, for inferring IBD sharing along the genome between pairs of individuals from low-depth NGS data. Its inference is based on genotype likelihoods instead of genotypes, and thereby it takes the uncertainty of the genotype calling into account. Using real data from the 1000 Genomes project, we show that LocalNgsRelate provides more accurate IBD inference for low-depth NGS data than two state-of-the-art genotype-based methods, Albrechtsen et al. (2009) and hap-IBD. We also show that the method works well for NGS data down to a depth of 2×. AVAILABILITY AND IMPLEMENTATION: LocalNgsRelate is freely available at https://github.com/idamoltke/LocalNgsRelate. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Software , Humans , Genotype , Probability , High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide
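
Several abstracts in this list base inference on genotype likelihoods rather than called genotypes. To make that term concrete, here is a minimal sketch of one common simple model for a biallelic diploid site. It is an illustrative assumption and not the model used by LocalNgsRelate or any other tool listed here: the function name `genotype_likelihoods`, the fixed per-base error rate and the independence of reads are all choices made for this example.

```python
def genotype_likelihoods(bases, ref, alt, error=0.01):
    """Return [P(reads | g)] for g = 0, 1, 2 copies of the alt allele.

    bases -- observed read bases covering one site, e.g. "AAG".
    error -- assumed per-base sequencing error rate; an erroneous base
             is taken to be any of the three other bases with equal
             probability.
    Reads are treated as independent draws from the two chromosomes.
    """
    likelihoods = []
    for g in (0, 1, 2):
        lik = 1.0
        for base in bases:
            p_if_ref = 1.0 - error if base == ref else error / 3.0
            p_if_alt = 1.0 - error if base == alt else error / 3.0
            # Each read comes from an alt chromosome with probability g/2.
            lik *= (1.0 - g / 2.0) * p_if_ref + (g / 2.0) * p_if_alt
        likelihoods.append(lik)
    return likelihoods
```

With a single read showing the reference base, the likelihood of the heterozygous genotype is still roughly half that of the homozygous reference genotype under this model. That illustrates why genotypes called from low-depth data carry high uncertainty.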
10.
Nature ; 600(7887): 86-92, 2021 12.
Article in English | MEDLINE | ID: mdl-34671161

ABSTRACT

During the last glacial-interglacial cycle, Arctic biotas experienced substantial climatic changes, yet the nature, extent and rate of their responses are not fully understood [1-8]. Here we report a large-scale environmental DNA metagenomic study of ancient plant and mammal communities, analysing 535 permafrost and lake sediment samples from across the Arctic spanning the past 50,000 years. Furthermore, we present 1,541 contemporary plant genome assemblies that were generated as reference sequences. Our study provides several insights into the long-term dynamics of the Arctic biota at the circumpolar and regional scales. Our key findings include: (1) a relatively homogeneous steppe-tundra flora dominated the Arctic during the Last Glacial Maximum, followed by regional divergence of vegetation during the Holocene epoch; (2) certain grazing animals consistently co-occurred in space and time; (3) humans appear to have been a minor factor in driving animal distributions; (4) higher effective precipitation, as well as an increase in the proportion of wetland plants, show negative effects on animal diversity; (5) the persistence of the steppe-tundra vegetation in northern Siberia enabled the late survival of several now-extinct megafauna species, including the woolly mammoth until 3.9 ± 0.2 thousand years ago (ka) and the woolly rhinoceros until 9.8 ± 0.2 ka; and (6) phylogenetic analysis of mammoth environmental DNA reveals a previously unsampled mitochondrial lineage. Our findings highlight the power of ancient environmental metagenomics analyses to advance understanding of population histories and long-term ecological dynamics.


Subject(s)
Biota , DNA, Ancient/analysis , DNA, Environmental/analysis , Metagenomics , Animals , Arctic Regions , Climate Change/history , Databases, Genetic , Datasets as Topic , Extinction, Biological , Geologic Sediments , Grassland , Greenland , Haplotypes/genetics , Herbivory/genetics , History, Ancient , Humans , Lakes , Mammoths , Mitochondria/genetics , Perissodactyla , Permafrost , Phylogeny , Plants/genetics , Population Dynamics , Rain , Siberia , Spatio-Temporal Analysis , Wetlands
11.
Sci Adv ; 7(44): eabh2013, 2021 Oct 29.
Article in English | MEDLINE | ID: mdl-34705496

ABSTRACT

A great-grandson of the legendary Lakota Sioux leader Sitting Bull (Tatanka Iyotake), Ernie LaPointe, wished to have their familial relationship confirmed via genetic analysis, in part, to help settle concerns over Sitting Bull's final resting place. To address Ernie LaPointe's claim of family relationship, we obtained minor amounts of genomic data from a small piece of hair from Sitting Bull's scalp lock, which was repatriated in 2007. We then compared these data to genome-wide data from LaPointe and other Lakota Sioux using a new probabilistic approach and concluded that Ernie LaPointe is Sitting Bull's great-grandson. To our knowledge, this is the first published example of a familial relationship between a contemporary and a historical individual that has been confirmed using such limited amounts of ancient DNA across such distant relatives. Hence, this study opens the possibility for broadening genealogical research, even when only minor amounts of ancient genetic material are accessible.

12.
Mol Biol Evol ; 38(7): 2750-2766, 2021 06 25.
Article in English | MEDLINE | ID: mdl-33681996

ABSTRACT

The relative importance of introgression for diversification has long been a highly disputed topic in speciation research and remains an open question despite the great attention it has received over the past decade. Gene flow leaves traces in the genome similar to those created by incomplete lineage sorting (ILS), and identification and quantification of gene flow in the presence of ILS is challenging and requires knowledge about the true phylogenetic relationship among the species. We use whole nuclear, plastid, and organellar genomes from 12 species in the rapidly radiated, ecologically diverse, actively hybridizing genus of peatmoss (Sphagnum) to reconstruct the species phylogeny and quantify introgression using a suite of phylogenomic methods. We found extensive phylogenetic discordance among nuclear and organellar phylogenies, as well as across the nuclear genome and the nodes in the species tree, best explained by extensive ILS following the rapid radiation of the genus rather than by postspeciation introgression. Our analyses support the idea of ancient introgression among the ancestral lineages followed by ILS, whereas recent gene flow among the species is highly restricted despite widespread interspecific hybridization known in the group. Our results contribute to phylogenomic understanding of how speciation proceeds in rapidly radiated, actively hybridizing species groups, and demonstrate that employing a combination of diverse phylogenomic methods can facilitate untangling complex phylogenetic patterns created by ILS and introgression.


Subject(s)
Gene Flow , Genetic Introgression , Genetic Speciation , Phylogeny , Sphagnopsida/genetics , Genome, Plant , Phylogeography
13.
Mol Ecol Resour ; 21(4): 1085-1097, 2021 May.
Article in English | MEDLINE | ID: mdl-33434329

ABSTRACT

Genotyping-by-sequencing methods such as RADseq are popular for generating genomic and population-scale data sets from a diverse range of organisms. These often lack a usable reference genome, restricting users to RADseq-specific software for processing. However, these come with limitations compared to generic next-generation sequencing (NGS) toolkits. Here, we describe and test a simple pipeline for reference-free RADseq data processing that blends de novo elements from STACKS with the full suite of state-of-the-art NGS tools. Specifically, we use the de novo RADseq assembly employed by STACKS to create a catalogue of RAD loci that serves as a reference for read mapping, variant calling and site filters. Using RADseq data from 28 zebra sequenced to ~8x depth-of-coverage, we evaluate our approach by comparing the site frequency spectra (SFS) to those from alternative pipelines. Most pipelines yielded similar SFS at 8x depth, but only a genotype-likelihood-based pipeline performed similarly at low sequencing depth (2-4x). We compared the RADseq SFS with medium-depth (~13x) shotgun sequencing of eight overlapping samples, revealing that the RADseq SFS was persistently slightly skewed towards rare and invariant alleles. Using simulations and human data, we confirm that this is expected when there is allelic dropout (AD) in the RADseq data. AD in the RADseq data caused a heterozygosity deficit of ~16%, which dropped to ~5% after filtering AD. Hence, AD was the most important source of bias in our RADseq data.


Subject(s)
High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software , Animals , Equidae/genetics , Genomics , Humans , Likelihood Functions , Loss of Heterozygosity , Polymorphism, Single Nucleotide
14.
Heredity (Edinb) ; 125(1-2): 15-27, 2020 08.
Article in English | MEDLINE | ID: mdl-32346130

ABSTRACT

Populations of the common chimpanzee (Pan troglodytes) are at imminent risk of extinction in the wild as a consequence of damaging anthropogenic impact on their natural habitat and the illegal pet and bushmeat trade. Conservation management programmes for the chimpanzee have been established outside their natural range (ex situ), and chimpanzees from these programmes could potentially be used to supplement future conservation initiatives in the wild (in situ). However, these programmes have often suffered from inadequate information about the geographical origin and subspecies ancestry of the founders. Here, we present a newly designed capture array with ~60,000 ancestry informative markers used to infer ancestry of individual chimpanzees in ex situ populations and determine geographical origin of confiscated sanctuary individuals. From a test panel of 167 chimpanzees with unknown origins or subspecies labels, we identify 90 suitable non-admixed individuals in the European Association of Zoos and Aquaria (EAZA) Ex situ Programme (EEP). Equally important, another 46 individuals have been identified with admixed subspecies ancestries, which should therefore be naturally phased out of the breeding populations over time. With potential for future re-introduction to the wild, we determine the geographical origin of 31 individuals that were confiscated from the illegal trade and demonstrate the promise of using non-invasive sampling in future conservation action plans. Collectively, our genomic approach provides an exemplar for ex situ management of endangered species and offers an efficient tool in future in situ efforts to combat the illegal wildlife trade.


Subject(s)
Conservation of Natural Resources , Endangered Species , Pan troglodytes , Animals , Ecosystem , Pan troglodytes/genetics
15.
Bioinformatics ; 36(3): 828-841, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31504166

ABSTRACT

MOTIVATION: The presence of present-day human contaminating DNA fragments is one of the challenges defining ancient DNA (aDNA) research. This is especially relevant to the ancient human DNA field where it is difficult to distinguish endogenous molecules from human contaminants due to their genetic similarity. Recently, with the advent of high-throughput sequencing and new aDNA protocols, hundreds of ancient human genomes have become available. Contamination in those genomes has been measured with computational methods often developed specifically for these empirical studies. Consequently, some of these methods have not been implemented and tested for general use, while few are aimed at low-depth nuclear data, a common feature in aDNA datasets. RESULTS: We develop a new X-chromosome-based maximum likelihood method for estimating present-day human contamination in low-depth sequencing data from male individuals. We implement our method for general use, assess its performance under conditions typical of ancient human DNA research, and compare it to previous nuclear data-based methods through extensive simulations. For low-depth data, we show that existing methods can produce unusable estimates or substantially underestimate contamination. In contrast, our method provides accurate estimates for a depth of coverage as low as 0.5× on the X-chromosome when contamination is below 25%. Moreover, our method still yields meaningful estimates in very challenging situations, i.e. when the contaminant and the target come from closely related populations or with increased error rates. With a running time below 5 min, our method is applicable to large-scale aDNA genomic studies. AVAILABILITY AND IMPLEMENTATION: The method is implemented in C++ and R and is available at github.com/sapfo/contaminationX and popgen.dk/angsd.


Subject(s)
DNA, Ancient , High-Throughput Nucleotide Sequencing , Chromosomes , Humans , Likelihood Functions , Male , Sequence Analysis, DNA
16.
Genetics ; 212(3): 587-614, 2019 07.
Article in English | MEDLINE | ID: mdl-31088861

ABSTRACT

Both the total amount and the distribution of heterozygous sites within individual genomes are informative about the genetic diversity of the population they belong to. Detecting true heterozygous sites in ancient genomes is complicated by the generally limited coverage achieved and the presence of post-mortem damage inflating sequencing errors. Additionally, large runs of homozygosity found in the genomes of particularly inbred individuals and of domestic animals can skew estimates of genome-wide heterozygosity rates. Current computational tools aimed at estimating runs of homozygosity and genome-wide heterozygosity levels are generally sensitive to such limitations. Here, we introduce ROHan, a probabilistic method which substantially improves the estimate of heterozygosity rates both genome-wide and for genomic local windows. It combines a local Bayesian model and a Hidden Markov Model at the genome-wide level and can work both on modern and ancient samples. We show that our algorithm outperforms currently available methods for predicting heterozygosity rates for ancient samples. Specifically, ROHan can delineate large runs of homozygosity (at megabase scales) and produce a reliable confidence interval for the genome-wide rate of heterozygosity outside of such regions from modern genomes with a depth of coverage as low as 5-6× and down to 7-8× for ancient samples showing moderate DNA damage. We apply ROHan to a series of modern and ancient genomes previously published and revise available estimates of heterozygosity for humans, chimpanzees and horses.


Subject(s)
DNA, Ancient , Genotyping Techniques/methods , Heterozygote , Homozygote , Animals , Bayes Theorem , Genotyping Techniques/standards , Humans , Markov Chains
17.
Gigascience ; 8(5), 2019 05 01.
Article in English | MEDLINE | ID: mdl-31042285

ABSTRACT

BACKGROUND: The estimation of relatedness between pairs of possibly inbred individuals from high-throughput sequencing (HTS) data has previously not been possible for samples where we cannot obtain reliable genotype calls, as in the case of low-coverage data. RESULTS: We introduce ngsRelateV2, a major revision of ngsRelateV1, a program that originally allowed for estimation of relatedness from HTS data among non-inbred individuals only. The new revised version takes into account the possibility of individuals being inbred by estimating the 9 condensed Jacquard coefficients along with various other relatedness statistics. The program is threaded and scales linearly with the number of cores allocated to the process. CONCLUSION: The program is available as an open source C/C++ program under the GPL license and hosted at https://github.com/ANGSD/ngsRelate. To facilitate easy analysis, the program is able to work directly on the most commonly used container formats for raw sequence (BAM/CRAM) and summary data (VCF/BCF).


Subject(s)
Genetics, Population , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Inbreeding , Genotype , Humans , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA , Software
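
The ngsRelateV2 abstract above mentions the 9 condensed Jacquard coefficients. As a hedged illustration of how such coefficients relate to more familiar summary statistics, the sketch below applies the standard textbook identities for the kinship coefficient and the two individuals' inbreeding coefficients. It does not reproduce ngsRelate's code or its output columns. The function name `summarize_jacquard` and the dictionary keys are hypothetical names chosen for this example.

```python
def summarize_jacquard(j):
    """Derive kinship and inbreeding from nine condensed Jacquard coefficients.

    j -- sequence (J1, ..., J9) of probabilities summing to 1.
    Uses the standard identities
        kinship = J1 + (J3 + J5 + J7) / 2 + J8 / 4
        F_a     = J1 + J2 + J3 + J4
        F_b     = J1 + J2 + J5 + J6
    """
    if len(j) != 9:
        raise ValueError("expected exactly nine coefficients")
    if abs(sum(j) - 1.0) > 1e-8:
        raise ValueError("coefficients must sum to 1")
    j1, j2, j3, j4, j5, j6, j7, j8, j9 = j
    return {
        "kinship": j1 + 0.5 * (j3 + j5 + j7) + 0.25 * j8,
        "inbreeding_a": j1 + j2 + j3 + j4,
        "inbreeding_b": j1 + j2 + j5 + j6,
    }


# Non-inbred full siblings: only J7, J8, J9 are non-zero.
full_sibs = [0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
print(summarize_jacquard(full_sibs)["kinship"])  # 0.25
```

For non-inbred pairs only the last three coefficients are non-zero, and both inbreeding coefficients come out as zero. That is the special case the abstract says the original ngsRelateV1 was limited to.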
18.
Bioinformatics ; 31(24): 4009-11, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26323718

ABSTRACT

MOTIVATION: Pairwise relatedness estimation is important in many contexts such as disease mapping and population genetics. However, all existing estimation methods are based on called genotypes, which is not ideal for next-generation sequencing (NGS) data of low depth from which genotypes cannot be called with high certainty. RESULTS: We present a software tool, NgsRelate, for estimating pairwise relatedness from NGS data. It provides maximum likelihood estimates that are based on genotype likelihoods instead of genotypes and thereby takes the inherent uncertainty of the genotypes into account. Using both simulated and real data, we show that NgsRelate provides markedly better estimates for low-depth NGS data than two state-of-the-art genotype-based methods. AVAILABILITY: NgsRelate is implemented in C++ and is available under the GNU license at www.popgen.dk/software.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , Genotyping Techniques , Humans , Likelihood Functions
19.
Nature ; 523(7561): 455-458, 2015 Jul 23.
Article in English | MEDLINE | ID: mdl-26087396

ABSTRACT

Kennewick Man, referred to as the Ancient One by Native Americans, is a male human skeleton discovered in Washington state (USA) in 1996 and initially radiocarbon dated to 8,340-9,200 calibrated years before present (BP). His population affinities have been the subject of scientific debate and legal controversy. Based on an initial study of cranial morphology it was asserted that Kennewick Man was neither Native American nor closely related to the claimant Plateau tribes of the Pacific Northwest, who claimed ancestral relationship and requested repatriation under the Native American Graves Protection and Repatriation Act (NAGPRA). The morphological analysis was important to judicial decisions that Kennewick Man was not Native American and that therefore NAGPRA did not apply. Instead of repatriation, additional studies of the remains were permitted. Subsequent craniometric analysis affirmed Kennewick Man to be more closely related to circumpacific groups such as the Ainu and Polynesians than he is to modern Native Americans. In order to resolve Kennewick Man's ancestry and affiliations, we have sequenced his genome to ∼1× coverage and compared it to worldwide genomic data including for the Ainu and Polynesians. We find that Kennewick Man is closer to modern Native Americans than to any other population worldwide. Among the Native American groups for whom genome-wide data are available for comparison, several seem to be descended from a population closely related to that of Kennewick Man, including the Confederated Tribes of the Colville Reservation (Colville), one of the five tribes claiming Kennewick Man. We revisit the cranial analyses and find that, as opposed to genome-wide comparisons, it is not possible on that basis to affiliate Kennewick Man to specific contemporary groups. We therefore conclude based on genetic comparisons that Kennewick Man shows continuity with Native North Americans over at least the last eight millennia.


Subject(s)
Indians, North American/genetics , Phylogeny , Skeleton , Americas , Genome, Human/genetics , Genomics , Humans , Male , Skull/anatomy & histology , Washington
20.
Genome Res ; 25(4): 459-66, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25770088

ABSTRACT

It is commonly thought that human genetic diversity in non-African populations was shaped primarily by an out-of-Africa dispersal 50-100 thousand yr ago (kya). Here, we present a study of 456 geographically diverse high-coverage Y chromosome sequences, including 299 newly reported samples. Applying ancient DNA calibration, we date the Y-chromosomal most recent common ancestor (MRCA) in Africa at 254 (95% CI 192-307) kya and detect a cluster of major non-African founder haplogroups in a narrow time interval at 47-52 kya, consistent with a rapid initial colonization model of Eurasia and Oceania after the out-of-Africa bottleneck. In contrast to demographic reconstructions based on mtDNA, we infer a second strong bottleneck in Y-chromosome lineages dating to the last 10 ky. We hypothesize that this bottleneck is caused by cultural changes affecting variance of reproductive success among males.


Subject(s)
Chromosomes, Human, Y/genetics , Evolution, Molecular , Racial Groups/genetics , Base Sequence , DNA, Mitochondrial/genetics , Genetic Variation/genetics , Genetics, Population , Haplotypes/genetics , Humans , Male , Models, Genetic , Phylogeny , Sequence Analysis, DNA