ABSTRACT
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
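For orientation, the generic randomized SVD scheme that methods of this kind build on can be sketched as follows. This is a minimal illustration of the textbook approach (random projection, a few power iterations, then an exact SVD of a small matrix), not PCAone's window-based algorithm or its out-of-core implementation. The function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are assumptions made for this example.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k SVD of A with a basic randomized scheme.

    Hypothetical helper for illustration only; it holds A fully in memory.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sketch size: target rank plus a small oversampling margin.
    l = min(k + oversample, n_rows, n_cols)
    # Project A onto a random low-dimensional subspace.
    omega = rng.standard_normal((n_cols, l))
    Q, _ = np.linalg.qr(A @ omega)
    # Power iterations sharpen the captured subspace; QR keeps it stable.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix B (l x n_cols).
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]
```

For PCA, `A` would be the centered (and typically scaled) data matrix, and the principal components follow from the returned singular vectors. Each power iteration requires two passes over `A`, which is the cost that faster-converging variants aim to reduce.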
Subject(s)
Biological Specimen Banks, Software, Humans, Principal Component Analysis, Algorithms, Genomics
ABSTRACT
Accurate inference of population structure is important in many studies of population genetics. Here we present HaploNet, a method for performing dimensionality reduction and clustering of genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or dense genotype data. By using Gaussian mixtures in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We show that we can use haplotype clusters in the latent space to infer global population structure using haplotype information by exploiting the generative properties of our framework. Based on fitted neural networks and their latent haplotype clusters, we can perform principal component analysis and estimate ancestry proportions based on a maximum likelihood framework. Using sequencing data from simulations and closely related human populations, we show that our approach is better at distinguishing closely related populations than standard admixture and principal component analysis software. We further show that HaploNet is fast and highly scalable by applying it to genotype array data of the UK Biobank.
ABSTRACT
Genomic studies of species threatened by extinction are providing crucial information about evolutionary mechanisms and genetic consequences of population declines and bottlenecks. However, to understand how species avoid the extinction vortex, insights can be drawn by studying species that thrive despite past declines. Here, we studied the population genomics of the muskox (Ovibos moschatus), an Ice Age relict that was on the brink of extinction for thousands of years at the end of the Pleistocene yet appears to be thriving today. We analysed 108 whole genomes, including present-day individuals representing the current native range of both muskox subspecies, the white-faced and the barren-ground muskox (O. moschatus wardi and O. moschatus moschatus) and a ~21,000-year-old ancient individual from Siberia. We found that the muskox's demographic history was profoundly shaped by past climate changes and post-glacial re-colonizations. In particular, the white-faced muskox has the lowest genome-wide heterozygosity recorded in an ungulate. Yet, there is no evidence of inbreeding depression in native muskox populations. We hypothesize that this can be explained by the effect of long-term gradual population declines that allowed for purging of strongly deleterious mutations. This study provides insights into how species with a history of population bottlenecks, small population sizes and low genetic diversity survive against all odds.
Subject(s)
Metagenomics, Psychological Resilience, Humans, Animals, Newborn Infant, Biological Evolution, Genomics, Ruminants/genetics, Genetic Variation/genetics
ABSTRACT
MOTIVATION: Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. RESULTS: We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, which were shallowly sequenced to around 0.08×. From these data we are able to capture the population structure of the Han Chinese and to reproduce previous analyses in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. AVAILABILITY AND IMPLEMENTATION: EMU is written in Python and is freely available at https://github.com/rosemeis/emu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
BACKGROUND: Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. MATERIALS AND METHODS: We have extended two principal component analysis-based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. RESULTS: Here, we present two selection statistics, which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. CONCLUSION: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.
Subject(s)
Population Genetics, High-Throughput Nucleotide Sequencing, Genome, Genotype, Humans, Single Nucleotide Polymorphism, Principal Component Analysis
ABSTRACT
Perturbation of lipid homoeostasis is a major risk factor for cardiovascular disease (CVD), the leading cause of death worldwide. We aimed to identify genetic variants affecting lipid levels, and thereby risk of CVD, in Greenlanders. Genome-wide association studies (GWAS) of six blood lipids, triglycerides, LDL-cholesterol, HDL-cholesterol, total cholesterol, as well as apolipoproteins A1 and B, were performed in up to 4473 Greenlanders. For genome-wide significant variants, we also tested for associations with additional traits, including CVD events. We identified 11 genome-wide significant loci associated with lipid traits. Most of these loci were already known in Europeans; however, we found a potential causal variant near PCSK9 (rs12117661), which was independent of the known PCSK9 loss-of-function variant (rs11491147). rs12117661 was associated with lower LDL-cholesterol (β_SD (SE) = -0.22 (0.03), p = 6.5 × 10^-12) and total cholesterol (-0.17 (0.03), p = 1.1 × 10^-8) in the Greenlandic study population. Similar associations were observed in Europeans from the UK Biobank, where the variant was also associated with a lower risk of CVD outcomes. Moreover, rs12117661 was a top eQTL for PCSK9 across tissues in European data from the GTEx portal, and was located in a predicted regulatory element, supporting a possible causal impact on PCSK9 expression. Combined, the 11 GWAS signals explained up to 16.3% of the variance of the lipid traits. This suggests that the genetic architecture of lipid levels in Greenlanders is different from Europeans, with fewer variants explaining the variance.
Subject(s)
Cardiovascular Diseases, Genome-Wide Association Study, Humans, Proprotein Convertase 9/genetics, Greenland, Triglycerides/genetics, Lipids/genetics, HDL Cholesterol, LDL Cholesterol/genetics, LDL Cholesterol/metabolism, Cardiovascular Diseases/genetics, Single Nucleotide Polymorphism
ABSTRACT
Strong genetic structure has prompted discussion regarding giraffe taxonomy,1,2,3 including a suggestion to split the giraffe into four species: Northern (Giraffa c. camelopardalis), Reticulated (G. c. reticulata), Masai (G. c. tippelskirchi), and Southern giraffes (G. c. giraffa).4,5,6 However, their evolutionary history is not yet fully resolved, as previous studies used a simple bifurcating model and did not explore the presence or extent of gene flow between lineages. We therefore inferred a model that incorporates various evolutionary processes to assess the drivers of contemporary giraffe diversity. We analyzed whole-genome sequencing data from 90 wild giraffes from 29 localities across their current distribution. The most basal divergence was dated to 280 kya. Genetic differentiation, FST, among major lineages ranged between 0.28 and 0.62, and we found significant levels of ancient gene flow between them. In particular, several analyses suggested that the Reticulated lineage evolved through admixture, with almost equal contribution from the Northern lineage and an ancestral lineage related to Masai and Southern giraffes. These new results highlight a scenario of strong differentiation despite gene flow, providing further context for the interpretation of giraffe diversity and the process of speciation in general. They also illustrate that conservation measures need to target various lineages and sublineages and that separate management strategies are needed to conserve giraffe diversity effectively. Given local extinctions and recent dramatic declines in many giraffe populations, this improved understanding of giraffe evolutionary history is relevant for conservation interventions, including reintroductions and reinforcements of existing populations.
Subject(s)
Giraffes, Animals, Giraffes/genetics, Ruminants/genetics, Biological Evolution, Phylogeny, Genetic Drift
ABSTRACT
The blue wildebeest (Connochaetes taurinus) is a keystone species in savanna ecosystems from southern to eastern Africa, and is well known for its spectacular migrations and locally extreme abundance. In contrast, the black wildebeest (C. gnou) is endemic to southern Africa, barely escaped extinction in the 1900s and is feared to be in danger of genetic swamping from the blue wildebeest. Despite the ecological importance of the wildebeest, there is a lack of understanding of how its unique migratory ecology has affected its gene flow, genetic structure and phylogeography. Here, we analyze whole genomes from 121 blue and 22 black wildebeest across the genus' range. We find discrete genetic structure consistent with the morphologically defined subspecies. Unexpectedly, our analyses reveal no signs of recent interspecific admixture, but rather a late Pleistocene introgression of black wildebeest into the southern blue wildebeest populations. Finally, we find that migratory blue wildebeest populations exhibit a combination of long-range panmixia, higher genetic diversity and lower inbreeding levels compared to neighboring populations whose migration has recently been disrupted. These findings provide crucial insights into the evolutionary history of the wildebeest, and tangible genetic evidence for the negative effects of anthropogenic activities on highly migratory ungulates.
Subject(s)
Antelopes, Animals, Antelopes/genetics, Ecosystem, Eastern Africa, Southern Africa, Anthropogenic Effects
ABSTRACT
Genotyping-by-sequencing methods such as RADseq are popular for generating genomic and population-scale data sets from a diverse range of organisms. These often lack a usable reference genome, restricting users to RADseq-specific software for processing. However, these come with limitations compared to generic next generation sequencing (NGS) toolkits. Here, we describe and test a simple pipeline for reference-free RADseq data processing that blends de novo elements from STACKS with the full suite of state-of-the-art NGS tools. Specifically, we use the de novo RADseq assembly employed by STACKS to create a catalogue of RAD loci that serves as a reference for read mapping, variant calling and site filters. Using RADseq data from 28 zebra sequenced to ~8x depth-of-coverage, we evaluate our approach by comparing the site frequency spectra (SFS) to those from alternative pipelines. Most pipelines yielded similar SFS at 8x depth, but only a genotype likelihood-based pipeline performed similarly at low sequencing depth (2-4x). We compared the RADseq SFS with medium-depth (~13x) shotgun sequencing of eight overlapping samples, revealing that the RADseq SFS was persistently slightly skewed towards rare and invariant alleles. Using simulations and human data, we confirm that this is expected when there is allelic dropout (AD) in the RADseq data. AD in the RADseq data caused a heterozygosity deficit of ~16%, which dropped to ~5% after filtering AD. Hence, AD was the most important source of bias in our RADseq data.
Subject(s)
High-Throughput Nucleotide Sequencing, DNA Sequence Analysis, Software, Animals, Equidae/genetics, Genomics, Humans, Likelihood Functions, Loss of Heterozygosity, Single Nucleotide Polymorphism
ABSTRACT
Large carnivores are generally sensitive to ecosystem changes because their specialized diet and position at the top of the trophic pyramid are associated with small population sizes. Accordingly, low genetic diversity at the whole-genome level has been reported for all big cat species, including the widely distributed leopard. However, all previous whole-genome analyses of leopards are based on the Far Eastern Amur leopards that live at the extremity of the species' distribution and therefore are not necessarily representative of the whole species. We sequenced 53 whole genomes of African leopards. Strikingly, we found that the genomic diversity in the African leopard is 2- to 5-fold higher than in other big cats, including the Amur leopard, likely because of an exceptionally high effective population size maintained by the African leopard throughout the Pleistocene. Furthermore, we detected ongoing gene flow and very low population differentiation within African leopards compared with those of other big cats. We corroborated this by showing a complete absence of an otherwise ubiquitous equatorial forest barrier to gene flow. This sets the leopard apart from most other widely distributed large African mammals, including lions. These results revise our understanding of trophic sensitivity and highlight the remarkable resilience of the African leopard, likely because of its extraordinary habitat versatility and broad dietary niche.
Subject(s)
Ecosystem, Genetic Variation, Panthera/anatomy & histology, Panthera/genetics, Africa, Animals, Female, Gene Flow, Male, Panthera/classification, Population Density
ABSTRACT
Testing for deviations from Hardy-Weinberg equilibrium (HWE) is a common practice for quality control in genetic studies. Variable sites violating HWE may be identified as technical errors in the sequencing or genotyping process, or they may be of particular evolutionary interest. Large-scale genetic studies based on next-generation sequencing (NGS) methods have become more prevalent as cost is decreasing, but these methods are still associated with statistical uncertainty. The large-scale studies usually consist of samples from diverse ancestries that make the existence of some degree of population structure almost inevitable. Precautions are therefore needed when analysing these data sets, as population structure causes deviations from HWE. Here we propose a method that takes population structure into account in the testing for HWE, such that other factors causing deviations from HWE can be detected. We show the effectiveness of PCAngsd in low-depth NGS data, as well as in genotype data, for both simulated and real data sets, where the use of genotype likelihoods enables us to model the uncertainty.
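As background, the conventional single-population HWE test on called genotype counts can be sketched as below. This is the classical chi-square test, which assumes a homogeneous population and certain genotypes; it is not the structure-aware, genotype-likelihood-based method described in the abstract, and it is exactly the kind of test that becomes miscalibrated under population structure. The function name `hwe_chi2_test` is an assumption made for this example.

```python
import math

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Classical chi-square test for HWE at one biallelic site.

    Hypothetical helper for illustration; takes called genotype counts
    and returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    # Sample allele frequency of allele A.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        # Monomorphic site: no deviation from HWE can be measured.
        return 0.0, 1.0
    # Expected genotype counts under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Pooling two subpopulations with different allele frequencies produces a heterozygote deficit (the Wahlund effect), so this test can reject HWE at sites with no genotyping error, which is the confounding the abstract addresses.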
Subject(s)
Population Genetics/methods, Genotyping Techniques/methods, High-Throughput Nucleotide Sequencing/methods, Population Genetics/standards, Genotyping Techniques/standards, High-Throughput Nucleotide Sequencing/standards, Quality Control
ABSTRACT
Capsicum is one of the major vegetable crops grown worldwide. Current subdivision in clades and species is based on morphological traits and coarse sets of genetic markers. Broad variability of fruits has been driven by breeding programs and has been mainly studied by linkage analysis. We discovered 746k variable sites by sequencing 1.8% of the genome in a collection of 373 accessions belonging to 11 Capsicum species from 51 countries. We describe genomic variation at population-level, confirm major subdivision in clades and species, and show that the known major subdivision of C. annuum separates large and bulky fruits from small ones. In C. annuum, we identify four novel loci associated with phenotypes determining the fruit shape, including a non-synonymous mutation in the gene Longifolia 1-like (CA03g16080). Our collection covers all the economically important species of Capsicum widely used in breeding programs and represents the widest and largest study so far in terms of the number of species and number of genetic variants analyzed. We identified a large set of markers that can be used for population genetic studies and genetic association analyses. Our results provide a comprehensive and precise perspective on genomic variability in Capsicum at population-level and suggest that future fine genetic association studies will yield useful results for breeding.
Subject(s)
Capsicum/genetics, Fruit/anatomy & histology, Organ Size/genetics, Arabidopsis Proteins/genetics, Genetic Variation, Genome, Genome-Wide Association Study, Plant Breeding, Genetic Polymorphism
ABSTRACT
We here present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing (NGS) data. Inference of population structure is essential in both population genetics and association studies, and is often performed using principal component analysis (PCA) or clustering-based approaches. NGS methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through PCA in an iterative heuristic approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.