ABSTRACT
During the last glacial-interglacial cycle, Arctic biotas experienced substantial climatic changes, yet the nature, extent and rate of their responses are not fully understood (refs. 1-8). Here we report a large-scale environmental DNA metagenomic study of ancient plant and mammal communities, analysing 535 permafrost and lake sediment samples from across the Arctic spanning the past 50,000 years. Furthermore, we present 1,541 contemporary plant genome assemblies that were generated as reference sequences. Our study provides several insights into the long-term dynamics of the Arctic biota at the circumpolar and regional scales. Our key findings include: (1) a relatively homogeneous steppe-tundra flora dominated the Arctic during the Last Glacial Maximum, followed by regional divergence of vegetation during the Holocene epoch; (2) certain grazing animals consistently co-occurred in space and time; (3) humans appear to have been a minor factor in driving animal distributions; (4) higher effective precipitation, as well as an increase in the proportion of wetland plants, show negative effects on animal diversity; (5) the persistence of the steppe-tundra vegetation in northern Siberia enabled the late survival of several now-extinct megafauna species, including the woolly mammoth until 3.9 ± 0.2 thousand years ago (ka) and the woolly rhinoceros until 9.8 ± 0.2 ka; and (6) phylogenetic analysis of mammoth environmental DNA reveals a previously unsampled mitochondrial lineage. Our findings highlight the power of ancient environmental metagenomics analyses to advance understanding of population histories and long-term ecological dynamics.
Subjects
Biota , DNA, Ancient/analysis , DNA, Environmental/analysis , Metagenomics , Animals , Arctic Regions , Climate Change/history , Databases, Genetic , Datasets as Topic , Extinction, Biological , Geologic Sediments , Grassland , Greenland , Haplotypes/genetics , Herbivory/genetics , History, Ancient , Humans , Lakes , Mammoths , Mitochondria/genetics , Perissodactyla , Permafrost , Phylogeny , Plants/genetics , Population Dynamics , Rain , Siberia , Spatio-Temporal Analysis , Wetlands
ABSTRACT
BACKGROUND: We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms for successful phasing of both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of the LRP algorithm implemented in AlphaPhase could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Furthermore, the AlphaPhase implementations of LRP and HLI were not designed to deal with large amounts of missing data that are inherent when using multiple SNP arrays. METHODS: We developed methods that avoid the need for all-against-all searches by performing LRP on subsets of individuals and then concatenating the results. We also extended LRP and HLI algorithms to enable the use of different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of AlphaPhase, and compared its performance to the software package Eagle2. RESULTS: A simulated dataset with one million individuals genotyped with the same 6711 SNPs for a single chromosome took less than a day to phase, compared to more than seven days for Eagle2. The percentage of correctly phased alleles at heterozygous loci was 90.2 and 99.9% for AlphaPhase and Eagle2, respectively. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took AlphaPhase 23 days to phase, with 89.9% of alleles at heterozygous loci phased correctly. The phasing accuracy was generally lower for datasets with different sets of markers than with one set of markers. For a simulated dataset with three sets of markers, 1.5% of alleles at heterozygous positions were phased incorrectly, compared to 0.4% with one set of markers. 
CONCLUSIONS: The improved LRP and HLI algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. AlphaPhase is an order of magnitude faster than the other tested packages, although Eagle2 showed a higher level of phasing accuracy. The speed gain will make phasing achievable for very large genomic datasets in livestock, enabling more powerful breeding and genetics research and application.
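The accuracy figure quoted above, the percentage of correctly phased alleles at heterozygous loci, can be illustrated with a minimal sketch. The function name, the input layout (two 0/1 allele lists per individual) and the choice to score both haplotype orientations are assumptions made here for illustration; the abstract does not say how label swaps or missing calls were handled.

```python
def phasing_accuracy(true_haps, inferred_haps):
    """Percentage of correctly phased alleles at heterozygous loci for one individual.

    Each argument is a pair of equal-length allele lists (0/1), one per haplotype.
    Returns None when the individual has no heterozygous locus.
    """
    t0, t1 = true_haps
    i0, i1 = inferred_haps
    # Only heterozygous loci carry phase information.
    het = [k for k in range(len(t0)) if t0[k] != t1[k]]
    if not het:
        return None
    # Haplotype labels are arbitrary (assumption: a whole-individual swap is
    # not an error), so score both orientations and keep the better one.
    same = sum(1 for k in het if i0[k] == t0[k] and i1[k] == t1[k])
    swapped = sum(1 for k in het if i0[k] == t1[k] and i1[k] == t0[k])
    return 100.0 * max(same, swapped) / len(het)


truth = ([0, 1, 0, 1], [1, 0, 1, 1])      # last locus is homozygous
inferred = ([0, 1, 1, 1], [1, 0, 0, 1])   # third locus phased the wrong way
accuracy = phasing_accuracy(truth, inferred)
```

Here two of the three heterozygous loci agree with the truth, so `accuracy` is two thirds expressed as a percentage.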
Subjects
Algorithms , Datasets as Topic/standards , Genome-Wide Association Study/methods , Haplotypes , Animals , Genome-Wide Association Study/standards , Heterozygote , Livestock/genetics , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. RESULTS: Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. 
CONCLUSIONS: By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.
Subjects
Genomics/methods , Genotyping Techniques/methods , Genome-Wide Association Study , Sequence Analysis, DNA , Software
ABSTRACT
BACKGROUND: This paper describes a combined heuristic and hidden Markov model (HMM) method to accurately impute missing genotypes in livestock datasets. Genomic selection in breeding programs requires high-density genotyping of many individuals, making algorithms that economically generate this information crucial. There are two common classes of imputation methods, heuristic methods and probabilistic methods, the latter being largely based on hidden Markov models. Heuristic methods are robust, but fail to impute markers in regions where the thresholds of heuristic rules are not met, or the pedigree is inconsistent. Hidden Markov models are probabilistic methods which typically do not require specific family structures or pedigree information, making them very flexible, but they are computationally expensive and, in some cases, less accurate. RESULTS: We implemented a new hybrid imputation method that combined heuristic and HMM methods, AlphaImpute and MaCH, and compared the computation time and imputation accuracy of the three methods. AlphaImpute was the fastest, followed by the hybrid method and then the HMM. The computation time of the hybrid method and the HMM increased linearly with the number of iterations used in the hidden Markov model; however, the computation time of the hybrid method increased almost linearly and that of the HMM quadratically with the number of template haplotypes. The hybrid method was the most accurate imputation method for low-density panels when pedigree information was missing, especially if minor allele frequency was also low. The accuracy of the hybrid method and the HMM increased with the number of template haplotypes. The imputation accuracy of all three methods increased with the marker density of the low-density panels. Excluding the pedigree information reduced imputation accuracy for the hybrid method and AlphaImpute. Finally, the imputation accuracy of the three methods decreased with decreasing minor allele frequency.
CONCLUSIONS: The hybrid heuristic and probabilistic imputation method is able to impute all markers for all individuals in a population, as the HMM does. The hybrid method is usually more accurate and never significantly less accurate than a purely heuristic method or a purely probabilistic method and is faster than a standard probabilistic method.
Subjects
Breeding/methods , Genome-Wide Association Study/methods , Livestock/genetics , Software , Animals , Breeding/standards , Gene Frequency , Genome-Wide Association Study/standards , Genotype
ABSTRACT
BACKGROUND: Grapes are one of the world's most valuable crops and most are made into wine. Grapes belong to the genus Vitis, which includes over 60 inter-fertile species. The most common grape cultivars derive their entire ancestry from the species Vitis vinifera, but wild relatives have also been exploited to create hybrid cultivars, often with increased disease resistance. RESULTS: We evaluate the genetic ancestry of some of the most widely grown commercial hybrids from North America and Europe. Using genotyping-by-sequencing (GBS), we generated 2482 SNPs and 56 indels from 7 wild Vitis, 7 V. vinifera, and 64 hybrid cultivars. We used a principal component analysis (PCA) based ancestry estimation procedure and verified its accuracy with both empirical and simulated data. V. vinifera ancestry ranged from 11 % to 76 % across hybrids studied. Approximately one third (22/64) of the hybrids have ancestry estimates consistent with F1 hybridization: they derive half of their ancestry from wild Vitis and half from V. vinifera. CONCLUSIONS: Our results suggest that hybrid grape breeding is in its infancy. The distribution of V. vinifera ancestry across hybrids also suggests that backcrosses to wild Vitis species have been more frequent than backcrosses to V. vinifera during hybrid grape breeding. This pattern is unusual in crop breeding, as it is most common to repeatedly backcross to elite, or domesticated, germplasm. We anticipate our method can be extended to facilitate marker-assisted selection in order to introgress beneficial wild Vitis traits, while allowing for offspring with the highest V. vinifera content to be selected at the seedling stage.
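The abstract does not spell out its PCA-based ancestry estimation procedure. One simple way such an estimate could be formed, shown purely as an illustrative sketch, is to project a hybrid's principal-component coordinates onto the axis joining the centroids of the two reference groups. The function names, the two-dimensional coordinates and the clipping to [0, 1] are all assumptions introduced here, not the authors' method.

```python
def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ancestry_fraction(sample, wild_ref, vinifera_ref):
    """Position of `sample` along the wild -> vinifera centroid axis.

    Returns 0.0 at the wild centroid and 1.0 at the V. vinifera centroid,
    clipped to [0, 1]. Inputs are coordinates in PC space (hypothetical data).
    """
    w = centroid(wild_ref)
    v = centroid(vinifera_ref)
    axis = [vi - wi for vi, wi in zip(v, w)]
    norm_sq = sum(a * a for a in axis)
    # Scalar projection of (sample - wild centroid) onto the axis.
    t = sum((s - wi) * a for s, wi, a in zip(sample, w, axis)) / norm_sq
    return min(1.0, max(0.0, t))


wild = [(-1.0, 0.0), (-1.0, 0.2)]       # hypothetical PC1/PC2 scores
vinifera = [(1.0, 0.0), (1.0, 0.2)]
f1_like = ancestry_fraction((0.0, 0.5), wild, vinifera)
```

Under this sketch a sample midway between the two centroids scores 0.5, the value expected for an F1 hybrid deriving half of its ancestry from each group.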
Subjects
Breeding , Genome, Plant , Genomics , Vitis/genetics , Crops, Agricultural , Genomics/methods , Genotype , Hybridization, Genetic
ABSTRACT
UNLABELLED: Phylogenetic models are an important tool in molecular evolution allowing us to study the pattern and rate of sequence change. The recent influx of new sequence data in the biosciences means that to address evolutionary questions, we need a means for rapid and easy model development and implementation. Here we present GeLL, a Java library that lets users use text to quickly and efficiently define novel forms of discrete data and create new substitution models that describe how those data change on a phylogeny. GeLL allows users to define general substitution models and data structures in a way that is not possible in other existing libraries, including mixture models and non-reversible models. Classes are provided for calculating likelihoods, optimizing model parameters and branch lengths, ancestral reconstruction and sequence simulation. AVAILABILITY AND IMPLEMENTATION: http://phylo.bio.ku.edu/GeLL under a GPL v3 license.
Subjects
Phylogeny , Software , Likelihood Functions , Models, Genetic
ABSTRACT
Perennial crops represent important fresh and processed food sources worldwide, but advancements in breeding perennials are often impeded due to their very nature. The perennial crops we rely on most for food take several years to reach production maturity and require large spaces to grow, which make breeding new cultivars costly compared with most annual crops. Because breeding perennials is inefficient and expensive, they are often grown in monocultures consisting of small numbers of elite cultivars that are vegetatively propagated for decades or even centuries. This practice puts many perennial crops at risk for calamity since they remain stationary in the face of evolving pest and disease pressures. Although there is tremendous genetic diversity available to them, perennial crop breeders often struggle to generate commercially successful cultivars in a timely and cost-effective manner because of the high costs of breeding. Moreover, consumers often expect the same cultivars to be available indefinitely, and there is often little or no incentive for growers and retailers to take the risk of adopting new cultivars. While genomics studies linking DNA variants to commercially important traits have been performed in diverse perennial crops, the translation of these studies into accelerated breeding of improved cultivars has been limited. Here we explain the "perennial problem" in detail and demonstrate how modern genomics tools can significantly improve the cost effectiveness of breeding perennial crops and thereby prevent crucial food sources from succumbing to the perils of perpetual propagation.
Subjects
Agriculture/methods , Breeding , Crops, Agricultural/genetics , Genome, Plant , Genomics , Phenotype , Selection, Genetic , Crops, Agricultural/growth & development , DNA, Plant , Food Supply , Genetic Variation , Genotype , Plants, Genetically Modified , Reproduction, Asexual
ABSTRACT
MOTIVATION: Recent large-scale studies of individuals within a population have demonstrated that there is widespread variation in copy number in many gene families. In addition, there is increasing evidence that the variation in gene copy number can give rise to substantial phenotypic effects. In some cases, these variations have been shown to be adaptive. These observations show that a full understanding of the evolution of biological function requires an understanding of gene gain and gene loss. Accurate, robust evolutionary models of gain and loss events are, therefore, required. RESULTS: We have developed weighted parsimony and maximum likelihood methods for inferring gain and loss events. To test these methods, we have used Markov models of gain and loss to simulate data with known properties. We examine three models: a simple birth-death model, a single rate model and a birth-death innovation model with parameters estimated from Drosophila genome data. We find that for all simulations maximum likelihood-based methods are very accurate for reconstructing the number of duplication events on the phylogenetic tree, and that maximum likelihood and weighted parsimony have similar accuracy for reconstructing the ancestral state. Our implementations are robust to different model parameters and provide accurate inferences of ancestral states and the number of gain and loss events. For ancestral reconstruction, we recommend weighted parsimony because it has similar accuracy to maximum likelihood, but is much faster. For inferring the number of individual gene loss or gain events, maximum likelihood is noticeably more accurate, albeit at greater computational cost. AVAILABILITY: www.bioinf.manchester.ac.uk/dupliphy CONTACT: simon.lovell@manchester.ac.uk; simon.whelan@manchester.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
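To make the idea of weighted parsimony for gain and loss events concrete, the following is a minimal sketch of the standard Sankoff dynamic programme applied to gene copy numbers on a small tree. It is not the authors' implementation: the tree encoding (nested tuples with integer leaves), the cost parameters `gain_cost` and `loss_cost`, and the assumption that a change of several copies costs the sum of single steps are all illustrative choices.

```python
def step_cost(parent, child, gain_cost, loss_cost):
    """Cost of moving from `parent` copies to `child` copies along one branch."""
    if child >= parent:
        return gain_cost * (child - parent)
    return loss_cost * (parent - child)


def sankoff(tree, n_states, gain_cost=1.0, loss_cost=1.0):
    """Minimum subtree cost for each possible copy number at the root of `tree`.

    A leaf is an int (observed copy number); an internal node is a tuple of
    subtrees. Copy numbers are restricted to 0 .. n_states - 1.
    """
    inf = float("inf")
    if isinstance(tree, int):
        return [0.0 if s == tree else inf for s in range(n_states)]
    costs = [0.0] * n_states
    for child in tree:
        child_costs = sankoff(child, n_states, gain_cost, loss_cost)
        for parent in range(n_states):
            # Best child state given this parent state, plus the branch cost.
            costs[parent] += min(
                child_costs[c] + step_cost(parent, c, gain_cost, loss_cost)
                for c in range(n_states)
            )
    return costs


family_tree = ((1, 1), (2, 3))          # copy numbers observed in four species
root_costs = sankoff(family_tree, n_states=4)
best_score = min(root_costs)
```

With equal gain and loss costs the cheapest history for this toy family needs two single-copy events, for example one gain on the branch leading to the right-hand clade and one more on the branch to the species with three copies. Raising `loss_cost` relative to `gain_cost` is how a weighting scheme would favour histories with fewer losses.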
Subjects
Drosophila/genetics , Evolution, Molecular , Models, Genetic , Animals , Computer Simulation , Drosophila/classification , Genome, Insect , Likelihood Functions , Markov Chains
ABSTRACT
Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. 
Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.
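The figure of 10,395 trees for eight taxa follows from the standard count of labelled unrooted bifurcating topologies, (2n-5)!! for n taxa. The short sketch below computes that count and, for comparison, the number of topologies one nearest-neighbor-interchange step can reach, 2(n-3). The function names are chosen here for illustration.

```python
def num_unrooted_bifurcating_trees(n_taxa):
    """Number of labelled unrooted bifurcating topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3, 5, ..., 2n - 5
        count *= k
    return count


def num_nni_neighbours(n_taxa):
    """Topologies reachable by one NNI move: two per internal branch."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return 2 * (n_taxa - 3)


all_trees = num_unrooted_bifurcating_trees(8)
nni_moves = num_nni_neighbours(8)
```

For eight taxa this gives 10,395 topologies, so scoring every tree for each of the 106 genes covers 1,101,870 tree-gene combinations, while a single NNI step from any one tree reaches only 10 of the other topologies.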
Subjects
Phylogeny , Yeasts/genetics , Algorithms , Classification/methods , Computational Biology , Genome , Likelihood Functions , Models, Biological , Yeasts/classification
ABSTRACT
Many metagenomic and environmental DNA studies require the taxonomic assignment of individual reads or sequences by aligning reads to a reference database, known as taxonomic binning. When a read aligns to more than one reference sequence, it is often classified based on sequence similarity. This step can assign reads to incorrect taxa, at a rate which depends both on the assignment algorithm and on underlying population genetic and database parameters. In particular, as we move towards using environmental DNA to study eukaryotic taxa subject to regular recombination, we must take into account issues concerning gene tree discordance. Though accuracy is often compared across algorithms using a fixed data set, the relative impact of these population genetic and database parameters on accuracy has not yet been quantified. Here, we develop both a theoretical and simulation framework in the simplified case of two reference species, and compute binning accuracy over a wide range of parameters, including sequence length, species-query divergence time, divergence times of the reference species, reference database completeness, sample age and effective population size. We consider two assignment methods and contextualize our results using parameters from a recent ancient environmental DNA study, comparing them to the commonly used discriminative k-mer-based method Clark (Current Biology, 31, 2021, 2728; BMC Genomics, 16, 2015, 1). Our results quantify the degradation in assignment accuracy as the samples diverge from their closest reference sequence, and with incompleteness of reference sequences. We also provide a framework in which others can compute expected accuracy for their particular method or parameter set. Code is available at https://github.com/bdesanctis/binning-accuracy.
Subjects
DNA, Environmental , Algorithms , Metagenome , Metagenomics/methods , Sequence Analysis, DNA , Software
ABSTRACT
Phylogenetic tree-search is a major aspect of many evolutionary studies. Several tree rearrangement algorithms are available for tree-search, but it is hard to draw general conclusions about their relative performance because many effects are data set specific and can be highly dependent on individual implementations (e.g., RAxML or phyml). Using only the structure of the rearrangements proposed by the Nearest Neighbor Interchange (NNI) algorithm, we show tree-search can prematurely terminate if it encounters multifurcating trees. We validate the relevance of this result by demonstrating that in real data the majority of possible bifurcating trees potentially encountered during tree-search are actually multifurcations, which suggests NNI would be expected to perform poorly. We also show that the star-decomposition algorithm is a special case of two other popular tree-search algorithms, subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR), which means that these two algorithms can efficiently escape when they encounter multifurcations. We caution against the use of the NNI algorithm and for most applications we recommend the use of more robust tree-search algorithms, such as SPR and TBR.
Subjects
Evolution, Molecular , Models, Genetic , Phylogeny , Sequence Alignment/statistics & numerical data , Algorithms , Likelihood Functions
ABSTRACT
Apple (Malus × domestica Borkh.) is one of the world's most valuable fruit crops. Its large size and long juvenile phase make it a particularly promising candidate for marker-assisted selection (MAS). However, advances in MAS in apple have been limited by a lack of phenotype and genotype data from sufficiently large samples. To establish genotype-phenotype relationships and advance MAS in apple, we extracted over 24,000 phenotype scores from the USDA-Germplasm Resources Information Network (GRIN) database and linked them with over 8000 single nucleotide polymorphisms (SNPs) from 689 apple accessions from the USDA apple germplasm collection clonally preserved in Geneva, NY. We find significant genetic differentiation between Old World and New World cultivars and demonstrate that the genetic structure of the domesticated apple also reflects the time required for ripening. A genome-wide association study (GWAS) of 36 phenotypes confirms the association between fruit color and the MYB1 locus, and we also report a novel association between the transcription factor, NAC18.1, and harvest date and fruit firmness. We demonstrate that harvest time and fruit size can be predicted with relatively high accuracies (> 0.46) using genomic prediction. Rapid decay of linkage disequilibrium (LD) in apples means millions of SNPs may be required for well-powered GWAS. However, rapid LD decay also promises to enable extremely high resolution mapping of causal variants, which holds great potential for advancing MAS.
Subjects
Genome, Plant/genetics , Malus/genetics , Chromosome Mapping , Fruit/genetics , Genome-Wide Association Study , Genotype , Linkage Disequilibrium , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates.
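A k-nearest-neighbor imputation that ignores marker order can be sketched as follows. This is an illustrative reading of the description above, not the LinkImpute code: choosing informative markers by squared correlation with the target marker, the mean absolute genotype difference as the distance, the `1 / (1 + distance)` vote weight, and the parameter names `k` and `l` are all assumptions of this sketch. Genotypes are coded 0, 1 or 2, with `None` for missing calls.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation over positions where both values are known."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def impute(geno, sample, marker, k=2, l=2):
    """Impute geno[sample][marker] from the k most similar samples."""
    def column(m):
        return [row[m] for row in geno]

    target = column(marker)
    # 1. Rank the other markers by correlation with the target, not by position.
    others = [m for m in range(len(geno[0])) if m != marker]
    others.sort(key=lambda m: r_squared(target, column(m)), reverse=True)
    informative = others[:l]
    # 2. Distance from the focal sample to each sample genotyped at the target.
    neighbours = []
    for j, row in enumerate(geno):
        if j == sample or row[marker] is None:
            continue
        diffs = [abs(row[m] - geno[sample][m]) for m in informative
                 if row[m] is not None and geno[sample][m] is not None]
        if diffs:
            neighbours.append((sum(diffs) / len(diffs), row[marker]))
    neighbours.sort(key=lambda pair: pair[0])
    # 3. Distance-weighted vote among the k nearest neighbours.
    votes = {}
    for dist, genotype in neighbours[:k]:
        votes[genotype] = votes.get(genotype, 0.0) + 1.0 / (1.0 + dist)
    return max(votes, key=votes.get) if votes else None


genotypes = [
    [0, 0, 2, None],   # sample 0: last marker is missing
    [0, 0, 1, 0],
    [2, 2, 0, 2],
    [2, 2, 1, 2],
    [0, 0, 0, 0],
]
imputed = impute(genotypes, sample=0, marker=3)
```

In this toy matrix the first two markers track the target marker perfectly and the third does not, so they are the ones used to find neighbours regardless of where they sit in the column order.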
Subjects
Genome-Wide Association Study/methods , Genotype , Genotyping Techniques/methods , Software , Genome, Plant , Malus/genetics , Polymorphism, Single Nucleotide , Sensitivity and Specificity
ABSTRACT
The complement of genes found in the genome is a balance between gene gain and gene loss. Knowledge of the specific genes that are gained and lost over evolutionary time allows an understanding of the evolution of biological functions. Here we use new evolutionary models to infer gene family histories across complete yeast genomes; these models allow us to estimate the relative genome-wide rates of gene birth, death, innovation and extinction (loss of an entire family) for the first time. We show that the rates of gene family evolution vary both between gene families and between species. We are also able to identify those families that have experienced rapid lineage specific expansion/contraction and show that these families are enriched for specific functions. Moreover, we find that families with specific functions are repeatedly expanded in multiple species, suggesting the presence of common adaptations and that these family expansions/contractions are not random. Additionally, we identify potential specialisations, unique to specific species, in the functions of lineage specific expanded families. These results suggest that an important mechanism in the evolution of genome content is the presence of lineage-specific gene family changes.