ABSTRACT
Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability and Implementation: KAT is available under the GPLv3 license at https://github.com/TGAC/KAT. Contact: bernardo.clavijo@earlham.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Plant Genome , High-Throughput Nucleotide Sequencing/standards , Quality Control , DNA Sequence Analysis/standards , Software , Fraxinus/genetics , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods
ABSTRACT
The Sequence Distance Graph (SDG) framework works with genome assembly graphs and raw data from paired, linked and long reads. It includes a simple de Bruijn graph module, and can import graphs using the graphical fragment assembly (GFA) format. It also maps raw reads onto graphs, and provides a Python application programming interface (API) to navigate the graph, access the mapped and raw data and perform interactive or scripted analyses. Its complete workspace can be dumped to and loaded from disk, decoupling mapping from analysis and supporting multi-stage pipelines. We present the design and implementation of the framework, and example analyses scaffolding a short read graph with long reads, and navigating paths in a heterozygous graph for a simulated parent-offspring trio dataset. SDG is freely available under the MIT license at https://github.com/bioinfologics/sdg.
Subject(s)
DNA Sequence Analysis , Software , Genomics
ABSTRACT
We used 20 de novo genome assemblies to probe the speciation history and architecture of gene flow in rapidly radiating Heliconius butterflies. Our tests to distinguish incomplete lineage sorting from introgression indicate that gene flow has obscured several ancient phylogenetic relationships in this group over large swathes of the genome. Introgressed loci are underrepresented in low-recombination and gene-rich regions, consistent with the purging of foreign alleles more tightly linked to incompatibility loci. Here, we identify a hitherto unknown inversion that traps a color pattern switch locus. We infer that this inversion was transferred between lineages by introgression and is convergent with a similar rearrangement in another part of the genus. These multiple de novo genome sequences enable improved understanding of the importance of introgression and selective processes in adaptive radiation.
Subject(s)
Butterflies/genetics , Gene Flow , Genetic Introgression , Insect Genome , Animals , Biological Evolution , Butterflies/anatomy & histology , Chromosome Inversion , Insect Genes , Genetic Speciation , Phylogeny , Animal Wings/anatomy & histology
ABSTRACT
Accelerating international trade and climate change make pathogen spread an increasing concern. Hymenoscyphus fraxineus, the causal agent of ash dieback, is a fungal pathogen that has been moving across continents and hosts from Asian to European ash. Most European common ash trees (Fraxinus excelsior) are highly susceptible to H. fraxineus, although a minority (~5%) have partial resistance to dieback. Here, we assemble and annotate a H. fraxineus draft genome, which approaches chromosome scale. Pathogen genetic diversity across Europe and in Japan reveals a strong bottleneck in Europe, though a signal of adaptive diversity remains in key host interaction genes. We find that the European population was founded by two divergent haploid individuals. Divergence between these haplotypes represents the ancestral polymorphism within a large source population. Subsequent introduction from this source would greatly increase the adaptive potential of the pathogen. Thus, further introgression of H. fraxineus into Europe represents a potential threat and Europe-wide biological security measures are needed to manage this disease.