ABSTRACT
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Subjects
Gene Expression Profiling , RNA-Seq , Humans , Animals , Mice , RNA-Seq/methods , Gene Expression Profiling/methods , Transcriptome , RNA Sequence Analysis/methods , Molecular Sequence Annotation/methods
ABSTRACT
We assembled the 9.8-Gbp genome of western redcedar (WRC; Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis, one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci showing zero diversity, and the ratio of π at zero- to fourfold degenerate sites is relatively high (approximately 0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. Although overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Nonreference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.
Subjects
Tracheophyta , Tracheophyta/genetics , Self-Fertilization/genetics , Alleles , Heterozygote , Genetic Polymorphism , Genetic Variation , Genetic Selection
ABSTRACT
Conifers are long-lived and slow-evolving, thus requiring effective defences against their fast-evolving insect natural enemies. The copy number variation (CNV) of two key acetophenone biosynthesis genes Ugt5/Ugt5b and βglu-1 may provide a plausible mechanism underlying the constitutively variable defence in white spruce (Picea glauca) against its primary defoliator, spruce budworm. This study develops a long-insert sequence capture probe set (Picea_hung_p1.0) for quantifying copy number of βglu-1-like, Ugt5-like genes and single-copy genes on 38 Norway spruce (Picea abies) and 40 P. glauca individuals from eight and nine provenances across Europe and North America respectively. We developed local assemblies (Piabi_c1.0 and Pigla_c.1.0), full-length transcriptomes (PIAB_v1 and PIGL_v1), and gene models to characterise the diversity of βglu-1 and Ugt5 genes. We observed very large copy numbers of βglu-1, with up to 381 copies in a single P. glauca individual. We observed among-provenance CNV of βglu-1 in P. glauca but not P. abies. Ugt5b was predominantly single-copy in both species. This study generates critical hypotheses for testing the emergence and mechanism of extreme CNV, the dosage effect on phenotype, and the varying copy number of genes within the same pathway. We demonstrate new approaches to overcome experimental challenges in genomic research in conifer defences.
Subjects
Picea , Humans , Picea/genetics , Picea/metabolism , DNA Copy Number Variations , beta-Glucosidase/genetics , Genomics , Transcriptome
ABSTRACT
Spruces (Picea spp.) are coniferous trees widespread in boreal and mountainous forests of the northern hemisphere, with large economic significance and enormous contributions to global carbon sequestration. Spruces harbor very large genomes with high repetitiveness, hampering their comparative analysis. Here, we present and compare the genomes of four different North American spruces: the genome assemblies for Engelmann spruce (Picea engelmannii) and Sitka spruce (Picea sitchensis) together with improved and more contiguous genome assemblies for white spruce (Picea glauca) and for a naturally occurring introgress of these three species known as interior spruce (P. engelmannii × glauca × sitchensis). The genomes were structurally similar, and a large part of scaffolds could be anchored to a genetic map. The composition of the interior spruce genome indicated asymmetric contributions from the three ancestral genomes. Phylogenetic analysis of the nuclear and organelle genomes revealed a topology indicative of ancient reticulation. Different patterns of expansion of gene families among genomes were observed and related with presumed diversifying ecological adaptations. We identified rapidly evolving genes that harbored high rates of non-synonymous polymorphisms relative to synonymous ones, indicative of positive selection and its hitchhiking effects. These gene sets were mostly distinct between the genomes of ecologically contrasted species, and signatures of convergent balancing selection were detected. Stress and stimulus response was identified as the most frequent function assigned to expanding gene families and rapidly evolving genes. These two aspects of genomic evolution were complementary in their contribution to divergent evolution of presumed adaptive nature. 
These more contiguous spruce giga-genome sequences should strengthen our understanding of conifer genome structure and evolution, as their comparison offers clues into the genetic basis of adaptation and ecology of conifers at the genomic level. They will also provide tools to better monitor natural genetic diversity and improve the management of conifer forests. The genomes of four closely related North American spruces indicate that their high similarity at the morphological level is paralleled by the high conservation of their physical genome structure. Yet, the evidence of divergent evolution is apparent in their rapidly evolving genomes, supported by differential expansion of key gene families and large sets of genes under positive selection, largely in relation to stimulus and environmental stress response.
Subjects
Picea , Tracheophyta , Expressed Sequence Tags , Plant Genome/genetics , Multigene Family/genetics , Phylogeny , Picea/genetics , Tracheophyta/genetics
ABSTRACT
BACKGROUND: The mountain pine beetle, Dendroctonus ponderosae, is an irruptive bark beetle that causes extensive mortality to many pine species within the forests of western North America. Driven by climate change and wildfire suppression, a recent mountain pine beetle (MPB) outbreak has spread across more than 18 million hectares, including areas to the east of the Rocky Mountains that comprise populations and species of pines not previously affected. Despite its impacts, there are few tactics available to control MPB populations. Beauveria bassiana is an entomopathogenic fungus used as a biological agent in agriculture and forestry and has potential as a management tactic for the mountain pine beetle population. This work investigates the phenotypic and genomic variation between B. bassiana strains to identify optimal strains against a specific insect. RESULTS: Using comparative genome and transcriptome analyses of eight B. bassiana isolates, we have identified the genetic basis of virulence, which includes oosporein production. Genes unique to the more virulent strains included functions in biosynthesis of mycotoxins, membrane transporters, and transcription factors. Significant differential expression of genes related to virulence, transmembrane transport, and stress response was identified between the different strains, as well as up to nine-fold upregulation of genes involved in the biosynthesis of oosporein. Differential correlation analysis revealed transcription factors that may be involved in regulating oosporein production. CONCLUSION: This study provides a foundation for the selection and/or engineering of the most effective strain of B. bassiana for the biological control of populations of mountain pine beetle and other insect pests.
Subjects
Beauveria , Beetles , Animals , Beauveria/genetics , Virulence/genetics , Genomics
ABSTRACT
Despite the rapid advance in single-cell RNA sequencing (scRNA-seq) technologies within the last decade, single-cell transcriptome analysis workflows have primarily used gene expression data while isoform sequence analysis at the single-cell level still remains fairly limited. Detection and discovery of isoforms in single cells is difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this challenge, we developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, thus enabling unbiased discovery of novel isoforms or foreign transcripts. We compared both assembly strategies of RNA-Bloom against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In our benchmarks on a simulated 384-cell data set, reference-free RNA-Bloom reconstructed 37.9%-38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1%-11.6% more isoforms than reference-based assemblers. When applied to a real 3840-cell data set consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7%-25.0% more isoforms than the best competing reference-based and reference-free approaches evaluated. We expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is informatically accessible now.
Subjects
Gene Expression Profiling/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome/genetics , Algorithms , Animals , Base Sequence , Humans , Mice , Protein Isoforms/genetics , Software
ABSTRACT
MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
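To illustrate why spaced seeds tolerate mismatches where contiguous k-mers do not, the following minimal Python sketch masks the "don't care" positions of a seed pattern before comparing words. It is not the ntHash2 algorithm (which uses an efficient hashing scheme not reproduced here); the function names and the seed pattern `110011` are assumptions chosen for illustration only.

```python
def apply_spaced_seed(window: str, seed: str) -> str:
    """Keep bases at '1' (care) positions and mask '0' (don't care) positions."""
    if len(window) != len(seed):
        raise ValueError("window and seed must have the same length")
    return "".join(base if flag == "1" else "*" for base, flag in zip(window, seed))


def spaced_seed_words(seq: str, seed: str) -> list[str]:
    """Slide the seed along the sequence and return every masked word."""
    k = len(seed)
    return [apply_spaced_seed(seq[i:i + k], seed) for i in range(len(seq) - k + 1)]


# Hypothetical seed: the two middle positions are ignored.
SEED = "110011"
reference = "ACGTAC"
read = "ACTTAC"  # differs from the reference at one masked position

# The plain 6-mers differ, but the masked words are identical,
# so any hash of the masked word would also match.
print(reference == read)                                                # False
print(apply_spaced_seed(reference, SEED) == apply_spaced_seed(read, SEED))  # True
```

In practice the masked word would be hashed rather than compared as a string; the masking step is the part that gives the tolerance to mismatches.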
Subjects
Algorithms , Software , Base Sequence , Seeds , DNA Sequence Analysis
ABSTRACT
Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
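As background for the data structure named above, here is a minimal sketch of an ordinary Bloom filter in Python. It is not the multi-index Bloom filter (miBF) itself, which additionally associates stored elements with target identifiers; the class name, parameters, and the use of salted BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: set membership with false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Store a few k-mers; memory use depends on num_bits, not on k-mer length.
bf = BloomFilter(num_bits=10_000, num_hashes=3)
for kmer in ("ACGTACGT", "TTGACCAA", "GGGTTTAA"):
    bf.add(kmer)

print("ACGTACGT" in bf)  # True: inserted items are always reported as present
```

Because only bit positions are stored, the footprint is fixed by `num_bits` regardless of how long the stored sequences are, which is the property the abstract highlights for seed storage.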
Subjects
DNA Sequence Analysis/methods , Software , Animals , Base Pair Mismatch , Humans , Phylogeny , Sequence Alignment , DNA Sequence Analysis/standards
ABSTRACT
BACKGROUND: De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k − 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so lower values can provide good connectivity in lesser covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. RESULTS: Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. CONCLUSIONS: RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit by working with a more accurate and less complex representation of the genome. 
The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .
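The de Bruijn graph construction described in the background can be sketched in a few lines of Python. This is a toy illustration of the node and edge definition only (k-mers as nodes, edges between nodes whose suffix and prefix of length k − 1 agree); it is not RResolver or ABySS code, and the function names and example reads are hypothetical.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> list[str]:
    """Return every substring of length k in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each k-mer node to the nodes it overlaps by k - 1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index nodes by their (k - 1)-base prefix for fast successor lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # A node's successors are the nodes whose prefix equals its (k - 1)-base suffix.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


graph = build_de_bruijn(["ACGTT", "CGTTA"], k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
# ACG -> ['CGT']
# CGT -> ['GTT']
# GTT -> ['TTA']
# TTA -> []
```

Here every node has at most one successor, so the walk ACG, CGT, GTT, TTA is unambiguous and could be merged into the single sequence ACGTTA. A node with two or more successors would be a branching point of the kind the abstract says is evaluated for read support.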
Subjects
Genomics , Software , Algorithms , Genome , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans , DNA Sequence Analysis/methods
ABSTRACT
BACKGROUND: Antibiotic resistance is a growing global health concern prompting researchers to seek alternatives to conventional antibiotics. Antimicrobial peptides (AMPs) are attracting attention again as therapeutic agents with promising utility in this domain, and using in silico methods to discover novel AMPs is a strategy that is gaining interest. Such methods can sift through large volumes of candidate sequences and reduce lab screening costs. RESULTS: Here we introduce AMPlify, an attentive deep learning model for AMP prediction, and demonstrate its utility in prioritizing peptide sequences derived from the Rana [Lithobates] catesbeiana (bullfrog) genome. We tested the bioactivity of our predicted peptides against a panel of bacterial species, including representatives from the World Health Organization's priority pathogens list. Four of our novel AMPs were active against multiple species of bacteria, including a multi-drug resistant isolate of carbapenemase-producing Escherichia coli. CONCLUSIONS: We demonstrate the utility of deep learning based tools like AMPlify in our fight against antibiotic resistance. We expect such tools to play a significant role in discovering novel candidates of peptide-based alternatives to classical antibiotics.
Subjects
Antimicrobial Cationic Peptides , Deep Learning , Anti-Bacterial Agents/pharmacology , Antimicrobial Peptides , Attention , World Health Organization
ABSTRACT
Overcoming drug resistance and targeting cancer stem cells remain challenges for curative cancer treatment. To investigate the role of microRNAs (miRNAs) in regulating drug resistance and leukemic stem cell (LSC) fate, we performed global transcriptome profiling in treatment-naive chronic myeloid leukemia (CML) stem/progenitor cells and identified that miR-185 levels anticipate their response to ABL tyrosine kinase inhibitors (TKIs). miR-185 functions as a tumor suppressor: its restored expression impaired survival of drug-resistant cells, sensitized them to TKIs in vitro, and markedly eliminated long-term repopulating LSCs and infiltrating blast cells, conferring a survival advantage in preclinical xenotransplantation models. Integrative analysis with mRNA profiles uncovered PAK6 as a crucial target of miR-185, and pharmacological inhibition of PAK6 perturbed the RAS/MAPK pathway and mitochondrial activity, sensitizing therapy-resistant cells to TKIs. Thus, miR-185 presents as a potential predictive biomarker, and dual targeting of miR-185-mediated PAK6 activity and BCR-ABL1 may provide a valuable strategy for overcoming drug resistance in patients.
Subjects
Antineoplastic Drug Resistance/genetics , BCR-ABL Positive Chronic Myelogenous Leukemia/genetics , MicroRNAs/genetics , Neoplastic Stem Cells/pathology , p21-Activated Kinases/genetics , Animals , Leukemic Gene Expression Regulation/genetics , Heterografts , Humans , BCR-ABL Positive Chronic Myelogenous Leukemia/drug therapy , BCR-ABL Positive Chronic Myelogenous Leukemia/metabolism , Mice , SCID Mice , MicroRNAs/metabolism , Neoplastic Stem Cells/metabolism , Protein Kinase Inhibitors/therapeutic use , Signal Transduction/physiology , p21-Activated Kinases/metabolism
ABSTRACT
We report on the synthesis of bivalent water-soluble calix[4]arene and calix[5]arene hosts, Super-sCx4 and Super-sCx5, as new broad-spectrum supramolecular binders of neuromuscular blocking agents (NMBAs). Synthesis was achieved using the target bisquaternary amine NMBAs as a template to link two highly anionic p-sulfonatocalixarene building blocks in aqueous solution. Bivalent anionic hosts Super-sCx4 and Super-sCx5 bind by engaging both quaternary amines present on a variety of NMBAs. We report low µM binding to structurally diverse alkyl, steroidal, curarine and benzylisoquinoline NMBAs with high selectivity over the neurotransmitter acetylcholine and a variety of other hydrophobic amines.
Subjects
Calixarenes/chemical synthesis , Neuromuscular Blocking Agents/chemical synthesis , Amines/chemistry , Calixarenes/chemistry , Molecular Structure , Neuromuscular Blocking Agents/chemistry
ABSTRACT
BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. RESULTS: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. CONCLUSIONS: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch .
Subjects
Genomics , High-Throughput Nucleotide Sequencing , Genome , Humans , Nucleic Acid Repetitive Sequences , DNA Sequence Analysis
ABSTRACT
SUMMARY: Presence or absence of gene fusions is one of the most important diagnostic markers in many cancer types. Consequently, fusion detection methods using various genomics data types, such as RNA sequencing (RNA-seq) are valuable tools for research and clinical applications. While information-rich RNA-seq data have proven to be instrumental in discovery of a number of hallmark fusion events, bioinformatics tools to detect fusions still have room for improvement. Here, we present Fusion-Bloom, a fusion detection method that leverages recent developments in de novo transcriptome assembly and assembly-based structural variant calling technologies (RNA-Bloom and PAVFinder, respectively). We benchmarked Fusion-Bloom against the performance of five other state-of-the-art fusion detection tools using multiple datasets. Overall, we observed Fusion-Bloom to display a good balance between detection sensitivity and specificity. We expect the tool to find applications in translational research and clinical genomics pipelines. AVAILABILITY AND IMPLEMENTATION: Fusion-Bloom is implemented as a UNIX Make utility, available at https://github.com/bcgsc/pavfinder and released under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Software , Transcriptome , Genomics , RNA , RNA Sequence Analysis
ABSTRACT
SUMMARY: The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies still do not reach reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. AVAILABILITY AND IMPLEMENTATION: ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
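The idea of an ordered minimizer sketch can be illustrated with a short Python example: for every window of w consecutive k-mers, keep the smallest k-mer together with its position. This is a simplified sketch rather than ntJoin's implementation; it assumes lexicographic order as the ranking (real tools typically rank k-mers by hash value), and the function name and parameters are hypothetical.

```python
def minimizer_sketch(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return the ordered list of (position, k-mer) minimizers of seq.

    Each window of w consecutive k-mers contributes its smallest k-mer
    (ties broken by leftmost position); consecutive duplicates are collapsed.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch: list[tuple[int, str]] = []
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        pick = (start + offset, window[offset])
        if not sketch or sketch[-1] != pick:
            sketch.append(pick)
    return sketch


print(minimizer_sketch("GATTACA", k=3, w=3))
# [(1, 'ATT'), (4, 'ACA')]
```

Two sequences that share long stretches will share many minimizers in the same order, which is what allows a draft assembly and a reference to be related without base-level alignment.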
Subjects
High-Throughput Nucleotide Sequencing , Software , Human Genome , Humans , DNA Sequence Analysis , Synteny
ABSTRACT
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
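For readers unfamiliar with the contiguity metric quoted above, the following Python sketch computes NG50 under its usual definition: the length L such that scaffolds of length L or longer together cover at least half of the estimated genome size. The function name and the example lengths are illustrative assumptions; NGA50 is computed the same way but on aligned blocks after breaking scaffolds at misassemblies, which this sketch does not attempt.

```python
def ng50(lengths: list[int], genome_size: int) -> int:
    """Return the NG50 of a set of scaffold lengths, or 0 if they cover under half the genome."""
    half = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


scaffolds = [50, 40, 30, 20, 10]          # toy scaffold lengths, total 150
print(ng50(scaffolds, genome_size=200))   # 30 (50 + 40 + 30 = 120 >= 100)
print(ng50(scaffolds, genome_size=150))   # 40 (equals N50 when genome size is the assembly size)
```

Using the genome size rather than the assembly size as the denominator makes the metric comparable across assemblies of the same genome that differ in total length.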
Subjects
Contig Mapping/methods , Genomics/methods , Software , Contig Mapping/standards , Genome Size , Genomics/standards , Humans , DNA Sequence Analysis/methods , DNA Sequence Analysis/standards
ABSTRACT
SUMMARY: The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but without requiring users to interact directly with Docker, suitable for novices who may not yet have familiarity with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, for teaching an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research. AVAILABILITY AND IMPLEMENTATION: The ORCA Docker image is available at https://hub.docker.com/r/bcgsc/orca/. The source code of ORCA is available at https://github.com/bcgsc/orca under the MIT license.
Subjects
Computational Biology , Software , Genome
ABSTRACT
MOTIVATION: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. RESULTS: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. AVAILABILITY AND IMPLEMENTATION: https://github.com/bcgsc/ntedit. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genomics , High-Throughput Nucleotide Sequencing , Animals , Human Genome , Haploidy , Humans , DNA Sequence Analysis , Software
ABSTRACT
Motivation: Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10x Genomics. Results: Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. Availability and implementation: https://github.com/bcgsc/ARCS/. Contact: rwarren@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
Subjects
Human Genome , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Software , Genomics/methods , Humans
ABSTRACT
Motivation: Sequencing studies on non-model organisms often interrogate both genomes and transcriptomes with massive amounts of short sequences. Such studies require de novo analysis tools and techniques, when the species and closely related species lack high quality reference resources. For certain applications such as de novo annotation, information on putative exons and alternative splicing may be desirable. Results: Here we present ChopStitch, a new method for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are represented as splice graphs in DOT output format. Availability and implementation: ChopStitch is written in Python and C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ChopStitch. Contact: hkhan@bcgsc.ca or ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.