ABSTRACT
We assembled the 9.8-Gbp genome of western redcedar (WRC; Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis, one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci showing zero diversity, and the ratio of π at zero- to fourfold degenerate sites is relatively high (approximately 0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. Although overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Nonreference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.
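As a point of reference for "faster than expected": under the standard neutral expectation for self-fertilization, heterozygosity halves with each generation of selfing. The sketch below only tabulates that textbook expectation; the function name and the starting value of 1.0 are illustrative assumptions, not values or code from the study.

```python
def expected_heterozygosity(h0: float, generations: int) -> float:
    """Neutral expectation: each generation of selfing halves heterozygosity."""
    return h0 * 0.5 ** generations


# Fraction of the initial heterozygosity expected to remain after 0..5
# generations of selfing (h0 = 1.0 is an arbitrary starting value).
for generation in range(6):
    print(generation, expected_heterozygosity(1.0, generation))
```

After five generations the expected residual fraction is 1/32, about 3%, which is the baseline against which loci that retain heterozygosity would stand out.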
Subject(s)
Tracheophyta , Tracheophyta/genetics , Self-Fertilization/genetics , Alleles , Heterozygote , Polymorphism, Genetic , Genetic Variation , Selection, Genetic
ABSTRACT
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
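The idea of representing a de Bruijn graph with a Bloom filter can be illustrated with a minimal sketch. This is not the ABySS 2.0 implementation: the class, the hash scheme (salted BLAKE2b from the Python standard library), the filter size, and the toy sequence are all assumptions chosen for readability. Only k-mer membership is stored; edges are implied by asking the filter whether each of the four possible one-base extensions is present.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several hash values by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bloom: BloomFilter, kmer: str) -> list:
    """Outgoing neighbours in the implicit graph: test the four extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers("ACGTACGGA", 4):
    bloom.add(km)

print(successors(bloom, "ACGT"))
```

Because the filter stores only bits, memory depends on the chosen filter size rather than on storing each k-mer explicitly; the price is that a query can occasionally report a k-mer that was never inserted, so a real assembler has to tolerate or filter such false branches.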
Subject(s)
Contig Mapping/methods , Genomics/methods , Software , Contig Mapping/standards , Genome Size , Genomics/standards , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
ABSTRACT
SUMMARY: The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but without requiring users to interact directly with Docker, suitable for novices who may not yet have familiarity with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, for teaching an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research. AVAILABILITY AND IMPLEMENTATION: The ORCA Docker image is available at https://hub.docker.com/r/bcgsc/orca/. The source code of ORCA is available at https://github.com/bcgsc/orca under the MIT license.
Subject(s)
Computational Biology , Software , Genome
ABSTRACT
BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. RESULTS: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). CONCLUSIONS: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. 
The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
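The Jaccard index mentioned above is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming each contig end is summarized by the set of linked-read barcodes mapped to it; the barcode names and the function are hypothetical and this is not code from ARKS.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |a & b| / |a | b|, defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical barcode sets observed at the ends of two contigs.
end_of_contig_1 = {"BX1", "BX2", "BX3", "BX4"}
start_of_contig_2 = {"BX3", "BX4", "BX5"}

print(jaccard(end_of_contig_1, start_of_contig_2))
```

Intuitively, regions that lie close together on the same molecule share more barcodes, so a higher index suggests a shorter gap; the abstract describes calibrating that relationship on regions of known distance within contigs.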
Subject(s)
Chromosomes, Human/genetics , Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Humans
ABSTRACT
BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. RESULTS: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. CONCLUSIONS: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. 
Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , Chromosomes, Human/genetics , Genome, Human , Genomics , Humans , Nanopores , Repetitive Sequences, Nucleic Acid
ABSTRACT
Medulloblastoma, the most common malignant paediatric brain tumour, is currently treated with nonspecific cytotoxic therapies including surgery, whole-brain radiation, and aggressive chemotherapy. As medulloblastoma exhibits marked intertumoural heterogeneity, with at least four distinct molecular variants, previous attempts to identify targets for therapy have been underpowered because of small sample sizes. Here we report somatic copy number aberrations (SCNAs) in 1,087 unique medulloblastomas. SCNAs are common in medulloblastoma, and are predominantly subgroup-enriched. The most common region of focal copy number gain is a tandem duplication of SNCAIP, a gene associated with Parkinson's disease, which is exquisitely restricted to Group 4α. Recurrent translocations of PVT1, including PVT1-MYC and PVT1-NDRG1, that arise through chromothripsis are restricted to Group 3. Numerous targetable SCNAs, including recurrent events targeting TGF-β signalling in Group 3, and NF-κB signalling in Group 4, suggest future avenues for rational, targeted therapy.
Subject(s)
Cerebellar Neoplasms/classification , Cerebellar Neoplasms/genetics , Genome, Human/genetics , Genomic Structural Variation/genetics , Medulloblastoma/classification , Medulloblastoma/genetics , Carrier Proteins/genetics , Cerebellar Neoplasms/metabolism , Child , DNA Copy Number Variations/genetics , Gene Duplication/genetics , Genes, myc/genetics , Genomics , Hedgehog Proteins/metabolism , Humans , Medulloblastoma/metabolism , NF-kappa B/metabolism , Nerve Tissue Proteins/genetics , Oncogene Proteins, Fusion/genetics , Proteins/genetics , RNA, Long Noncoding , Signal Transduction , Transforming Growth Factor beta/metabolism , Translocation, Genetic/genetics
ABSTRACT
Follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL) are the two most common non-Hodgkin lymphomas (NHLs). Here we sequenced tumour and matched normal DNA from 13 DLBCL cases and one FL case to identify genes with mutations in B-cell NHL. We analysed RNA-seq data from these and another 113 NHLs to identify genes with candidate mutations, and then re-sequenced tumour and matched normal DNA from these cases to confirm 109 genes with multiple somatic mutations. Genes with roles in histone modification were frequent targets of somatic mutation. For example, 32% of DLBCL and 89% of FL cases had somatic mutations in MLL2, which encodes a histone methyltransferase, and 11.4% and 13.4% of DLBCL and FL cases, respectively, had mutations in MEF2B, a calcium-regulated gene that cooperates with CREBBP and EP300 in acetylating histones. Our analysis suggests a previously unappreciated disruption of chromatin biology in lymphomagenesis.
Subject(s)
Histones/metabolism , Lymphoma, Non-Hodgkin/genetics , Mutation/genetics , Chromatin/genetics , Chromatin/metabolism , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Genome, Human/genetics , Histone Acetyltransferases/genetics , Histone Acetyltransferases/metabolism , Histone Methyltransferases , Histone-Lysine N-Methyltransferase/genetics , Histone-Lysine N-Methyltransferase/metabolism , Humans , Loss of Heterozygosity/genetics , Lymphoma, Follicular/enzymology , Lymphoma, Follicular/genetics , Lymphoma, Large B-Cell, Diffuse/enzymology , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Non-Hodgkin/enzymology , MADS Domain Proteins/genetics , MADS Domain Proteins/metabolism , MEF2 Transcription Factors , Myogenic Regulatory Factors/genetics , Myogenic Regulatory Factors/metabolism , Neoplasm Proteins/genetics , Neoplasm Proteins/metabolism
ABSTRACT
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.
Subject(s)
Genome, Plant , Multigene Family , Phenols/metabolism , Picea/genetics , Terpenes/metabolism , Alkyl and Aryl Transferases/metabolism , Computational Biology , Cytochrome P-450 Enzyme System/metabolism , Transcriptome
ABSTRACT
BACKGROUND: While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. RESULTS: Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8% and 13.8% of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study. CONCLUSION: Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.
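Closing a gap by navigating a de Bruijn graph amounts to searching for a path of overlapping k-mers that connects the sequence on the left of the gap to the sequence on the right. The sketch below is a simplified illustration and not Sealer's algorithm: it uses an exact Python set where Sealer uses a Bloom filter, a single k, and a plain one-directional breadth-first search; the function name, the length limit, and the toy read are assumptions.

```python
from collections import deque


def close_gap(kmer_set: set, left: str, right: str, max_len: int):
    """Breadth-first search from k-mer `left` to k-mer `right`.

    Returns the connecting sequence (including both flanking k-mers),
    or None if no path of at most max_len bases is found.
    """
    queue = deque([(left, left)])
    seen = {left}
    while queue:
        kmer, path = queue.popleft()
        if kmer == right:
            return path
        if len(path) >= max_len:
            continue  # do not extend paths beyond the length limit
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in kmer_set and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None


# Toy example: k-mers (k = 4) from a single hypothetical read spanning the gap.
read = "ACGTACGGA"
kmer_set = {read[i:i + 4] for i in range(len(read) - 3)}

print(close_gap(kmer_set, left="ACGT", right="CGGA", max_len=20))
```

Swapping the exact set for a Bloom filter keeps the same membership-query interface while cutting memory, which is what lets this style of search scale to multi-gigabase genomes; the length limit bounds the search when the flanks cannot be connected.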
Subject(s)
Computational Biology/methods , User-Computer Interface , Algorithms , Genome, Human , Genome, Plant , High-Throughput Nucleotide Sequencing , Humans , Internet , Pinaceae/genetics , Sequence Analysis, DNA
ABSTRACT
Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools.
Subject(s)
Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Humans , Mice
ABSTRACT
UNLABELLED: White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome, however, pushes the boundaries of current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8-gigabase-pair draft genome in 4.9 million scaffolds, with a scaffold N50 of 20,356 bp. We demonstrate how recent improvements in sequencing technology, especially increasing read lengths and paired-end reads from longer fragments, have a major impact on assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. AVAILABILITY: The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000 PID: PRJNA83435). http://www.ncbi.nlm.nih.gov/bioproject/83435.
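For readers unfamiliar with the contiguity statistic: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly, and NG50 uses half of the estimated genome size instead of half of the assembly size. A minimal sketch of that standard definition follows; the function name and the toy lengths are illustrative and unrelated to the spruce data.

```python
def n50(lengths, genome_size=None):
    """Return N50 of the given sequence lengths.

    If genome_size is given, the threshold is half of that value instead
    of half of the assembly size, which yields NG50. Returns 0 when the
    sequences never reach the threshold.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


scaffold_lengths = [2, 3, 4, 5, 6]  # toy lengths, arbitrary units
print(n50(scaffold_lengths))                  # N50 over the assembly size (20)
print(n50(scaffold_lengths, genome_size=30))  # NG50 against a genome size of 30
```

Because NG50 is anchored to the genome size rather than the assembly size, it allows assemblies of different total length to be compared on the same footing.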
Subject(s)
Genome, Plant , Genomics/methods , Picea/genetics , Base Sequence , Molecular Sequence Data , Sequence Analysis, DNA , Software
ABSTRACT
BACKGROUND: Ophiostoma piceae is a wood-staining fungus that grows in the sapwood of conifer logs and lumber. We sequenced its genome and analyzed its transcriptomes under a range of growth conditions. A comparison with the genome and transcriptomes of the mountain pine beetle-associated pathogen Grosmannia clavigera highlights differences between a pathogen that colonizes and kills living pine trees and a saprophyte that colonizes wood and the inner bark of dead trees. RESULTS: We assembled a 33 Mbp genome in 45 scaffolds, and predicted approximately 8,884 genes. The genome size and gene content were similar to those of other ascomycetes. Despite having similar ecological niches, O. piceae and G. clavigera showed no large-scale synteny. We identified O. piceae genes involved in the biosynthesis of melanin, which causes wood discoloration and reduces the commercial value of wood products. We also identified genes and pathways involved in growth on simple carbon sources and in sapwood, O. piceae's natural substrate. Like the pathogen, the saprophyte is able to tolerate terpenes, which are a major class of pine tree defense compounds; unlike the pathogen, it cannot utilize monoterpenes as a carbon source. CONCLUSIONS: This work makes available the second annotated genome of a softwood ophiostomatoid fungus, and suggests that O. piceae's tolerance to terpenes may be due in part to these chemicals being removed from the cells by an ABC transporter that is highly induced by terpenes. The data generated will provide the research community with resources for work on host-vector-fungus interactions for wood-inhabiting, beetle-associated saprophytes and pathogens.
Subject(s)
Coleoptera/microbiology , Genome, Fungal/genetics , Ophiostoma/genetics , Ophiostoma/physiology , Pinus/microbiology , Transcriptome , Animals , Mannose/pharmacology , Molecular Sequence Annotation , Oleic Acid/pharmacology , Ophiostoma/drug effects , Ophiostoma/growth & development , Species Specificity , Triglycerides/pharmacology , Wood/microbiology
ABSTRACT
BACKGROUND: Chimeric transcripts, including partial and internal tandem duplications (PTDs, ITDs) and gene fusions, are important in the detection, prognosis, and treatment of human cancers. RESULTS: We describe Barnacle, a production-grade analysis tool that detects such chimeras in de novo assemblies of RNA-seq data, and supports prioritizing them for review and validation by reporting the relative coverage of co-occurring chimeric and wild-type transcripts. We demonstrate applications in large-scale disease studies, by identifying PTDs in MLL, ITDs in FLT3, and reciprocal fusions between PML and RARA, in two deeply sequenced acute myeloid leukemia (AML) RNA-seq datasets. CONCLUSIONS: Our analyses of real and simulated data sets show that, with appropriate filter settings, Barnacle makes highly specific predictions for three types of chimeric transcripts that are important in a range of cancers: PTDs, ITDs, and fusions. High specificity makes manual review and validation efficient, which is necessary in large-scale disease studies. Characterizing an extended range of chimera types will help generate insights into progression, treatment, and outcomes for complex diseases.
Subject(s)
Gene Duplication/genetics , Gene Expression Profiling/methods , Gene Fusion/genetics , Genomics , Breast Neoplasms/genetics , Exons/genetics , Humans , Leukemia, Myeloid, Acute/genetics , Molecular Sequence Annotation , RNA, Messenger/genetics , Statistics as Topic
ABSTRACT
We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.
Subject(s)
Computational Biology/methods , Gene Expression Profiling , Sequence Analysis, DNA/methods , Animals , Mice
ABSTRACT
Oligodendroglioma is characterized by unique clinical, pathological, and genetic features. Recurrent losses of chromosomes 1p and 19q are strongly associated with this brain cancer but knowledge of the identity and function of the genes affected by these alterations is limited. We performed exome sequencing on a discovery set of 16 oligodendrogliomas with 1p/19q co-deletion to identify new molecular features at base-pair resolution. As anticipated, there was a high rate of IDH mutations: all cases had mutations in either IDH1 (14/16) or IDH2 (2/16). In addition, we discovered somatic mutations and insertions/deletions in the CIC gene on chromosome 19q13.2 in 13/16 tumours. These discovery set mutations were validated by deep sequencing of 13 additional tumours, which revealed seven others with CIC mutations, thus bringing the overall mutation rate in oligodendrogliomas in this study to 20/29 (69%). In contrast, deep sequencing of astrocytomas and oligoastrocytomas without 1p/19q loss revealed that CIC alterations were otherwise rare (1/60; 2%). Of the 21 non-synonymous somatic mutations in 20 CIC-mutant oligodendrogliomas, nine were in exon 5 within an annotated DNA-interacting domain and three were in exon 20 within an annotated protein-interacting domain. The remaining nine were found in other exons and frequently included truncations. CIC mutations were highly associated with oligodendroglioma histology, 1p/19q co-deletion, and IDH1/2 mutation (p < 0.001). Although we observed no differences in the clinical outcomes of CIC mutant versus wild-type tumours, in a background of 1p/19q co-deletion, hemizygous CIC mutations are likely important. We hypothesize that the mutant CIC on the single retained 19q allele is linked to the pathogenesis of oligodendrogliomas with IDH mutation. Our detailed study of genetic aberrations in oligodendroglioma suggests a functional interaction between CIC mutation, IDH1/2 mutation, and 1p/19q co-deletion.
Subject(s)
Biomarkers, Tumor/genetics , Brain Neoplasms/genetics , Isocitrate Dehydrogenase/genetics , Oligodendroglioma/genetics , Repressor Proteins/genetics , Biomarkers, Tumor/analysis , Brain Neoplasms/mortality , Brain Neoplasms/pathology , Chromosomes, Human, Pair 1/genetics , Chromosomes, Human, Pair 19/genetics , Disease-Free Survival , Humans , Kaplan-Meier Estimate , Mutation , Neoplasm Grading , Oligodendroglioma/mortality , Oligodendroglioma/pathology
ABSTRACT
Mycoplasma capricolum subsp. capripneumoniae is the causative agent of contagious caprine pleuropneumonia, a devastating disease of goats listed by the World Organization for Animal Health. Here we report the first complete genome sequence of this organism (strain M1601, a clinically isolated strain from China).
Subject(s)
DNA, Bacterial/chemistry , DNA, Bacterial/genetics , Genome, Bacterial , Mycoplasma capricolum/genetics , Animals , China , Goat Diseases/microbiology , Goats , Molecular Sequence Data , Mycoplasma capricolum/isolation & purification , Pleuropneumonia, Contagious/microbiology , Sequence Analysis, DNA
ABSTRACT
BACKGROUND: As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. RESULTS: We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. CONCLUSIONS: This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.
Subject(s)
Bees/microbiology , Genome, Bacterial , Paenibacillus/genetics , Animals , Comparative Genomic Hybridization , Computational Biology , DNA, Bacterial/genetics , Molecular Sequence Annotation , Proteomics , Sequence Analysis, DNA
ABSTRACT
MOTIVATION: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION: Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT: ibirol@bcgsc.ca.
Subject(s)
Computational Biology/methods , Gene Expression Profiling , Software , Databases, Genetic , Genome , Sequence Analysis, DNA
ABSTRACT
Plant mitochondrial genomes vary widely in size. Although many plant mitochondrial genomes have been sequenced and assembled, the vast majority are of angiosperms, and few are of gymnosperms. Most plant mitochondrial genomes are smaller than a megabase, with a few notable exceptions. We have sequenced and assembled the complete 5.5-Mb mitochondrial genome of Sitka spruce (Picea sitchensis), to date, one of the largest mitochondrial genomes of a gymnosperm. We sequenced the whole genome using Oxford Nanopore MinION, and then identified contigs of mitochondrial origin assembled from these long reads based on sequence homology to the white spruce mitochondrial genome. The assembly graph shows a multipartite genome structure, composed of one smaller 168-kb circular segment of DNA, and a larger 5.4-Mb single component with a branching structure. The assembly graph gives insight into a putative complex physical genome structure, and its branching points may represent active sites of recombination.
Subject(s)
Genome, Mitochondrial , Genome, Plant , Picea/genetics , Molecular Structure
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.