
1.

The western redcedar genome reveals low genetic diversity in a self-compatible conifer.

Shalev, Tal J; Gamal El-Dien, Omnia; Yuen, Macaire M S; Shengqiang, Shu; Jackman, Shaun D; Warren, René L; Coombe, Lauren; van der Merwe, Lise; Stewart, Ada; Boston, Lori B; Plott, Christopher; Jenkins, Jerry; He, Guifen; Yan, Juying; Yan, Mi; Guo, Jie; Breinholt, Jesse W; Neves, Leandro G; Grimwood, Jane; Rieseberg, Loren H; Schmutz, Jeremy; Birol, Inanc; Kirst, Matias; Yanchuk, Alvin D; Ritland, Carol; Russell, John H; Bohlmann, Joerg.

Genome Res ; 32(10): 1952-1964, 2022 10.

Article in English | MEDLINE | ID: mdl-36109148

ABSTRACT

We assembled the 9.8-Gbp genome of western redcedar (WRC; Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis, one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci showing zero diversity, and the ratio of π at zero- to fourfold degenerate sites is relatively high (approximately 0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. Although overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Nonreference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.
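As a point of reference for the "expected" decline mentioned in the abstract: under strict selfing with no selection, the textbook neutral expectation is that heterozygosity halves each generation. The sketch below only illustrates that expectation; it is not the model used by the authors, and the function name and the starting value of 1.0 are assumptions chosen for illustration.

```python
def expected_heterozygosity(h0: float, generations: int) -> list[float]:
    """Neutral expectation under strict selfing: H_t = H_0 * (1/2) ** t."""
    return [h0 * 0.5 ** t for t in range(generations + 1)]


# Hypothetical starting heterozygosity of 1.0 (relative units),
# followed through five generations of selfing (S0 to S5).
trajectory = expected_heterozygosity(1.0, 5)
for generation, h in enumerate(trajectory):
    print(f"S{generation}: {h:.5f}")
```

Loci that retain heterozygosity well above this curve are the kind of deviation the abstract attributes to associative overdominance.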

Subject(s)

Tracheophyta , Tracheophyta/genetics , Self-Fertilization/genetics , Alleles , Heterozygote , Polymorphism, Genetic , Genetic Variation , Selection, Genetic

2.

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter.

Jackman, Shaun D; Vandervalk, Benjamin P; Mohamadi, Hamid; Chu, Justin; Yeo, Sarah; Hammond, S Austin; Jahesh, Golnaz; Khan, Hamza; Coombe, Lauren; Warren, Rene L; Birol, Inanc.

Genome Res ; 27(5): 768-777, 2017 05.

Article in English | MEDLINE | ID: mdl-28232478

ABSTRACT

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
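The idea of representing a de Bruijn graph with a Bloom filter can be sketched in a few lines. The example below is a minimal illustration, not the ABySS 2.0 implementation: the class and function names, the use of BLAKE2b as the hash function, and the filter size are all assumptions, and reverse-complement (canonical) k-mers are ignored for brevity. The k-mers are stored only as set bits, and graph edges are recovered implicitly by querying the four possible one-base extensions of a k-mer.

```python
import hashlib
from typing import Iterator


class KmerBloomFilter:
    """A small Bloom filter over k-mer strings (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer: str) -> Iterator[int]:
        # Derive several hash values by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bloom: KmerBloomFilter, kmer: str) -> list[str]:
    """Implicit de Bruijn edges: test the four possible one-base extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


bloom = KmerBloomFilter(num_bits=1 << 16, num_hashes=3)
read = "ACGTACGGTCA"  # hypothetical read
for kmer in kmers(read, 5):
    bloom.add(kmer)

print(successors(bloom, "ACGTA"))
```

Because only bits are stored, memory depends on the filter size rather than on the length of the k-mers, at the cost of occasional false-positive edges that an assembler has to tolerate or filter out.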

Subject(s)

Contig Mapping/methods , Genomics/methods , Software , Contig Mapping/standards , Genome Size , Genomics/standards , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards

3.

ORCA: a comprehensive bioinformatics container environment for education and research.

Jackman, Shaun D; Mozgacheva, Tatyana; Chen, Susie; O'Huiginn, Brendan; Bailey, Lance; Birol, Inanc; Jones, Steven J M.

Bioinformatics ; 35(21): 4448-4450, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31004474

ABSTRACT

SUMMARY: The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but without requiring users to interact directly with Docker, suitable for novices who may not yet have familiarity with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, for teaching an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research. AVAILABILITY AND IMPLEMENTATION: The ORCA Docker image is available at https://hub.docker.com/r/bcgsc/orca/. The source code of ORCA is available at https://github.com/bcgsc/orca under the MIT license.

Subject(s)

Computational Biology , Software , Genome

4.

ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.

Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P; Chu, Justin; Jackman, Shaun D; Birol, Inanc; Warren, René L.

BMC Bioinformatics ; 19(1): 234, 2018 06 20.

Article in English | MEDLINE | ID: mdl-29925315

ABSTRACT

BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. RESULTS: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). CONCLUSIONS: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
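The Jaccard index mentioned in the conclusions is the size of the intersection of two sets divided by the size of their union. The sketch below shows only that calculation on two made-up barcode sets; the barcode names and the function are hypothetical, and the regression ARKS fits between Jaccard indices and known distances is not reproduced here.

```python
def jaccard_index(a: set[str], b: set[str]) -> float:
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical barcodes observed at two contig ends.
barcodes_end_a = {"BX1", "BX2", "BX3", "BX4"}
barcodes_end_b = {"BX3", "BX4", "BX5"}

score = jaccard_index(barcodes_end_a, barcodes_end_b)
print(score)  # 2 shared barcodes out of 5 distinct ones
```

Intuitively, regions that lie close together on the same molecule tend to share more barcodes, so a higher index suggests a smaller gap.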

Subject(s)

Chromosomes, Human/genetics , Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Humans

5.

Tigmint: correcting assembly errors using linked reads from large molecules.

Jackman, Shaun D; Coombe, Lauren; Chu, Justin; Warren, Rene L; Vandervalk, Benjamin P; Yeo, Sarah; Xue, Zhuyi; Mohamadi, Hamid; Bohlmann, Joerg; Jones, Steven J M; Birol, Inanc.

BMC Bioinformatics ; 19(1): 393, 2018 Oct 26.

Article in English | MEDLINE | ID: mdl-30367597

ABSTRACT

BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. RESULTS: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. CONCLUSIONS: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Software , Chromosomes, Human/genetics , Genome, Human , Genomics , Humans , Nanopores , Repetitive Sequences, Nucleic Acid

6.

Subgroup-specific structural variation across 1,000 medulloblastoma genomes.

Northcott, Paul A; Shih, David J H; Peacock, John; Garzia, Livia; Morrissy, A Sorana; Zichner, Thomas; Stütz, Adrian M; Korshunov, Andrey; Reimand, Jüri; Schumacher, Steven E; Beroukhim, Rameen; Ellison, David W; Marshall, Christian R; Lionel, Anath C; Mack, Stephen; Dubuc, Adrian; Yao, Yuan; Ramaswamy, Vijay; Luu, Betty; Rolider, Adi; Cavalli, Florence M G; Wang, Xin; Remke, Marc; Wu, Xiaochong; Chiu, Readman Y B; Chu, Andy; Chuah, Eric; Corbett, Richard D; Hoad, Gemma R; Jackman, Shaun D; Li, Yisu; Lo, Allan; Mungall, Karen L; Nip, Ka Ming; Qian, Jenny Q; Raymond, Anthony G J; Thiessen, Nina T; Varhol, Richard J; Birol, Inanc; Moore, Richard A; Mungall, Andrew J; Holt, Robert; Kawauchi, Daisuke; Roussel, Martine F; Kool, Marcel; Jones, David T W; Witt, Hendrick; Fernandez-L, Africa; Kenney, Anna M; Wechsler-Reya, Robert J.

Nature ; 488(7409): 49-56, 2012 Aug 02.

Article in English | MEDLINE | ID: mdl-22832581

ABSTRACT

Medulloblastoma, the most common malignant paediatric brain tumour, is currently treated with nonspecific cytotoxic therapies including surgery, whole-brain radiation, and aggressive chemotherapy. As medulloblastoma exhibits marked intertumoural heterogeneity, with at least four distinct molecular variants, previous attempts to identify targets for therapy have been underpowered because of small sample sizes. Here we report somatic copy number aberrations (SCNAs) in 1,087 unique medulloblastomas. SCNAs are common in medulloblastoma, and are predominantly subgroup-enriched. The most common region of focal copy number gain is a tandem duplication of SNCAIP, a gene associated with Parkinson's disease, which is exquisitely restricted to Group 4α. Recurrent translocations of PVT1, including PVT1-MYC and PVT1-NDRG1, that arise through chromothripsis are restricted to Group 3. Numerous targetable SCNAs, including recurrent events targeting TGF-β signalling in Group 3, and NF-κB signalling in Group 4, suggest future avenues for rational, targeted therapy.

Subject(s)

Cerebellar Neoplasms/classification , Cerebellar Neoplasms/genetics , Genome, Human/genetics , Genomic Structural Variation/genetics , Medulloblastoma/classification , Medulloblastoma/genetics , Carrier Proteins/genetics , Cerebellar Neoplasms/metabolism , Child , DNA Copy Number Variations/genetics , Gene Duplication/genetics , Genes, myc/genetics , Genomics , Hedgehog Proteins/metabolism , Humans , Medulloblastoma/metabolism , NF-kappa B/metabolism , Nerve Tissue Proteins/genetics , Oncogene Proteins, Fusion/genetics , Proteins/genetics , RNA, Long Noncoding , Signal Transduction , Transforming Growth Factor beta/metabolism , Translocation, Genetic/genetics

7.

Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg.

Plant J ; 83(2): 189-212, 2015 Jul.

Article in English | MEDLINE | ID: mdl-26017574

ABSTRACT

White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

Subject(s)

Genome, Plant , Multigene Family , Phenols/metabolism , Picea/genetics , Terpenes/metabolism , Alkyl and Aryl Transferases/metabolism , Computational Biology , Cytochrome P-450 Enzyme System/metabolism , Transcriptome

8.

Sealer: a scalable gap-closing application for finishing draft genomes.

Paulino, Daniel; Warren, René L; Vandervalk, Benjamin P; Raymond, Anthony; Jackman, Shaun D; Birol, Inanç.

BMC Bioinformatics ; 16: 230, 2015 Jul 25.

Article in English | MEDLINE | ID: mdl-26209068

ABSTRACT

BACKGROUND: While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. RESULTS: Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8% and 13.8% of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study. CONCLUSION: Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.
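The gaps described in the abstract appear in a scaffold sequence as runs of the character N. The following sketch only locates such runs and reports their positions and lengths; it does not attempt to close them, and it is not part of Sealer. The function name and the example scaffold are assumptions made for illustration.

```python
import re


def find_gaps(scaffold: str) -> list[tuple[int, int]]:
    """Return (start, length) for each run of N or n, using 0-based starts."""
    return [
        (match.start(), match.end() - match.start())
        for match in re.finditer(r"[Nn]+", scaffold)
    ]


# Hypothetical scaffold with two gaps.
scaffold = "ACGTNNNNNACGTTGCANNACG"
gaps = find_gaps(scaffold)
print(gaps)
```

A gap closer would then take the sequence flanking each reported gap and search for a path between the two flanks, which Sealer does by navigating a Bloom filter de Bruijn graph.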

Subject(s)

Computational Biology/methods , User-Computer Interface , Algorithms , Genome, Human , Genome, Plant , High-Throughput Nucleotide Sequencing , Humans , Internet , Pinaceae/genetics , Sequence Analysis, DNA

9.

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters.

Chu, Justin; Sadeghi, Sara; Raymond, Anthony; Jackman, Shaun D; Nip, Ka Ming; Mar, Richard; Mohamadi, Hamid; Butterfield, Yaron S; Robertson, A Gordon; Birol, Inanç.

Bioinformatics ; 30(23): 3402-4, 2014 Dec 01.

Article in English | MEDLINE | ID: mdl-25143290

ABSTRACT

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools.
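The trade-off between memory and false positives can be made concrete with the standard Bloom filter approximation: for m bits, n inserted items and k hash functions, the false-positive rate is roughly (1 - e^(-kn/m))^k, and the k that minimises it is about (m/n) ln 2. The sketch below evaluates these two formulas; the function names and the 10-bits-per-item example are assumptions for illustration and say nothing about the settings BioBloom Tools itself uses.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_items: int, num_hashes: int) -> float:
    """Standard approximation: (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits: int, num_items: int) -> int:
    """Number of hash functions minimising the rate: round((m / n) * ln 2), at least 1."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Hypothetical budget of 10 bits per stored k-mer.
num_items = 1_000_000
num_bits = 10 * num_items
k = optimal_num_hashes(num_bits, num_items)
rate = bloom_false_positive_rate(num_bits, num_items, k)
print(k, f"{rate:.4f}")
```

At this budget the approximation gives a false-positive rate a little under 1%, which shows how a few extra bits per item buy a much lower error rate.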

Subject(s)

Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Humans , Mice

10.

Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data.

Birol, Inanc; Raymond, Anthony; Jackman, Shaun D; Pleasance, Stephen; Coope, Robin; Taylor, Greg A; Yuen, Macaire Man Saint; Keeling, Christopher I; Brand, Dana; Vandervalk, Benjamin P; Kirk, Heather; Pandoh, Pawan; Moore, Richard A; Zhao, Yongjun; Mungall, Andrew J; Jaquish, Barry; Yanchuk, Alvin; Ritland, Carol; Boyle, Brian; Bousquet, Jean; Ritland, Kermit; Mackay, John; Bohlmann, Jörg; Jones, Steven J M.

Bioinformatics ; 29(12): 1492-7, 2013 Jun 15.

Article in English | MEDLINE | ID: mdl-23698863

ABSTRACT

UNLABELLED: White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome, though, pushes the boundaries of the current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8 giga base pairs draft genome in 4.9 million scaffolds, with a scaffold N50 of 20,356 bp. We demonstrate how recent improvements in the sequencing technology, especially increasing read lengths and paired-end reads from longer fragments, have a major impact on the assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. AVAILABILITY: The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000 PID: PRJNA83435). http://www.ncbi.nlm.nih.gov/bioproject/83435.
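The scaffold N50 quoted in the abstract is a contiguity statistic: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy scaffold lengths are hypothetical and unrelated to the white spruce data.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Hypothetical scaffold lengths summing to 300 bp.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))
```

NG50, used in several of the other abstracts on this page, follows the same procedure but measures against half of the estimated genome size rather than half of the assembly size.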

Subject(s)

Genome, Plant , Genomics/methods , Picea/genetics , Base Sequence , Molecular Sequence Data , Sequence Analysis, DNA , Software

11.

Barnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies.

Swanson, Lucas; Robertson, Gordon; Mungall, Karen L; Butterfield, Yaron S; Chiu, Readman; Corbett, Richard D; Docking, T Roderick; Hogge, Donna; Jackman, Shaun D; Moore, Richard A; Mungall, Andrew J; Nip, Ka Ming; Parker, Jeremy D K; Qian, Jenny Qing; Raymond, Anthony; Sung, Sandy; Tam, Angela; Thiessen, Nina; Varhol, Richard; Wang, Sherry; Yorukoglu, Deniz; Zhao, Yongjun; Hoodless, Pamela A; Sahinalp, S Cenk; Karsan, Aly; Birol, Inanc.

BMC Genomics ; 14: 550, 2013 Aug 14.

Article in English | MEDLINE | ID: mdl-23941359

ABSTRACT

BACKGROUND: Chimeric transcripts, including partial and internal tandem duplications (PTDs, ITDs) and gene fusions, are important in the detection, prognosis, and treatment of human cancers. RESULTS: We describe Barnacle, a production-grade analysis tool that detects such chimeras in de novo assemblies of RNA-seq data, and supports prioritizing them for review and validation by reporting the relative coverage of co-occurring chimeric and wild-type transcripts. We demonstrate applications in large-scale disease studies, by identifying PTDs in MLL, ITDs in FLT3, and reciprocal fusions between PML and RARA, in two deeply sequenced acute myeloid leukemia (AML) RNA-seq datasets. CONCLUSIONS: Our analyses of real and simulated data sets show that, with appropriate filter settings, Barnacle makes highly specific predictions for three types of chimeric transcripts that are important in a range of cancers: PTDs, ITDs, and fusions. High specificity makes manual review and validation efficient, which is necessary in large-scale disease studies. Characterizing an extended range of chimera types will help generate insights into progression, treatment, and outcomes for complex diseases.

Subject(s)

Gene Duplication/genetics , Gene Expression Profiling/methods , Gene Fusion/genetics , Genomics , Breast Neoplasms/genetics , Exons/genetics , Humans , Leukemia, Myeloid, Acute/genetics , Molecular Sequence Annotation , RNA, Messenger/genetics , Statistics as Topic

12.

De novo assembly and analysis of RNA-seq data.

Robertson, Gordon; Schein, Jacqueline; Chiu, Readman; Corbett, Richard; Field, Matthew; Jackman, Shaun D; Mungall, Karen; Lee, Sam; Okada, Hisanaga Mark; Qian, Jenny Q; Griffith, Malachi; Raymond, Anthony; Thiessen, Nina; Cezard, Timothee; Butterfield, Yaron S; Newsome, Richard; Chan, Simon K; She, Rong; Varhol, Richard; Kamoh, Baljit; Prabhu, Anna-Liisa; Tam, Angela; Zhao, YongJun; Moore, Richard A; Hirst, Martin; Marra, Marco A; Jones, Steven J M; Hoodless, Pamela A; Birol, Inanc.

Nat Methods ; 7(11): 909-12, 2010 Nov.

Article in English | MEDLINE | ID: mdl-20935650

ABSTRACT

We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.

Subject(s)

Computational Biology/methods , Gene Expression Profiling , Sequence Analysis, DNA/methods , Animals , Mice

13.

Updated genome assembly and annotation of Paenibacillus larvae, the agent of American foulbrood disease of honey bees.

Chan, Queenie W T; Cornman, R Scott; Birol, Inanc; Liao, Nancy Y; Chan, Simon K; Docking, T Roderick; Jackman, Shaun D; Taylor, Greg A; Jones, Steven J M; de Graaf, Dirk C; Evans, Jay D; Foster, Leonard J.

BMC Genomics ; 12: 450, 2011 Sep 16.

Article in English | MEDLINE | ID: mdl-21923906

ABSTRACT

BACKGROUND: As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. RESULTS: We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. CONCLUSIONS: This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.

Subject(s)

Bees/microbiology , Genome, Bacterial , Paenibacillus/genetics , Animals , Comparative Genomic Hybridization , Computational Biology , DNA, Bacterial/genetics , Molecular Sequence Annotation , Proteomics , Sequence Analysis, DNA

14.

De novo transcriptome assembly with ABySS.

Birol, Inanç; Jackman, Shaun D; Nielsen, Cydney B; Qian, Jenny Q; Varhol, Richard; Stazyk, Greg; Morin, Ryan D; Zhao, Yongjun; Hirst, Martin; Schein, Jacqueline E; Horsman, Doug E; Connors, Joseph M; Gascoyne, Randy D; Marra, Marco A; Jones, Steven J M.

Bioinformatics ; 25(21): 2872-7, 2009 Nov 01.

Article in English | MEDLINE | ID: mdl-19528083

ABSTRACT

MOTIVATION: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION: Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT: ibirol@bcgsc.ca.

Subject(s)

Computational Biology/methods , Gene Expression Profiling , Software , Databases, Genetic , Genome , Sequence Analysis, DNA

15.

Complete Mitochondrial Genome of a Gymnosperm, Sitka Spruce (Picea sitchensis), Indicates a Complex Physical Structure.

Jackman, Shaun D; Coombe, Lauren; Warren, René L; Kirk, Heather; Trinh, Eva; MacLeod, Tina; Pleasance, Stephen; Pandoh, Pawan; Zhao, Yongjun; Coope, Robin J; Bousquet, Jean; Bohlmann, Joerg; Jones, Steven J M; Birol, Inanc.

Genome Biol Evol ; 12(7): 1174-1179, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32449750

ABSTRACT

Plant mitochondrial genomes vary widely in size. Although many plant mitochondrial genomes have been sequenced and assembled, the vast majority are of angiosperms, and few are of gymnosperms. Most plant mitochondrial genomes are smaller than a megabase, with a few notable exceptions. We have sequenced and assembled the complete 5.5-Mb mitochondrial genome of Sitka spruce (Picea sitchensis), to date, one of the largest mitochondrial genomes of a gymnosperm. We sequenced the whole genome using Oxford Nanopore MinION, and then identified contigs of mitochondrial origin assembled from these long reads based on sequence homology to the white spruce mitochondrial genome. The assembly graph shows a multipartite genome structure, composed of one smaller 168-kb circular segment of DNA, and a larger 5.4-Mb single component with a branching structure. The assembly graph gives insight into a putative complex physical genome structure, and its branching points may represent active sites of recombination.

Subject(s)

Genome, Mitochondrial , Genome, Plant , Picea/genetics , Molecular Structure

16.

ABySS-Explorer: visualizing genome sequence assemblies.

Nielsen, Cydney B; Jackman, Shaun D; Birol, Inanç; Jones, Steven J M.

IEEE Trans Vis Comput Graph ; 15(6): 881-8, 2009.

Article in English | MEDLINE | ID: mdl-19834150

ABSTRACT

One bottleneck in large-scale genome sequencing projects is reconstructing the full genome sequence from the short subsequences produced by current technologies. The final stages of the genome assembly process inevitably require manual inspection of data inconsistencies and could be greatly aided by visualization. This paper presents our design decisions in translating key data features identified through discussions with analysts into a concise visual encoding. Current visualization tools in this domain focus on local sequence errors making high-level inspection of the assembly difficult if not impossible. We present a novel interactive graph display, ABySS-Explorer, that emphasizes the global assembly structure while also integrating salient data features such as sequence length. Our tool replaces manual and in some cases pen-and-paper based analysis tasks, and we discuss how user feedback was incorporated into iterative design refinements. Finally, we touch on applications of this representation not initially considered in our design phase, suggesting the generality of this encoding for DNA sequence data.

Subject(s)

Chromosome Mapping/methods , Computational Biology/methods , Computer Graphics , DNA/chemistry , Base Sequence

17.

Complete Chloroplast Genome Sequence of a White Spruce (Picea glauca, Genotype WS77111) from Eastern Canada.

Lin, Diana; Coombe, Lauren; Jackman, Shaun D; Gagalova, Kristina K; Warren, René L; Hammond, S Austin; Kirk, Heather; Pandoh, Pawan; Zhao, Yongjun; Moore, Richard A; Mungall, Andrew J; Ritland, Carol; Jaquish, Barry; Isabel, Nathalie; Bousquet, Jean; Jones, Steven J M; Bohlmann, Joerg; Birol, Inanc.

Microbiol Resour Announc ; 8(23), 2019 Jun 06.

Article in English | MEDLINE | ID: mdl-31171622

ABSTRACT

Here, we present the complete chloroplast genome sequence of white spruce (Picea glauca, genotype WS77111), a coniferous tree widespread in the boreal forests of North America. This sequence contributes to genomic and phylogenetic analyses of the Picea genus that are part of ongoing research to understand their adaptation to environmental stress.

18.

The Genome of the Steller Sea Lion (Eumetopias jubatus).

Kwan, Harwood H; Culibrk, Luka; Taylor, Gregory A; Leelakumari, Sreeja; Tan, Ryan; Jackman, Shaun D; Tse, Kane; MacLeod, Tina; Cheng, Dean; Chuah, Eric; Kirk, Heather; Pandoh, Pawan; Carlsen, Rebecca; Zhao, Yongjun; Mungall, Andrew J; Moore, Richard; Birol, Inanc; Marra, Marco A; Rosen, David A S; Haulena, Martin; Jones, Steven J M.

Genes (Basel) ; 10(7), 2019 06 26.

Article in English | MEDLINE | ID: mdl-31248052

ABSTRACT

The Steller sea lion is the largest member of the Otariidae family and is found in the coastal waters of the northern Pacific Rim. Here, we present the Steller sea lion genome, determined through DNA sequencing approaches that utilized microfluidic partitioning library construction, as well as nanopore technologies. These methods constructed a highly contiguous assembly with a scaffold N50 length of over 14 megabases, a contig N50 length of over 242 kilobases and a total length of 2.404 gigabases. As a measure of completeness, 95.1% of 4104 highly conserved mammalian genes were found to be complete within the assembly. Further annotation identified 19,668 protein coding genes. The assembled genome sequence and underlying sequence data can be found at the National Center for Biotechnology Information (NCBI) under the BioProject accession number PRJNA475770.

Subject(s)

Genome , Sea Lions/genetics , Animals , Genomic Library , Microfluidics/methods , Nanopores , Whole Genome Sequencing

19.

Complete Chloroplast Genome Sequence of an Engelmann Spruce (Picea engelmannii, Genotype Se404-851) from Western Canada.

Lin, Diana; Coombe, Lauren; Jackman, Shaun D; Gagalova, Kristina K; Warren, René L; Hammond, S Austin; McDonald, Helen; Kirk, Heather; Pandoh, Pawan; Zhao, Yongjun; Moore, Richard A; Mungall, Andrew J; Ritland, Carol; Doerksen, Trevor; Jaquish, Barry; Bousquet, Jean; Jones, Steven J M; Bohlmann, Joerg; Birol, Inanc.

Microbiol Resour Announc ; 8(24), 2019 Jun 13.

Article in English | MEDLINE | ID: mdl-31196920

ABSTRACT

Engelmann spruce (Picea engelmannii) is a conifer found primarily on the west coast of North America. Here, we present the complete chloroplast genome sequence of Picea engelmannii genotype Se404-851. This chloroplast sequence will benefit future conifer genomic research and contribute resources to further species conservation efforts.

20.

The Genome of the North American Brown Bear or Grizzly: Ursus arctos ssp. horribilis.

Taylor, Gregory A; Kirk, Heather; Coombe, Lauren; Jackman, Shaun D; Chu, Justin; Tse, Kane; Cheng, Dean; Chuah, Eric; Pandoh, Pawan; Carlsen, Rebecca; Zhao, Yongjun; Mungall, Andrew J; Moore, Richard; Birol, Inanc; Franke, Maria; Marra, Marco A; Dutton, Christopher; Jones, Steven J M.

Genes (Basel) ; 9(12), 2018 Nov 30.

Article in English | MEDLINE | ID: mdl-30513700

ABSTRACT

The grizzly bear (Ursus arctos ssp. horribilis) represents the largest population of brown bears in North America. Its genome was sequenced using a microfluidic partitioning library construction technique, and these data were supplemented with sequencing from a nanopore-based long-read platform. The final assembly was 2.33 Gb with a scaffold N50 of 36.7 Mb, and the genome is of comparable size to that of its close relative the polar bear (2.30 Gb). An analysis using 4104 highly conserved mammalian genes indicated that 96.1% were complete within the assembly. An automated annotation of the genome identified 19,848 protein coding genes. Our study shows that the combination of the two sequencing modalities that we used is sufficient for the construction of highly contiguous reference-quality mammalian genomes. The assembled genome sequence and the supporting raw sequence reads are available from the NCBI (National Center for Biotechnology Information) under the BioProject identifier PRJNA493656, and the assembly described in this paper is version QXTK01000000.
