Search | VHL Regional Portal

1.

ntHash2: recursive spaced seed hashing for nucleotide sequences.

Kazemi, Parham; Wong, Johnathan; Nikolic, Vladimir; Mohamadi, Hamid; Warren, René L; Birol, Inanç.

Bioinformatics ; 38(20): 4812-4813, 2022 10 14.

Article in English | MEDLINE | ID: mdl-36000872

ABSTRACT

MOTIVATION: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. RESULTS: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. AVAILABILITY AND IMPLEMENTATION: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Base Sequence , Seeds , Sequence Analysis, DNA

2.

Spruce giga-genomes: structurally similar yet distinctive with differentially expanding gene families and rapidly evolving genes.

Gagalova, Kristina K; Warren, René L; Coombe, Lauren; Wong, Johnathan; Nip, Ka Ming; Yuen, Macaire Man Saint; Whitehill, Justin G A; Celedon, Jose M; Ritland, Carol; Taylor, Greg A; Cheng, Dean; Plettner, Patrick; Hammond, S Austin; Mohamadi, Hamid; Zhao, Yongjun; Moore, Richard A; Mungall, Andrew J; Boyle, Brian; Laroche, Jérôme; Cottrell, Joan; Mackay, John J; Lamothe, Manuel; Gérardi, Sébastien; Isabel, Nathalie; Pavy, Nathalie; Jones, Steven J M; Bohlmann, Joerg; Bousquet, Jean; Birol, Inanc.

Plant J ; 111(5): 1469-1485, 2022 09.

Article in English | MEDLINE | ID: mdl-35789009

ABSTRACT

Spruces (Picea spp.) are coniferous trees widespread in boreal and mountainous forests of the northern hemisphere, with large economic significance and enormous contributions to global carbon sequestration. Spruces harbor very large genomes with high repetitiveness, hampering their comparative analysis. Here, we present and compare the genomes of four different North American spruces: the genome assemblies for Engelmann spruce (Picea engelmannii) and Sitka spruce (Picea sitchensis) together with improved and more contiguous genome assemblies for white spruce (Picea glauca) and for a naturally occurring introgress of these three species known as interior spruce (P. engelmannii × glauca × sitchensis). The genomes were structurally similar, and a large part of scaffolds could be anchored to a genetic map. The composition of the interior spruce genome indicated asymmetric contributions from the three ancestral genomes. Phylogenetic analysis of the nuclear and organelle genomes revealed a topology indicative of ancient reticulation. Different patterns of expansion of gene families among genomes were observed and related with presumed diversifying ecological adaptations. We identified rapidly evolving genes that harbored high rates of non-synonymous polymorphisms relative to synonymous ones, indicative of positive selection and its hitchhiking effects. These gene sets were mostly distinct between the genomes of ecologically contrasted species, and signatures of convergent balancing selection were detected. Stress and stimulus response was identified as the most frequent function assigned to expanding gene families and rapidly evolving genes. These two aspects of genomic evolution were complementary in their contribution to divergent evolution of presumed adaptive nature. These more contiguous spruce giga-genome sequences should strengthen our understanding of conifer genome structure and evolution, as their comparison offers clues into the genetic basis of adaptation and ecology of conifers at the genomic level. They will also provide tools to better monitor natural genetic diversity and improve the management of conifer forests. The genomes of four closely related North American spruces indicate that their high similarity at the morphological level is paralleled by the high conservation of their physical genome structure. Yet, the evidence of divergent evolution is apparent in their rapidly evolving genomes, supported by differential expansion of key gene families and large sets of genes under positive selection, largely in relation to stimulus and environmental stress response.

Subject(s)

Picea , Tracheophyta , Expressed Sequence Tags , Genome, Plant/genetics , Multigene Family/genetics , Phylogeny , Picea/genetics , Tracheophyta/genetics

3.

RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes.

Nip, Ka Ming; Chiu, Readman; Yang, Chen; Chu, Justin; Mohamadi, Hamid; Warren, René L; Birol, Inanc.

Genome Res ; 30(8): 1191-1200, 2020 08.

Article in English | MEDLINE | ID: mdl-32817073

ABSTRACT

Despite the rapid advance in single-cell RNA sequencing (scRNA-seq) technologies within the last decade, single-cell transcriptome analysis workflows have primarily used gene expression data while isoform sequence analysis at the single-cell level still remains fairly limited. Detection and discovery of isoforms in single cells is difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this challenge, we developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, thus enabling unbiased discovery of novel isoforms or foreign transcripts. We compared both assembly strategies of RNA-Bloom against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In our benchmarks on a simulated 384-cell data set, reference-free RNA-Bloom reconstructed 37.9%-38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1%-11.6% more isoforms than reference-based assemblers. When applied to a real 3840-cell data set consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7%-25.0% more isoforms than the best competing reference-based and reference-free approaches evaluated. We expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is informatically accessible now.

Subject(s)

Gene Expression Profiling/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome/genetics , Algorithms , Animals , Base Sequence , Humans , Mice , Protein Isoforms/genetics , Software

4.

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters.

Chu, Justin; Mohamadi, Hamid; Erhan, Emre; Tse, Jeffery; Chiu, Readman; Yeo, Sarah; Birol, Inanc.

Proc Natl Acad Sci U S A ; 117(29): 16961-16968, 2020 07 21.

Article in English | MEDLINE | ID: mdl-32641514

ABSTRACT

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.

Subject(s)

Sequence Analysis, DNA/methods , Software , Animals , Base Pair Mismatch , Humans , Phylogeny , Sequence Alignment , Sequence Analysis, DNA/standards

5.

ntEdit: scalable genome sequence polishing.

Warren, René L; Coombe, Lauren; Mohamadi, Hamid; Zhang, Jessica; Jaquish, Barry; Isabel, Nathalie; Jones, Steven J M; Bousquet, Jean; Bohlmann, Joerg; Birol, Inanç.

Bioinformatics ; 35(21): 4430-4432, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31095290

ABSTRACT

MOTIVATION: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. RESULTS: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E.coli and C.elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. AVAILABILITY AND IMPLEMENTATION: https://github.com/bcgsc/ntedit. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , High-Throughput Nucleotide Sequencing , Animals , Genome, Human , Haploidy , Humans , Sequence Analysis, DNA , Software

6.

Tigmint: correcting assembly errors using linked reads from large molecules.

Jackman, Shaun D; Coombe, Lauren; Chu, Justin; Warren, Rene L; Vandervalk, Benjamin P; Yeo, Sarah; Xue, Zhuyi; Mohamadi, Hamid; Bohlmann, Joerg; Jones, Steven J M; Birol, Inanc.

BMC Bioinformatics ; 19(1): 393, 2018 Oct 26.

Article in English | MEDLINE | ID: mdl-30367597

ABSTRACT

BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. RESULTS: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. CONCLUSIONS: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Software , Chromosomes, Human/genetics , Genome, Human , Genomics , Humans , Nanopores , Repetitive Sequences, Nucleic Acid

7.

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data.

Khan, Hamza; Mohamadi, Hamid; Vandervalk, Benjamin P; Warren, Rene L; Chu, Justin; Birol, Inanc.

Bioinformatics ; 34(10): 1697-1704, 2018 05 15.

Article in English | MEDLINE | ID: mdl-29300846

ABSTRACT

Motivation: Sequencing studies on non-model organisms often interrogate both genomes and transcriptomes with massive amounts of short sequences. Such studies require de novo analysis tools and techniques, when the species and closely related species lack high quality reference resources. For certain applications such as de novo annotation, information on putative exons and alternative splicing may be desirable. Results: Here we present ChopStitch, a new method for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are represented as splice graphs in DOT output format. Availability and implementation: ChopStitch is written in Python and C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ChopStitch. Contact: hkhan@bcgsc.ca or ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Alternative Splicing , Exons , Transcriptome , Whole Genome Sequencing , Algorithms , Genome , High-Throughput Nucleotide Sequencing/methods , RNA , Sequence Analysis, RNA/methods , Software

8.

ntCard: a streaming algorithm for cardinality estimation in genomics data.

Mohamadi, Hamid; Khan, Hamza; Birol, Inanc.

Bioinformatics ; 33(9): 1324-1330, 2017 05 01.

Article in English | MEDLINE | ID: mdl-28453674

ABSTRACT

Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k -mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k -mers, or even better, to build a histogram of k -mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k -mer histogram from large volumes of sequencing data is a challenging task. Results: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k -mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k -mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. Availability and Implementation: ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. Contact: hmohamadi@bcgsc.ca or ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Genome Size , Genome, Human , Genome, Plant , Humans , Models, Statistical , Picea/genetics

9.

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter.

Jackman, Shaun D; Vandervalk, Benjamin P; Mohamadi, Hamid; Chu, Justin; Yeo, Sarah; Hammond, S Austin; Jahesh, Golnaz; Khan, Hamza; Coombe, Lauren; Warren, Rene L; Birol, Inanc.

Genome Res ; 27(5): 768-777, 2017 05.

Article in English | MEDLINE | ID: mdl-28232478

ABSTRACT

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.

Subject(s)

Contig Mapping/methods , Genomics/methods , Software , Contig Mapping/standards , Genome Size , Genomics/standards , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards

10.

Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art.

Chu, Justin; Mohamadi, Hamid; Warren, René L; Yang, Chen; Birol, Inanç.

Bioinformatics ; 33(8): 1261-1270, 2017 Apr 15.

Article in English | MEDLINE | ID: mdl-28003261

ABSTRACT

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. CONTACT: cjustin@bcgsc.ca , ibirol@bcgsc.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms

11.

ntHash: recursive nucleotide hashing.

Mohamadi, Hamid; Chu, Justin; Vandervalk, Benjamin P; Birol, Inanc.

Bioinformatics ; 32(22): 3492-3494, 2016 11 15.

Article in English | MEDLINE | ID: mdl-27423894

ABSTRACT

MOTIVATION: Hashing has been widely used for indexing, querying and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient. RESULTS: We present ntHash, a hashing algorithm tuned for processing DNA/RNA sequences. It performs the best when calculating hash values for adjacent k-mers in an input sequence, operating an order of magnitude faster than the best performing alternatives in typical use cases. AVAILABILITY AND IMPLEMENTATION: ntHash is available online at http://www.bcgsc.ca/platform/bioinfo/software/nthash and is free for academic use. CONTACTS: hmohamadi@bcgsc.ca or ibirol@bcgsc.caSupplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Nucleotides , Animals , Humans , Sequence Alignment , Sequence Analysis, DNA , Software

12.

Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation.

Jackman, Shaun D; Warren, René L; Gibb, Ewan A; Vandervalk, Benjamin P; Mohamadi, Hamid; Chu, Justin; Raymond, Anthony; Pleasance, Stephen; Coope, Robin; Wildung, Mark R; Ritland, Carol E; Bousquet, Jean; Jones, Steven J M; Bohlmann, Joerg; Birol, Inanç.

Genome Biol Evol ; 8(1): 29-41, 2015 Dec 08.

Article in English | MEDLINE | ID: mdl-26645680

ABSTRACT

The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes were accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.

Subject(s)

Genome, Chloroplast , Genome, Mitochondrial , Genome, Plant , Picea/genetics , Base Sequence , Contig Mapping , Molecular Sequence Annotation , Molecular Sequence Data

13.

Spaced Seed Data Structures for De Novo Assembly.

Birol, Inanç; Chu, Justin; Mohamadi, Hamid; Jackman, Shaun D; Raghavan, Karthika; Vandervalk, Benjamin P; Raymond, Anthony; Warren, René L.

Int J Genomics ; 2015: 196591, 2015.

Article in English | MEDLINE | ID: mdl-26539459

ABSTRACT

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.

14.

Konnector v2.0: pseudo-long reads from paired-end sequencing data.

Vandervalk, Benjamin P; Yang, Chen; Xue, Zhuyi; Raghavan, Karthika; Chu, Justin; Mohamadi, Hamid; Jackman, Shaun D; Chiu, Readman; Warren, René L; Birol, Inanç.

BMC Med Genomics ; 8 Suppl 3: S1, 2015.

Article in English | MEDLINE | ID: mdl-26399504

ABSTRACT

BACKGROUND: Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. RESULTS: Konnector uses a probabilistic and memory-efficient data structure called Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. CONCLUSIONS: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.

Subject(s)

Sequence Analysis, DNA/methods , Software , Algorithms , DNA/chemistry , High-Throughput Nucleotide Sequencing

15.

Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg.

Plant J ; 83(2): 189-212, 2015 Jul.

Article in English | MEDLINE | ID: mdl-26017574

ABSTRACT

White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

Subject(s)

Genome, Plant , Multigene Family , Phenols/metabolism , Picea/genetics , Terpenes/metabolism , Alkyl and Aryl Transferases/metabolism , Computational Biology , Cytochrome P-450 Enzyme System/metabolism , Transcriptome

16.

DIDA: Distributed Indexing Dispatched Alignment.

Mohamadi, Hamid; Vandervalk, Benjamin P; Raymond, Anthony; Jackman, Shaun D; Chu, Justin; Breshears, Clay P; Birol, Inanc.

PLoS One ; 10(4): e0126409, 2015.

Article in English | MEDLINE | ID: mdl-25923767

ABSTRACT

One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency License (BCCA), and is free for academic use.

Subject(s)

Computational Biology/methods , Databases, Genetic , Sequence Alignment/methods , Software , Humans

17.

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters.

Chu, Justin; Sadeghi, Sara; Raymond, Anthony; Jackman, Shaun D; Nip, Ka Ming; Mar, Richard; Mohamadi, Hamid; Butterfield, Yaron S; Robertson, A Gordon; Birol, Inanç.

Bioinformatics ; 30(23): 3402-4, 2014 Dec 01.

Article in English | MEDLINE | ID: mdl-25143290

ABSTRACT

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementaion: www.bcgsc.ca/platform/bioinfo/software/biobloomtools.

Subject(s)

Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Humans , Mice

18.

BOND: Basic OligoNucleotide Design.

Ilie, Lucian; Mohamadi, Hamid; Golding, Geoffrey Brian; Smyth, William F.

BMC Bioinformatics ; 14: 69, 2013 Feb 27.

Article in English | MEDLINE | ID: mdl-23444904

ABSTRACT

BACKGROUND: DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meet all these criteria. RESULTS: We introduce a new approach with our software program BOND (Basic OligoNucleotide Design). According to Kane's criteria for oligo design, BOND computes highly specific DNA oligonucleotides, for all the genes that admit unique probes, while running orders of magnitude faster than the existing programs. The same approach enables us to introduce also an evaluation procedure that correctly measures the quality of the oligonucleotides. Extensive comparison is performed to prove our claims. BOND is flexible, easy to use, requires no additional software, and is freely available for non-commercial use from http://www.csd.uwo.ca/â¼ilie/BOND/. CONCLUSIONS: We provide an improved solution to the important problem of oligonucleotide design, including a thorough evaluation of oligo design programs. We hope BOND will become a useful tool for researchers in biological and medical sciences by making the microarray procedures faster and more accurate.

Subject(s)

Oligonucleotides/chemistry , Software , Algorithms , Genes , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Probes/chemistry , Oligonucleotide Probes/genetics , Oligonucleotides/genetics

19.

Investigation of chromatography and polymer/salt aqueous two-phase processes for downstream processing development of recombinant phenylalanine dehydrogenase.

Omidinia, Eskandar; Shahbaz Mohamadi, Hamid; Dinarvand, Rassoul; Taherkhani, Heshmat-Allah.

Bioprocess Biosyst Eng ; 33(3): 317-29, 2010 Mar.

Article in English | MEDLINE | ID: mdl-19495799

ABSTRACT

This work presents a comprehensive study between the polymer/salt aqueous two-phase systems (ATPS) and chromatography process for downstream processing of recombinant Bacillus badius phenylalanine dehydrogenase (PheDH). First, the partitioning behavior of recombinant PheDH in polyethylene glycol (PEG)/K2HPO4 ATPS was examined. For comparative purpose, a classical chromatographic protocol was performed as well. Investigation of chromatography and ATPS procedures revealed that the ATPS comprising of 9% (w/w) PEG-6000, 16% (w/w) K2HPO4 and 16% (w/w) KCl with pH of 8.0, volume ratio (V ( R )) of 0.25, temperature of 25 degrees C and 40% (w/w) cell lysate ensured the most favorable approach for PheDH downstream process. A specific activity of 4,231.4 U/mg, a yield of 96.7% and a recovery of 162.0% were obtained. Furthermore, the shorter process time (4 vs. 48 h) and the lower total cost (4 vs. 20 euro) were additionally features that confirmed the suitability of proposed technique.

Subject(s)

Amino Acid Oxidoreductases/chemistry , Biotechnology/methods , Chromatography/methods , Polymers/chemistry , Recombinant Proteins/chemistry , Dose-Response Relationship, Drug , Escherichia coli/metabolism , Hydrogen-Ion Concentration , Kinetics , Molecular Weight , Polyethylene Glycols/chemistry , Salts/chemistry , Water/chemistry

20.

Purification of recombinant phenylalanine dehydrogenase by partitioning in aqueous two-phase systems.

Mohamadi, Hamid Shahbaz; Omidinia, Eskandar.

J Chromatogr B Analyt Technol Biomed Life Sci ; 854(1-2): 273-8, 2007 Jul 01.

Article in English | MEDLINE | ID: mdl-17537685

ABSTRACT

This study presents the partitioning and purification of recombinant Bacillus badius phenylalanine dehydrogenase (PheDH) in aqueous two-phase systems (ATPS) composed of polyethylene glycol 6000 (PEG-6000) and ammonium sulfate. A single-step operation of ATPS was developed for extraction and purification of recombinant PheDH from E. coli BL21 (DE3). The influence of system parameters including; PEG molecular weight and concentration, pH, (NH(4))(2)SO(4) concentration and NaCl salt addition on enzyme partitioning were investigated. The best optimal system for the partitioning and purification of PheDH was 8.5% (w/w) PEG-6000, 17.5% (w/w) (NH(4))(2)SO(4) and 13% (w/w) NaCl at pH 8.0. The partition coefficient, recovery, yield, purification factor and specific activity values were of 92.57, 141%, 95.85%, 474.3 and 10424.97 U/mg, respectively. Also the K(m) values for L-phenylalanine and NAD(+) in oxidative deamination were 0.020 and 0.13 mM, respectively. Our data suggested that this ATPS could be an economical and attractive technology for large-scale purification of recombinant PheDH.

Subject(s)

Amino Acid Oxidoreductases/isolation & purification , Bacillus/enzymology , Electrophoresis, Polyacrylamide Gel , Molecular Weight , Recombinant Proteins/isolation & purification , Water

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL