Tigmint: correcting assembly errors using linked reads from large molecules.

Jackman, Shaun D; Coombe, Lauren; Chu, Justin; Warren, Rene L; Vandervalk, Benjamin P; Yeo, Sarah; Xue, Zhuyi; Mohamadi, Hamid; Bohlmann, Joerg; Jones, Steven J M; Birol, Inanc.

BMC Bioinformatics ; 19(1): 393, 2018 Oct 26.

Article in English | MEDLINE | ID: mdl-30367597

ABSTRACT

BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. RESULTS: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. CONCLUSIONS: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.
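
The contiguity figures quoted in the abstract are NGA50 values. As a rough illustration of what that metric measures, the sketch below computes NG50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the estimated genome size. NGA50, as reported by QUAST, applies the same calculation to aligned blocks after sequences are broken at misassemblies. The function name `ng50` and the toy lengths are illustrative assumptions, not taken from the paper.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches half of genome_size; the length of the last sequence
    added is the NG50. Returns 0 if the sequences never reach that point.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Toy example (hypothetical lengths): 8 + 5 = 13 reaches half of 20.
print(ng50([8, 5, 3, 2], genome_size=20))
```

Passing the lengths of aligned blocks instead of whole scaffolds gives the NGA50 variant, which is why an uncorrected misassembly inside a long scaffold lowers the reported value.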

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Software , Chromosomes, Human/genetics , Genome, Human , Genomics , Humans , Nanopores , Repetitive Sequences, Nucleic Acid

Recurrent tumor-specific regulation of alternative polyadenylation of cancer-related genes.

Xue, Zhuyi; Warren, René L; Gibb, Ewan A; MacMillan, Daniel; Wong, Johnathan; Chiu, Readman; Hammond, S Austin; Yang, Chen; Nip, Ka Ming; Ennis, Catherine A; Hahn, Abigail; Reynolds, Sheila; Birol, Inanc.

BMC Genomics ; 19(1): 536, 2018 Jul 13.

Article in English | MEDLINE | ID: mdl-30005633

ABSTRACT

BACKGROUND: Alternative polyadenylation (APA) results in messenger RNA molecules with different 3' untranslated regions (3' UTRs), affecting the molecules' stability, localization, and translation. APA is pervasive and implicated in cancer. Earlier reports on APA focused on 3' UTR length modifications and commonly characterized APA events as 3' UTR shortening or lengthening. However, such characterization oversimplifies the processing of 3' ends of transcripts and fails to adequately describe the various scenarios we observe. RESULTS: We built a cloud-based targeted de novo transcript assembly and analysis pipeline that incorporates our previously developed cleavage site prediction tool, KLEAT. We applied this pipeline to elucidate the APA profiles of 114 genes in 9939 tumor and 729 tissue normal samples from The Cancer Genome Atlas (TCGA). The full set of 10,668 RNA-Seq samples from 33 cancer types has not been utilized by previous APA studies. By comparing the frequencies of predicted cleavage sites between normal and tumor sample groups, we identified 77 events (i.e. gene-cancer type pairs) of tumor-specific APA regulation in 13 cancer types; for 15 genes, such regulation is recurrent across multiple cancers. Our results also support a previous report showing the 3' UTR shortening of FGF2 in multiple cancers. However, over half of the events we identified display complex changes to 3' UTR length that resist simple classification like shortening or lengthening. CONCLUSIONS: Recurrent tumor-specific regulation of APA is widespread in cancer. However, the regulation pattern that we observed in TCGA RNA-seq data cannot be described as straightforward 3' UTR shortening or lengthening. Continued investigation into this complex, nuanced regulatory landscape will provide further insight into its role in tumor formation and development.

Subject(s)

Neoplasms/genetics , RNA, Messenger/genetics , 3' Untranslated Regions , Cloud Computing , Databases, Genetic , Fibroblast Growth Factor 2/genetics , Gene Expression Regulation, Neoplastic , Humans , Neoplasm Recurrence, Local/genetics , Neoplasms/pathology , Polyadenylation , RNA Cleavage , RNA, Messenger/metabolism , Software

Konnector v2.0: pseudo-long reads from paired-end sequencing data.

Vandervalk, Benjamin P; Yang, Chen; Xue, Zhuyi; Raghavan, Karthika; Chu, Justin; Mohamadi, Hamid; Jackman, Shaun D; Chiu, Readman; Warren, René L; Birol, Inanç.

BMC Med Genomics ; 8 Suppl 3: S1, 2015.

Article in English | MEDLINE | ID: mdl-26399504

ABSTRACT

BACKGROUND: Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. RESULTS: Konnector uses a probabilistic and memory-efficient data structure called Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. CONCLUSIONS: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
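
The approach described in the abstract (a Bloom filter holding the k-mer spectrum, queried to walk an implicit de Bruijn graph between two flanking sequences) can be sketched as follows. This is a minimal illustration under stated assumptions, not Konnector's implementation: the `BloomFilter` class, the BLAKE2b-based hashing, the one-directional breadth-first search, and the `max_len` cutoff are all choices made for this example.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # One salted hash per hash function (hashing scheme is an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_kmer_filter(reads, k):
    """Insert every k-mer of every read into a Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def bridge_gap(bf, left, right, k, max_len=500):
    """Breadth-first search from the last k-mer of `left` to the first k-mer
    of `right`. Graph edges are implicit: a neighbour exists if the k-mer
    formed by shifting in one base is present in the filter.
    Returns the merged sequence, or None if no path is found."""
    start, goal = left[-k:], right[:k]
    queue = deque([(start, left)])
    seen = {start}
    while queue:
        kmer, path = queue.popleft()
        if kmer == goal:
            return path + right[k:]
        if len(path) >= max_len:
            continue
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in bf and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None


# Hypothetical toy data: overlapping reads from a 24-base sequence.
genome = "ATGCGTACCTGAAGTCCATTGGCA"
reads = [genome[i:i + 12] for i in range(0, 13, 4)]
bf = build_kmer_filter(reads, k=5)
print(bridge_gap(bf, genome[:8], genome[-8:], k=5))
```

Because the graph is never stored explicitly, memory use is governed by the size of the filter rather than by the number of edges; the price is that a false-positive k-mer can occasionally introduce a spurious branch.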

Subject(s)

Sequence Analysis, DNA/methods , Software , Algorithms , DNA/chemistry , High-Throughput Nucleotide Sequencing
