Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression.

Liu, Yuansheng; Yu, Zuguo; Dinger, Marcel E; Li, Jinyan

Affiliation

Liu Y; Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia.
Yu Z; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China.
Dinger ME; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia.
Li J; Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia.

Bioinformatics; 35(12): 2066-2074, 2019-06-01.

Article in English | MEDLINE | ID: mdl-30407482

ABSTRACT

MOTIVATION:

Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads according to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on novel ways of constructing de novo assembled contigs.

RESULTS:

We introduce a new de novo compression algorithm named minicom. The algorithm uses large k-minimizers to index the reads and places reads that share the same minimizer into the same subgroup. Within each subgroup, a contig is constructed. Pairs of contigs derived from the subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between the two contigs. This merging process is repeated on the longer contigs until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size in 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the read-order-preserving mode. This performance is mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs.

AVAILABILITY AND IMPLEMENTATION:

https://github.com/yuansliu/minicom

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
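As an illustration of the indexing step described in the results, the following is a minimal Python sketch of grouping reads by their k-minimizer and of selecting (w, k)-minimizers along a sequence. It is not the minicom implementation: the function names `kmer_minimizer`, `group_reads` and `wk_minimizers` are hypothetical, plain lexicographic order is assumed as the k-mer ordering, and reverse complements are ignored.

```python
from collections import defaultdict


def kmer_minimizer(read, k):
    """Smallest k-mer of a read under lexicographic order (assumed ordering)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_reads(reads, k):
    """Place reads that share the same k-minimizer into the same subgroup."""
    groups = defaultdict(list)
    for read in reads:
        groups[kmer_minimizer(read, k)].append(read)
    return dict(groups)


def wk_minimizers(seq, w, k):
    """(position, k-mer) pairs: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, so ties go to the leftmost k-mer
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


# Toy data (made up for illustration only)
reads = ["ACGTAC", "CGTACG", "TTTTGG"]
groups = group_reads(reads, k=3)
contig_index = wk_minimizers("ACGTAC", w=2, k=3)
```

In this toy example the first two reads share the minimizer `ACG` and fall into one subgroup, which is the kind of grouping from which a contig would then be built. Two contigs that share a (w, k)-minimizer become candidates for a suffix-prefix overlap check, so only those pairs need to be compared.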

Subjects

Data Compression; Software; Algorithms; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Whole Genome Sequencing


Full text: 1 Database: MEDLINE Main subject: Software / Data Compression Language: En Publication year: 2019 Document type: Article
