CALQ: compression of quality values of aligned sequencing data.
Voges, Jan; Ostermann, Jörn; Hernaez, Mikel.
Affiliation
  • Voges J; Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, 30167 Hannover, Germany.
  • Ostermann J; Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, 30167 Hannover, Germany.
  • Hernaez M; Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, IL 61801, USA.
Bioinformatics; 34(10): 1650-1658, 2018-05-15.
Article in English | MEDLINE | ID: mdl-29186284
ABSTRACT
Motivation:

Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses.

Results:

We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets.

Availability and implementation:

CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.

Contact:

voges@tnt.uni-hannover.de or mhernaez@illinois.edu.

Supplementary information:

Supplementary data are available at Bioinformatics online.
Subject(s)

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Software / Genomics / Data Compression / High-Throughput Nucleotide Sequencing Study type: Risk_factors_studies Limits: Humans Language: En Journal: Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2018 Document type: Article Affiliation country: Germany
