PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.
Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Zhong, Cheng; Yan, Meng; Liu, Xiaoguang; Wang, Gang.
Affiliation
  • Sun H; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
  • Zheng Y; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
  • Xie H; Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China.
  • Ma H; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
  • Zhong C; Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China.
  • Yan M; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
  • Liu X; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
  • Wang G; Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
Bioinformatics; 40(5), 2024 May 02.
Article in English | MEDLINE | ID: mdl-38759114
ABSTRACT
MOTIVATION:

Quality scores data (QSD) account for 70% of compressed FastQ files obtained from short- and long-read sequencing technologies. Designing effective QSD compressors that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills these requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model designed to enhance the compression ratio. In addition, PQSDC is designed to be highly concurrent on multicore CPU clusters.
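The two ideas named above, partitioning the sequences so that blocks can be processed concurrently and exploiting runs of repeated quality symbols, can be illustrated with a minimal Python sketch. It is not the PQSDC implementation: the four-level prediction mapping is not reproduced, `zlib` stands in for whatever back-end coder the tool actually uses, and every function name here (`run_length_encode`, `partition`, `parallel_compress`, and so on) is hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def run_length_encode(quality):
    """Collapse a quality string into (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def run_length_decode(runs):
    """Invert run_length_encode."""
    return "".join(symbol * count for symbol, count in runs)


def partition(sequences, n_blocks):
    """Split the sequences into contiguous blocks of near-equal size."""
    size = max(1, -(-len(sequences) // n_blocks))  # ceiling division
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]


def compress_block(block):
    # Assumption: zlib is only a stand-in for the real entropy coder.
    return zlib.compress("\n".join(block).encode("ascii"), 9)


def decompress_block(payload):
    return zlib.decompress(payload).decode("ascii").split("\n")


def parallel_compress(sequences, n_blocks=4):
    """Compress each block independently so blocks can run concurrently."""
    blocks = partition(sequences, n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        return list(pool.map(compress_block, blocks))


def parallel_decompress(payloads):
    """Decompress the blocks concurrently and restore the original order."""
    with ThreadPoolExecutor(max_workers=max(1, len(payloads))) as pool:
        blocks = list(pool.map(decompress_block, payloads))
    return [sequence for block in blocks for sequence in block]
```

Because each block is compressed on its own, only the blocks currently being processed need to be held in memory, which is the general motivation for partitioning; how PQSDC itself sizes and schedules its partitions is not stated in the abstract.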

RESULTS:

We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to the baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well.

AVAILABILITY AND IMPLEMENTATION:

The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
Subject(s)

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Algorithms / Data Compression Limit: Humans Language: En Journal: Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2024 Document type: Article Country of affiliation: China
