1.
Bioinformatics; 36(22-23): 5313-5321, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33325499

ABSTRACT

MOTIVATION: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second-generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has the potential to significantly reduce space requirements without adversely impacting the performance of downstream applications.

RESULTS: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide a 35-50% further reduction in the compressed size of raw data over the state-of-the-art lossless compressor, with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required to reach a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications.

AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
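
The paper evaluates existing lossy time-series compressors rather than proposing a new one. As a minimal sketch of the bounded-error idea such compressors build on, the snippet below uniformly quantizes a synthetic raw-signal array under a maximum absolute error; the signal statistics, the threshold of 2.0 and the function names are illustrative assumptions, and the real tools add prediction and entropy coding on top of the quantization step.

```python
# Minimal sketch: bounded-error lossy compression of a raw nanopore signal
# via uniform scalar quantization. The guarantee |reconstruction - original|
# <= maxerror is the error model used by time-series compressors such as
# those evaluated in the paper; everything else here is a toy stand-in.
import numpy as np

def quantize(signal: np.ndarray, maxerror: float) -> np.ndarray:
    """Map each sample to an integer index; a step of 2*maxerror
    guarantees a per-sample reconstruction error of at most maxerror."""
    step = 2.0 * maxerror
    return np.round(signal / step).astype(np.int32)

def reconstruct(indices: np.ndarray, maxerror: float) -> np.ndarray:
    return indices * (2.0 * maxerror)

rng = np.random.default_rng(0)
raw = rng.normal(500.0, 20.0, size=10_000)   # synthetic current samples
idx = quantize(raw, maxerror=2.0)            # small-alphabet indices entropy-code well
rec = reconstruct(idx, maxerror=2.0)
assert np.max(np.abs(rec - raw)) <= 2.0      # error bound holds for every sample
```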

2.
Bioinformatics; 35(15): 2674-2676, 2019 Aug 01.
Article in English | MEDLINE | ID: mdl-30535063

ABSTRACT

MOTIVATION: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression.

RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole-genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than the previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.

AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
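
As a rough illustration of why FASTQ structure matters for compression, the sketch below splits a toy FASTQ into identifier, base and quality streams and gzips each separately; the records and resulting sizes are synthetic assumptions, and SPRING's actual pipeline (read reordering, dedicated per-stream models) goes far beyond this stream-splitting step.

```python
# Minimal sketch: identifiers, bases and quality values are statistically
# very different, so compressing them as separate homogeneous streams is
# the starting point for specialized FASTQ compressors. Toy data only.
import gzip

# Hypothetical FASTQ records (4 lines per read: identifier, bases, '+', qualities).
records = [f"@read{i}\nACGTACGTACGTACGT\n+\nIIIIHHHHGGGGFFFF\n" for i in range(500)]
fastq = "".join(records).encode()

lines = fastq.decode().splitlines()
ids   = "\n".join(lines[0::4]).encode()   # identifier stream
seqs  = "\n".join(lines[1::4]).encode()   # base stream
quals = "\n".join(lines[3::4]).encode()   # quality stream

whole = len(gzip.compress(fastq))
split = sum(len(gzip.compress(s)) for s in (ids, seqs, quals))
print(f"gzip, whole file: {whole} B; gzip, split streams: {split} B")
```

On real datasets the homogeneous streams typically compress markedly better than the interleaved file; on this tiny synthetic example the per-stream gzip headers may mask the effect.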


Subject(s)
Data Compression; High-Throughput Nucleotide Sequencing; Algorithms; Genome, Human; Genomics; Humans; Sequence Analysis, DNA; Software
3.
Bioinformatics; 34(4): 558-567, 2018 Feb 15.
Article in English | MEDLINE | ID: mdl-29444237

ABSTRACT

MOTIVATION: Next-Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data.

RESULTS: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves a 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and of read compression algorithms in general) and helps explain its performance in practice. The algorithm compresses only the read sequences, works with unaligned FASTQ files, and does not require a reference.

AVAILABILITY AND IMPLEMENTATION: The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.

CONTACT: schandak@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
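
A minimal sketch of the hash-based reordering idea: reads that overlap in the genome share substrings, so bucketing reads by a hashed substring key groups genomic neighbors, which makes the reordered stream far more compressible. The minimizer key, k=8 and the toy reads below are illustrative assumptions; HARC's actual pipeline uses multiple hashed indices and careful encoding of the residual differences between matched reads.

```python
# Minimal sketch of approximate read reordering via a hashed substring
# index: reads sharing a substring land in the same bucket, so overlapping
# reads end up adjacent in the output order. Toy clustering step only.
from collections import defaultdict

def reorder(reads, k=8):
    """Bucket reads by their minimizer (lexicographically smallest k-mer),
    a stand-in for the hashed substring indices used by the real tool."""
    index = defaultdict(list)
    for r in reads:
        key = min(r[i:i + k] for i in range(len(r) - k + 1))
        index[key].append(r)
    out = []
    for key in sorted(index):       # deterministic bucket order
        out.extend(index[key])      # overlapping reads become neighbors
    return out

reads = ["ACGTACGTAAAGGTCA", "CGTACGTAAAGGTCAT", "TTTTCCCCGGGGAAAA"]
print(reorder(reads))               # first two reads cluster together
```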


Subject(s)
Algorithms; Data Compression/methods; Genome; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Bacteria/genetics; Eukaryota/genetics; Genomics/methods; Humans; Software
4.
Bioinformatics; 32(17): i479-i486, 2016 Sep 01.
Article in English | MEDLINE | ID: mdl-27587665

ABSTRACT

MOTIVATION: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Program, with the number of sequenced genomes on the order of 10 K to 1 M. Due to the large redundancy among the genomic sequences of individuals from the same species, most medical research deals with variants in the sequences relative to a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for the compression of this type of database lack efficient random access capabilities, making queries for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.

RESULTS: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples into 1.1 GB (a compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor and tailored succinct data structures.

AVAILABILITY AND IMPLEMENTATION: The GTRAC algorithm is available for download at https://github.com/kedartatwawadi/GTRAC.

CONTACT: kedart@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
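
A minimal sketch of the random-access property under simple assumptions: if each sample's variant bit-vector is compressed as an independent block, one sample can be decoded without scanning the whole database. GTRAC itself is far more effective, using a specialized Lempel-Ziv compressor that can reference previously seen samples plus succinct data structures for variant-level queries; the zlib blocks and toy genotype rows below are stand-ins, not the paper's method.

```python
# Minimal sketch of random access over a compressed variant matrix:
# per-sample independently decodable blocks. Toy stand-in for GTRAC.
import zlib

def compress_samples(matrix):
    """matrix[s] holds sample s's variant presence/absence bit-vector
    as bytes; each sample becomes an independently decodable block."""
    return [zlib.compress(row) for row in matrix]

def decompress_sample(blocks, s):
    return zlib.decompress(blocks[s])   # one block lookup, no full scan

samples = [bytes([i % 2] * 1000) for i in range(4)]   # toy genotype rows
blocks = compress_samples(samples)
assert decompress_sample(blocks, 2) == samples[2]
```

The design tradeoff this illustrates: fully independent blocks give trivial random access but forgo cross-sample redundancy, which is exactly what GTRAC's cross-referencing Lempel-Ziv scheme recovers while keeping queries fast.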


Subject(s)
Algorithms; Data Compression; Genomics; Sequence Analysis, DNA; Databases, Genetic; Genome; High-Throughput Nucleotide Sequencing; Humans
5.
Sci Rep; 13(1): 8514, 2023 May 25.
Article in English | MEDLINE | ID: mdl-37231057

ABSTRACT

The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations in scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repeated and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing, despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially scalable data address space using only a small number of targeting primers for assembly and readout.
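
As a small illustration of the code family involved, the sketch below implements a textbook rate-1/2 convolutional encoder (generators 7 and 5 in octal); the generator choice and constraint length are assumptions, not MDRAM's actual parameters, and the paper's key component, a soft-information decoder operating on raw nanopore signals, is not shown.

```python
# Minimal sketch of a rate-1/2 convolutional encoder, the family of codes
# MDRAM builds on. Generators (7, 5 octal) and k=3 are textbook choices,
# not the paper's parameters; the soft-decision decoder is omitted.
def conv_encode(bits, g1=0b111, g2=0b101, k=3):
    """Encode a list of 0/1 bits; each input bit yields two parity bits."""
    state = 0
    out = []
    for b in bits:
        state = ((state << 1) | b) & ((1 << k) - 1)  # shift bit into register
        out.append(bin(state & g1).count("1") % 2)   # parity under generator g1
        out.append(bin(state & g2).count("1") % 2)   # parity under generator g2
    return out

print(conv_encode([1, 0, 1, 1]))   # 4 input bits -> 8 coded bits
```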


Subject(s)
Nanopores; DNA/genetics; Sequence Analysis, DNA; Oligonucleotides; High-Throughput Nucleotide Sequencing; Magnetic Phenomena