Search | BVS Regional Portal

Genie: the first open-source ISO/IEC encoder for genomic data.

Müntefering, Fabian; Adhisantoso, Yeremia Gunawan; Chandak, Shubham; Ostermann, Jörn; Hernaez, Mikel; Voges, Jan.

Commun Biol ; 7(1): 553, 2024 May 09.

Article in English | MEDLINE | ID: mdl-38724695

ABSTRACT

For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data in an efficient and interoperable way, ISO and IEC released the first version of the compression standard MPEG-G in 2019. However, non-proprietary implementations of the standard are not openly available so far, limiting fair scientific assessment of the standard and, therefore, hindering its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any other standard-compliant decoder independent from its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.

Subjects

Data Compression; Genomics; Software; Genomics/methods; Data Compression/methods; Humans

Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing.

Lau, Billy; Chandak, Shubham; Roy, Sharmili; Tatwawadi, Kedar; Wootters, Mary; Weissman, Tsachy; Ji, Hanlee P.

Sci Rep ; 13(1): 8514, 2023 May 25.

Article in English | MEDLINE | ID: mdl-37231057

ABSTRACT

The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations with scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repetitive and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially-scalable data address space using only small numbers of targeting primers for assembly and readout.

Subjects

Nanopores; DNA/genetics; Sequence Analysis, DNA; Oligonucleotides; High-Throughput Nucleotide Sequencing; Magnetic Phenomena

Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach.

Meng, Qingxi; Chandak, Shubham; Zhu, Yifan; Weissman, Tsachy.

Sci Rep ; 13(1): 2082, 2023 Feb 06.

Article in English | MEDLINE | ID: mdl-36747011

ABSTRACT

The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base which is 3-6× lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
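
The "bits per base" figure quoted above can be read as compressed size in bits divided by the number of bases. The sketch below is only an illustration of that arithmetic: the function name and the example sizes are hypothetical and are not taken from NanoSpring. For comparison, a fixed-length code for the four bases A, C, G and T needs 2 bits per base.

```python
def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    """Average number of compressed bits spent per sequenced base."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    return 8 * compressed_bytes / num_bases


# Hypothetical sizes (not from the paper): 1 Gbase of reads stored in 50 MB.
rate = bits_per_base(50_000_000, 1_000_000_000)
print(f"{rate:.2f} bits per base")  # prints "0.40 bits per base"
```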

Subjects

Data Compression; Nanopore Sequencing; Humans; Algorithms; High-Throughput Nucleotide Sequencing; Software; Genome, Human; Sequence Analysis, DNA

Expanding the Molecular Alphabet of DNA-Based Data Storage Systems with Neural Network Nanopore Readout Processing.

Tabatabaei, S Kasra; Pham, Bach; Pan, Chao; Liu, Jingqian; Chandak, Shubham; Shorkey, Spencer A; Hernandez, Alvaro G; Aksimentiev, Aleksei; Chen, Min; Schroeder, Charles M; Milenkovic, Olgica.

Nano Lett ; 22(5): 1905-1914, 2022 Mar 09.

Article in English | MEDLINE | ID: mdl-35212544

ABSTRACT

DNA is a promising next-generation data storage medium, but challenges remain with synthesis costs and recording latency. Here, we describe a prototype of a DNA data storage system that uses an extended molecular alphabet combining natural and chemically modified nucleotides. Our results show that MspA nanopores can discriminate different combinations and ordered sequences of natural and chemically modified nucleotides in custom-designed oligomers. We further demonstrate single-molecule sequencing of the extended alphabet using a neural network architecture that classifies raw current signals generated by Oxford Nanopore sequencers with an average accuracy exceeding 60% (39× larger than random guessing). Molecular dynamics simulations show that the majority of modified nucleotides lead to only minor perturbations of the DNA double helix. Overall, the extended molecular alphabet may potentially offer a nearly 2-fold increase in storage density and potentially the same order of reduction in the recording latency, thereby enabling new implementations of molecular recorders.

Subjects

Nanopores; DNA/genetics; Data Systems; Information Storage and Retrieval; Neural Networks, Computer; Nucleotides/chemistry; Nucleotides/genetics; Sequence Analysis, DNA/methods

Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy.

Chandak, Shubham; Tatwawadi, Kedar; Sridhar, Srivatsan; Weissman, Tsachy.

Bioinformatics ; 36(22-23): 5313-5321, 2021 Apr 01.

Article in English | MEDLINE | ID: mdl-33325499

ABSTRACT

MOTIVATION: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. RESULTS: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35-50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

SPRING: a next-generation compressor for FASTQ data.

Chandak, Shubham; Tatwawadi, Kedar; Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy.

Bioinformatics ; 35(15): 2674-2676, 2019 Aug 01.

Article in English | MEDLINE | ID: mdl-30535063

ABSTRACT

MOTIVATION: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Data Compression; High-Throughput Nucleotide Sequencing; Algorithms; Genome, Human; Genomics; Humans; Sequence Analysis, DNA; Software

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Chandak, Shubham; Tatwawadi, Kedar; Weissman, Tsachy.

Bioinformatics ; 34(4): 558-567, 2018 Feb 15.

Article in English | MEDLINE | ID: mdl-29444237

ABSTRACT

Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact: schandak@stanford.edu. Supplementary information: Supplementary material are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
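
The idea of reordering reads with hashed substring indices can be illustrated with a toy sketch. This is not the HARC implementation. The function names, the choice of indexing each read by its first `k` bases, and the greedy chaining rule are all assumptions made for illustration, and the match is checked only on the k-mer, not on the full overlap.

```python
from collections import defaultdict


def reorder_reads(reads, k=4):
    """Greedy toy reordering: after emitting a read, pick next an unused
    read whose first k bases occur inside the read just emitted."""
    index = defaultdict(list)  # k-mer prefix -> indices of reads starting with it
    for i, read in enumerate(reads):
        index[read[:k]].append(i)

    used = [False] * len(reads)
    order = []
    for start in range(len(reads)):
        if used[start]:
            continue
        current = start
        while current is not None:
            used[current] = True
            order.append(current)
            current = _find_next(reads[current], index, used, k)
    return [reads[i] for i in order]


def _find_next(read, index, used, k):
    # Try shifts 1, 2, ...: look up the k-mer at each offset in the prefix index.
    for shift in range(1, len(read) - k + 1):
        for j in index.get(read[shift:shift + k], []):
            if not used[j]:
                return j
    return None


# Hypothetical reads: three overlap like neighbours on a genome, one does not.
reads = ["ACGTAC", "TTTTGG", "CGTACG", "GTACGT"]
print(reorder_reads(reads))
# prints ['ACGTAC', 'CGTACG', 'GTACGT', 'TTTTGG']
```

Placing overlapping reads next to each other is what lets a downstream encoder describe each read as a small difference from its predecessor; the sketch stops at the reordering step and does no encoding.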

Subjects

Algorithms; Data Compression/methods; Genome; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Bacteria/genetics; Eukaryota/genetics; Genomics/methods; Humans; Software
