Pesquisa | Portal Regional da BVS

1.

Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome.

Lee, HoJoon; Greer, Stephanie U; Pavlichin, Dmitri S; Zhou, Bo; Urban, Alexander E; Weissman, Tsachy; Ji, Hanlee P.

Cell Rep Methods ; 3(8): 100543, 2023 08 28.

Artigo em Inglês | MEDLINE | ID: mdl-37671027

RESUMO

The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.

Assuntos

Neoplasias de Células Escamosas , Neoplasias Cutâneas , Humanos , Sequência Conservada , Haploidia , Polimorfismo Genético

2.

Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing.

Lau, Billy; Chandak, Shubham; Roy, Sharmili; Tatwawadi, Kedar; Wootters, Mary; Weissman, Tsachy; Ji, Hanlee P.

Sci Rep ; 13(1): 8514, 2023 05 25.

Artigo em Inglês | MEDLINE | ID: mdl-37231057

RESUMO

The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations with scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repetitive and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially-scalable data address space using only small numbers of targeting primers for assembly and readout.

Assuntos

Nanoporos , DNA/genética , Análise de Sequência de DNA , Oligonucleotídeos , Sequenciamento de Nucleotídeos em Larga Escala , Fenômenos Magnéticos

3.

Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach.

Meng, Qingxi; Chandak, Shubham; Zhu, Yifan; Weissman, Tsachy.

Sci Rep ; 13(1): 2082, 2023 02 06.

Artigo em Inglês | MEDLINE | ID: mdl-36747011

RESUMO

The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base which is 3-6[Formula: see text] lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4[Formula: see text] faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring .

Assuntos

Compressão de Dados , Sequenciamento por Nanoporos , Humanos , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Software , Genoma Humano , Análise de Sequência de DNA

4.

The Human Pangenome Project: a global resource to map genomic diversity.

Wang, Ting; Antonacci-Fulton, Lucinda; Howe, Kerstin; Lawson, Heather A; Lucas, Julian K; Phillippy, Adam M; Popejoy, Alice B; Asri, Mobin; Carson, Caryn; Chaisson, Mark J P; Chang, Xian; Cook-Deegan, Robert; Felsenfeld, Adam L; Fulton, Robert S; Garrison, Erik P; Garrison, Nanibaa' A; Graves-Lindsay, Tina A; Ji, Hanlee; Kenny, Eimear E; Koenig, Barbara A; Li, Daofeng; Marschall, Tobias; McMichael, Joshua F; Novak, Adam M; Purushotham, Deepak; Schneider, Valerie A; Schultz, Baergen I; Smith, Michael W; Sofia, Heidi J; Weissman, Tsachy; Flicek, Paul; Li, Heng; Miga, Karen H; Paten, Benedict; Jarvis, Erich D; Hall, Ira M; Eichler, Evan E; Haussler, David.

Nature ; 604(7906): 437-446, 2022 04.

Artigo em Inglês | MEDLINE | ID: mdl-35444317

RESUMO

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.

Assuntos

Genoma Humano , Genômica , Genoma Humano/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA

5.

KmerKeys: a web resource for searching indexed genome assemblies and variants.

Pavlichin, Dmitri S; Lee, HoJoon; Greer, Stephanie U; Grimes, Susan M; Weissman, Tsachy; Ji, Hanlee P.

Nucleic Acids Res ; 50(W1): W448-W453, 2022 07 05.

Artigo em Inglês | MEDLINE | ID: mdl-35474383

RESUMO

K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences has challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information is enormous, particularly when involving complete genome assemblies. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.

Assuntos

Algoritmos , Análise de Sequência de DNA , Software , Humanos , Genoma Humano , Genômica/métodos , Análise de Sequência de DNA/métodos

6.

Classification and clustering of RNA crosslink-ligation data reveal complex structures and homodimers.

Zhang, Minjie; Hwang, Irena T; Li, Kongpan; Bai, Jianhui; Chen, Jian-Fu; Weissman, Tsachy; Zou, James Y; Lu, Zhipeng.

Genome Res ; 32(5): 968-985, 2022 05.

Artigo em Inglês | MEDLINE | ID: mdl-35332099

RESUMO

The recent development and application of methods based on the general principle of "crosslinking and proximity ligation" (crosslink-ligation) are revolutionizing RNA structure studies in living cells. However, extracting structure information from such data presents unique challenges. Here, we introduce a set of computational tools for the systematic analysis of data from a wide variety of crosslink-ligation methods, specifically focusing on read mapping, alignment classification, and clustering. We design a new strategy to map short reads with irregular gaps at high sensitivity and specificity. Analysis of previously published data reveals distinct properties and bias caused by the crosslinking reactions. We perform rigorous and exhaustive classification of alignments and discover eight types of arrangements that provide distinct information on RNA structures and interactions. To deconvolve the dense and intertwined gapped alignments, we develop a network/graph-based tool Crosslinked RNA Secondary Structure Analysis using Network Techniques (CRSSANT), which enables clustering of gapped alignments and discovery of new alternative and dynamic conformations. We discover that multiple crosslinking and ligation events can occur on the same RNA, generating multisegment alignments to report complex high-level RNA structures and multi-RNA interactions. We find that alignments with overlapped segments are produced from potential homodimers and develop a new method for their de novo identification. Analysis of overlapping alignments revealed potential new homodimers in cellular noncoding RNAs and RNA virus genomes in the Picornaviridae family. Together, this suite of computational tools enables rapid and efficient analysis of RNA structure and interaction data in living cells.

Assuntos

RNA não Traduzido , RNA , Algoritmos , Análise por Conglomerados , RNA/química , RNA/genética , RNA não Traduzido/química , Análise de Sequência de RNA/métodos , Software

7.

Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy.

Chandak, Shubham; Tatwawadi, Kedar; Sridhar, Srivatsan; Weissman, Tsachy.

Bioinformatics ; 36(22-23): 5313-5321, 2021 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-33325499

RESUMO

MOTIVATION: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. RESULTS: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35-50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (â²0.2% reduction) and consensus accuracy (â²0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. AVAILABILITYAND IMPLEMENTATION: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

8.

Denoising of Aligned Genomic Data.

Fischer-Hwang, Irena; Ochoa, Idoia; Weissman, Tsachy; Hernaez, Mikel.

Sci Rep ; 9(1): 15067, 2019 10 21.

Artigo em Inglês | MEDLINE | ID: mdl-31636330

RESUMO

Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .

Assuntos

Algoritmos , Genômica , Alinhamento de Sequência , Cromossomos Humanos/genética , Humanos

9.

SPRING: a next-generation compressor for FASTQ data.

Chandak, Shubham; Tatwawadi, Kedar; Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy.

Bioinformatics ; 35(15): 2674-2676, 2019 08 01.

Artigo em Inglês | MEDLINE | ID: mdl-30535063

RESUMO

MOTIVATION: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Compressão de Dados , Sequenciamento de Nucleotídeos em Larga Escala , Algoritmos , Genoma Humano , Genômica , Humanos , Análise de Sequência de DNA , Software

10.

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Chandak, Shubham; Tatwawadi, Kedar; Weissman, Tsachy.

Bioinformatics ; 34(4): 558-567, 2018 02 15.

Artigo em Inglês | MEDLINE | ID: mdl-29444237

RESUMO

Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact: schandak@stanford.edu. Supplementary information: Supplementary material are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.

Assuntos

Algoritmos , Compressão de Dados/métodos , Genoma , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Bactérias/genética , Eucariotos/genética , Genômica/métodos , Humanos , Software

11.

QVZ: lossy compression of quality values.

Malysa, Greg; Hernaez, Mikel; Ochoa, Idoia; Rao, Milind; Ganesan, Karthik; Weissman, Tsachy.

Bioinformatics ; 34(1): 179, 2018 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-29177464

12.

GeneComp, a new reference-based compressor for SAM files.

Long, Reggy; Hernaez, Mikel; Ochoa, Idoia; Weissman, Tsachy.

Proc Data Compress Conf ; 2017: 330-339, 2017 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-29046896

RESUMO

The affordability of DNA sequencing has led to unprecedented volumes of genomic data. These data must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignment, quality values, etc. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files, which can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor GeneComp allows the user to perform lossy compression of the quality scores, which have been proven to occupy more than half of the compressed file (when losslessly compressed). We show that the proposed compressor GeneComp overall achieves better compression ratios than previously proposed algorithms when working on lossless mode.

13.

Compressing Tabular Data via Pairwise Dependencies.

Pavlichin, Dmitri S; Ingber, Amir; Weissman, Tsachy.

Proc Data Compress Conf ; 2017: 455, 2017 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-29046897

14.

DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing.

Lee, Byunghan; Moon, Taesup; Yoon, Sungroh; Weissman, Tsachy.

PLoS One ; 12(7): e0181463, 2017.

Artigo em Inglês | MEDLINE | ID: mdl-28749987

RESUMO

We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.

Assuntos

Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequência de Bases , Simulação por Computador , Bases de Dados de Ácidos Nucleicos

15.

Effect of lossy compression of quality scores on variant calling.

Ochoa, Idoia; Hernaez, Mikel; Goldfeder, Rachel; Weissman, Tsachy; Ashley, Euan.

Brief Bioinform ; 18(2): 183-194, 2017 03 01.

Artigo em Inglês | MEDLINE | ID: mdl-26966283

RESUMO

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.

Assuntos

Bases de Dados Genéticas , Algoritmos , Compressão de Dados , Genoma , Genômica , Humanos , Análise de Sequência de DNA

16.

GTRAC: fast retrieval from compressed collections of genomic variants.

Tatwawadi, Kedar; Hernaez, Mikel; Ochoa, Idoia; Weissman, Tsachy.

Bioinformatics ; 32(17): i479-i486, 2016 09 01.

Artigo em Inglês | MEDLINE | ID: mdl-27587665

RESUMO

MOTIVATION: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether. RESULTS: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures. AVAILABILITY AND IMPLEMENTATION: The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Algoritmos , Compressão de Dados , Genômica , Análise de Sequência de DNA , Bases de Dados Genéticas , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Humanos

17.

Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes.

Steiner, Fabian; Dempfle, Steffen; Ingber, Amir; Weissman, Tsachy.

IEEE Trans Inf Theory ; 62(5): 2737-2747, 2016 May.

Artigo em Inglês | MEDLINE | ID: mdl-29398721

RESUMO

We study the problem of compression for the purpose of similarity identification, where similarity is measured by the mean square Euclidean distance between vectors. While the asymptotical fundamental limits of the problem - the minimal compression rate and the error exponent - were found in a previous work, in this paper we focus on the nonasymptotic domain and on practical, implementable schemes. We first present a finite blocklength achievability bound based on shape-gain quantization: The gain (amplitude) of the vector is compressed via scalar quantization and the shape (the projection on the unit sphere) is quantized using a spherical code. The results are numerically evaluated and they converge to the asymptotic values as predicted by the error exponent. We then give a nonasymptotic lower bound on the performance of any compression scheme, and compare to the upper (achievability) bound. For a practical implementation of such a scheme, we use wrapped spherical codes, studied by Hamkins and Zeger, and use the Leech lattice as an example for an underlying lattice. As a side result, we obtain a bound on the covering angle of any wrapped spherical code, as a function of the covering radius of the underlying lattice.

18.

Minimax Rate-optimal Estimation of KL Divergence between Discrete Distributions.

Han, Yanjun; Jiao, Jiantao; Weissman, Tsachy.

Int Symp Inf Theory Appl ; 2016: 256-260, 2016.

Artigo em Inglês | MEDLINE | ID: mdl-29457152

RESUMO

We refine the general methodology in [1] for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions with support size S comparable with the number of observations n. Specifically, we determine the "smooth" and "non-smooth" regimes based on the confidence set and the smoothness of the functional. In the "non-smooth" regime, we apply an unbiased estimator for a "suitable" polynomial approximation of the functional. In the "smooth" regime, we construct a bias corrected version of the Maximum Likelihood Estimator (MLE) based on Taylor expansion. We apply the general methodology to the problem of estimating the KL divergence between two discrete distributions from empirical data. We construct a minimax rate-optimal estimator which is adaptive in the sense that it does not require the knowledge of the support size nor the upper bound on the likelihood ratio. Moreover, the performance of the optimal estimator with n samples is essentially that of the MLE with n ln n samples, i.e., the effective sample size enlargement phenomenon holds.

19.

Rateless Lossy Compression via the Extremes.

No, Albert; Weissman, Tsachy.

IEEE Trans Inf Theory ; 62(10): 5484-5495, 2016 Oct.

Artigo em Inglês | MEDLINE | ID: mdl-29375154

RESUMO

We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties, and we, respectively, refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than (n2)/(log ß n) per source symbol for ß > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.

20.

A cluster-based approach to compression of Quality Scores.

Hernaez, Mikel; Ochoa, Idoia; Weissman, Tsachy.

Proc Data Compress Conf ; 2016: 261-270, 2016.

Artigo em Inglês | MEDLINE | ID: mdl-29057318

RESUMO

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA