Lossless indexing with counting de Bruijn graphs.
Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André.
Affiliation
  • Karasikov M; Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland.
  • Mustafa H; Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland.
  • Rätsch G; Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
  • Kahles A; Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland.
Genome Res; 2022 May 24.
Article in English | MEDLINE | ID: mdl-35609994
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in representations over eightfold smaller than those of state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes that are on average 27% smaller than the gzip-compressed input for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
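To make the central idea concrete, the following toy sketch attaches a count and a list of positions to each (k-mer, label) relation. It is an illustration only, not the authors' implementation: it uses plain Python dictionaries with no compression, and the function names (`build_counting_index`, `query_counts`, `reconstruct`) and the sample data are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, k-mer) pairs for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_counting_index(samples, k):
    """Index every k-mer with per-label counts and positions.

    samples maps a label (e.g., a sample name) to one sequence.
    Returns two nested dicts: counts[kmer][label] and positions[kmer][label].
    """
    counts = defaultdict(lambda: defaultdict(int))
    positions = defaultdict(lambda: defaultdict(list))
    for label, seq in samples.items():
        for pos, kmer in kmers(seq, k):
            counts[kmer][label] += 1
            positions[kmer][label].append(pos)
    return counts, positions


def query_counts(counts, query, k):
    """For each k-mer of the query, return its {label: count} attributes."""
    return [dict(counts.get(kmer, {})) for _, kmer in kmers(query, k)]


def reconstruct(positions, label):
    """Rebuild the sequence of one label from positional annotations.

    Assumes every k-mer position of the original sequence was indexed,
    so consecutive k-mers overlap by k - 1 characters.
    """
    placed = sorted(
        (pos, kmer)
        for kmer, by_label in positions.items()
        for pos in by_label.get(label, [])
    )
    if not placed:
        return ""
    # The first k-mer is kept whole; each later one contributes its last base.
    return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


if __name__ == "__main__":
    # Hypothetical toy input: two labels with one short sequence each.
    samples = {"A": "ACGTACGT", "B": "ACGTTT"}
    counts, positions = build_counting_index(samples, k=3)
    print(query_counts(counts, "ACGT", k=3))
    print(reconstruct(positions, "A"))
```

Storing positions rather than only counts is what allows the input to be recovered exactly in this sketch, which mirrors the lossless property described in the abstract. The compressed representations that achieve the reported index sizes are not shown here.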

Full text: 1 Collection: 01-international Database: MEDLINE Language: En Journal: Genome Res Journal subject: MOLECULAR BIOLOGY / GENETICS Year: 2022 Document type: Article Country of affiliation: Switzerland