REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.

Marchet, Camille; Iqbal, Zamin; Gautheret, Daniel; Salson, Mikaël; Chikhi, Rayan

Affiliation

Marchet C; CNRS, UMR 9189 - CRIStAL, Université de Lille, F-59000 Lille, France.
Iqbal Z; European Bioinformatics Institute, Cambridge CB10 1SD, UK.
Gautheret D; CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette 91190, France.
Salson M; CNRS, UMR 9189 - CRIStAL, Université de Lille, F-59000 Lille, France.
Chikhi R; Institut Pasteur, CNRS, C3BI - USR 3756, 75015 Paris, France.

Bioinformatics; 36(Suppl_1): i177-i185, 2020 07 01.

Article in En | MEDLINE | ID: mdl-32657392

ABSTRACT

MOTIVATION:

In this work we present REINDEER, a novel computational method that indexes sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

RESULTS:

We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ~4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.

AVAILABILITY AND IMPLEMENTATION:

https://github.com/kamimrcht/REINDEER.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
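As an illustration of the idea summarized in the Results, the toy Python sketch below stores one count per dataset for every k-mer, answers presence/absence and abundance queries, and groups consecutive k-mers that share the same count vector. It is not REINDEER's implementation: the class `ToyAbundanceIndex` and its methods are hypothetical names, and it uses a plain dictionary instead of compacted de Bruijn graphs, ignores reverse complements, and requires identical (not merely similar) abundances within a group.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads of one dataset."""
    counts = defaultdict(int)
    for read in reads:
        for kmer in kmers(read, k):
            counts[kmer] += 1
    return counts


class ToyAbundanceIndex:
    """Hypothetical toy index: k-mer -> one count per dataset."""

    def __init__(self, datasets, k):
        self.k = k
        self.n = len(datasets)
        self.counts = {}
        for i, reads in enumerate(datasets):
            for kmer, c in count_kmers(reads, k).items():
                # Create an all-zero vector the first time a k-mer is seen.
                self.counts.setdefault(kmer, [0] * self.n)[i] = c

    def abundance(self, kmer):
        """Count vector of a k-mer; all zeros if it was never seen."""
        return tuple(self.counts.get(kmer, [0] * self.n))

    def presence(self, kmer):
        """Exact presence/absence of a k-mer in each dataset."""
        return tuple(c > 0 for c in self.abundance(kmer))

    def runs(self, seq):
        """Group consecutive k-mers of seq that share one count vector.

        Simplified stand-in for the grouping idea behind monotigs.
        """
        groups = []
        for kmer in kmers(seq, self.k):
            vec = self.abundance(kmer)
            if groups and groups[-1][1] == vec:
                groups[-1][0].append(kmer)
            else:
                groups.append(([kmer], vec))
        return groups


# Two made-up datasets, indexed with k = 3.
index = ToyAbundanceIndex([["ACGTAC"], ["ACGTT", "ACGT"]], k=3)
print(index.abundance("ACG"))
print(index.presence("GTA"))
print(index.runs("ACGTAC"))
```

Storing one count vector per group rather than per k-mer is what makes such grouping attractive for space: in this toy input, the four k-mers of `ACGTAC` collapse into two groups.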

Subjects

Sequence Analysis, DNA; Software; Algorithms; Humans; Sequence Analysis, RNA

Full text: 1 Database: MEDLINE Main subject: Software / Sequence Analysis, DNA Limit: Humans Language: En Publication year: 2020 Document type: Article