Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.
Khan, Jamshed; Patro, Rob.
Affiliation
  • Khan J; Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
  • Patro R; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.
Bioinformatics; 37(Suppl_1): i177-i186, 2021 07 12.
Article in En | MEDLINE | ID: mdl-34252958
ABSTRACT
MOTIVATION:

The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.

RESULTS:

We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata's state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively.

AVAILABILITY AND IMPLEMENTATION:

Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
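The idea of treating each de Bruijn graph vertex as a small finite-state automaton can be illustrated with a toy sketch. The code below is not Cuttlefish's implementation: the function names `step` and `classify`, the state labels, and the simplification of ignoring reverse complements (so k-mers are not canonicalized) are all assumptions made for illustration. Each k-mer keeps one state per side, which can only be "unseen", a single neighboring nucleotide, or "multi", so a vertex never needs more than a few bits regardless of how many times it occurs in the input.

```python
from collections import defaultdict

# Hypothetical per-side states for this sketch: no neighbor seen yet,
# exactly one distinct neighbor (stored as its nucleotide), or several.
UNSEEN = "unseen"
MULTI = "multi"


def step(state, base):
    """Transition one side of a vertex after observing neighbor `base`."""
    if state == UNSEEN:
        return base          # first neighbor seen on this side
    if state == MULTI or state != base:
        return MULTI         # a second, different neighbor: branching side
    return state             # same neighbor again: state unchanged


def classify(sequences, k):
    """Scan all sequences and return {kmer: [in_state, out_state]}.

    Simplification (assumption): only the forward strand is considered,
    so reverse-complement k-mers are treated as unrelated vertices.
    """
    states = defaultdict(lambda: [UNSEEN, UNSEEN])
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            sides = states[seq[i:i + k]]
            if i > 0:                      # nucleotide preceding the k-mer
                sides[0] = step(sides[0], seq[i - 1])
            if i + k < len(seq):           # nucleotide following the k-mer
                sides[1] = step(sides[1], seq[i + k])
    return dict(states)


if __name__ == "__main__":
    for kmer, (inc, out) in sorted(classify(["ACGTA", "ACGTC"], 3).items()):
        print(kmer, inc, out)
```

In this toy input, the k-mer `CGT` is followed by `A` in one sequence and `C` in the other, so its outgoing side ends in the "multi" state and a maximal non-branching path would have to stop there. With six possible values per side, the full state of a vertex in this sketch fits in well under one byte.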
Subjects

Full text: 1 | Database: MEDLINE | Main subject: Decapodiformes / Genomics | Limits: Animals / Humans | Language: En | Publication year: 2021 | Document type: Article
