Prokrustean Graph: A substring index supporting rapid enumeration across a range of k-mer sizes.

Park, Adam; Koslicki, David

Affiliations

Park A; Department of Computer Science and Engineering, Pennsylvania State University, PA, USA.
Koslicki D; Department of Computer Science and Engineering, Department of Biology, and the Huck Institutes of the Life Sciences, Pennsylvania State University, PA, USA.

bioRxiv; 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38853857

ABSTRACT

Despite the widespread adoption of k-mer-based methods in bioinformatics, a fundamental question persists: How can we quantify the influence of k sizes in applications? With no universal answer available, choosing an optimal k size or employing multiple k sizes remains application-specific, arbitrary, and computationally expensive. The assessment of the primary parameter k is typically empirical, based on the end products of applications that pass through complex processes of genome analysis, comparison, assembly, alignment, and error correction. The elusiveness of the problem stems from a limited understanding of the transitions of k-mers with respect to k sizes. Indeed, there is considerable room for improving both practice and theory by exploring k-mer-specific quantities across multiple k sizes. This paper introduces an algorithmic framework built upon a novel substring representation: the Prokrustean graph. The primary functionality of this framework is to extract various k-mer-based quantities across a range of k sizes, but its computational complexity depends only on maximal repeats, not on the k range. For example, counting maximal unitigs of de Bruijn graphs for k = 10, ..., 100 takes just a few seconds with a Prokrustean graph built on a read set of gigabases in size. This efficiency sets the graph apart from other substring indices, such as the FM-index, which are normally optimized for string pattern searching rather than for depicting the substring structure across varying lengths. However, the Prokrustean graph is expected to close this gap, as it can be built using the extended Burrows-Wheeler Transform (eBWT) in a space-efficient manner. The framework is particularly useful in pangenome and metagenome analyses, where the demand for precise multi-k approaches is increasing due to the complex and diverse nature of the information being managed.
We introduce four applications implemented with the framework that extract key quantities actively utilized in modern pangenomics and metagenomics.
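
To make concrete what "a k-mer-based quantity across a range of k sizes" means, the following is a minimal Python sketch of a naive baseline that counts distinct k-mers for every k in a range. It is not the Prokrustean graph algorithm and is not taken from the paper; the function name `distinct_kmer_counts` and the toy reads are hypothetical, chosen only for illustration. The sketch re-scans the whole input once per k, so its cost grows with the width of the k range, which is the dependence the abstract says the framework avoids.

```python
def distinct_kmer_counts(sequences, k_min, k_max):
    """Naive baseline: return {k: number of distinct k-mers} for k_min..k_max.

    Hypothetical helper for illustration only. Every k triggers a full
    re-scan of the input, so the work scales with the size of the k range.
    """
    counts = {}
    for k in range(k_min, k_max + 1):
        seen = set()
        for seq in sequences:
            # Slide a window of length k over the sequence.
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        counts[k] = len(seen)
    return counts


# Toy input (an assumption, not data from the paper).
reads = ["ACGTT", "CGTTA"]
result = distinct_kmer_counts(reads, 2, 5)
print(result)
```

On real read sets of gigabase size, repeating such a scan for each of the 91 values in k = 10, ..., 100 is the kind of per-k expense that motivates an index whose query cost does not depend on the k range.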

Keywords

Applied computing → Computational biology; BWT; FM-index; genome assembly; k-mer; k-mer spectra; metagenomics; pangenomics


Full text: 1 | Database: MEDLINE | Language: English | Publication year: 2024 | Document type: Article
