Label-guided seed-chain-extend alignment on annotated De Bruijn graphs.

Mustafa, Harun; Karasikov, Mikhail; Mansouri Ghiasi, Nika; Rätsch, Gunnar; Kahles, André

Affiliation

Mustafa H; Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland.
Karasikov M; Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland.
Mansouri Ghiasi N; Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland.
Rätsch G; Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland.
Kahles A; Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland.

Bioinformatics; 40(Suppl 1): i337-i346, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION:

Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
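The fragmentation problem described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' index: it assumes an annotated DBG can be modelled as a plain dictionary from k-mers to sets of sample labels, and the names `build_annotated_dbg` and `longest_matching_run` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


def build_annotated_dbg(samples, k):
    """Map each k-mer (graph node) to the set of sample labels containing it."""
    annotation = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                annotation[read[i:i + k]].add(label)
    return annotation


def longest_matching_run(query, annotation, k, label=None):
    """Longest run of consecutive query k-mers present in the graph.

    If `label` is given, only nodes annotated with that label count,
    which mimics searching a single-labelled subgraph.
    """
    best = run = 0
    for i in range(len(query) - k + 1):
        labels = annotation.get(query[i:i + k], set())
        hit = bool(labels) if label is None else label in labels
        run = run + 1 if hit else 0
        best = max(best, run)
    return best


# Two low-depth samples, each covering only part of the query (toy data).
samples = {"A": ["ACGTTG"], "B": ["TTGCAAG"]}
query = "ACGTTGCAAG"
annotation = build_annotated_dbg(samples, k=4)

run_a = longest_matching_run(query, annotation, 4, label="A")
run_b = longest_matching_run(query, annotation, 4, label="B")
run_any = longest_matching_run(query, annotation, 4)
```

In this toy setting neither single-label subgraph spans the whole query, while ignoring labels does; the trade-off is that a label-free walk may combine samples with no biological relationship.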

RESULTS:

We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.

AVAILABILITY AND IMPLEMENTATION:

The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
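The idea of a similarity-aware 'Label Change' during chaining can be sketched in a few lines. The abstract does not give the actual scoring function, so everything below is an assumption made for illustration: anchors are reduced to `(query_start, length, label)` tuples without graph coordinates, sample similarity is approximated by Jaccard similarity of k-mer sets, and the penalty form `max_penalty * (1 - similarity)` and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (stand-in for global sample similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def chain_with_label_change(anchors, sample_kmers, max_penalty=4.0):
    """Best chain score over anchors given as (query_start, length, label).

    Chained anchors must not overlap on the query. Switching labels between
    consecutive anchors costs max_penalty * (1 - similarity), an assumed
    penalty form, so switches between similar samples are cheaper.
    """
    anchors = sorted(anchors)
    best = []
    for j, (qj, lj, labj) in enumerate(anchors):
        score = float(lj)  # option: start a new chain at anchor j
        for i in range(j):
            qi, li, labi = anchors[i]
            if qi + li > qj:
                continue  # anchors overlap on the query, cannot chain
            penalty = 0.0
            if labi != labj:
                sim = jaccard(sample_kmers[labi], sample_kmers[labj])
                penalty = max_penalty * (1.0 - sim)
            score = max(score, best[i] + lj - penalty)
        best.append(score)
    return max(best, default=0.0)


# Toy data: samples A and B share half of their k-mers, C shares none with A.
sample_kmers = {"A": {"x", "y", "z"}, "B": {"y", "z", "w"}, "C": {"p"}}
anchors = [(0, 5, "A"), (5, 5, "B"), (5, 5, "C")]
best_score = chain_with_label_change(anchors, sample_kmers)
```

With these toy inputs, extending the chain from A into the similar sample B is penalised less than extending into the unrelated sample C, so the A-to-B chain scores highest.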

Subjects

Algorithms; Sequence Alignment; Sequence Alignment/methods; Software; Computational Biology/methods; Sequence Analysis, DNA/methods; Databases, Genetic

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Algorithms / Sequence Alignment | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year of publication: 2024 | Document type: Article | Country of affiliation: Switzerland
