Syllable-PBWT for space-efficient haplotype long-match query.
Bioinformatics; 39(1), 2023-01-01.
Article in English | MEDLINE | ID: mdl-36440908
ABSTRACT
MOTIVATION: The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.
RESULTS: Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.
AVAILABILITY AND IMPLEMENTATION: https://github.com/ZhiGroup/Syllable-PBWT.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
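The three ingredients named in the abstract (syllables, positional prefix arrays built on the syllabic panel, and a polynomial rolling hash for substring comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the syllable width `B`, the hash constants `BASE` and `MOD`, and all function names are assumptions chosen for illustration, and alleles are assumed to be binary.

```python
from typing import List

B = 4                 # hypothetical syllable width, in sites
MOD = (1 << 61) - 1   # hypothetical hash modulus (a Mersenne prime)
BASE = 1_000_003      # hypothetical hash base


def to_syllables(hap: List[int], b: int = B) -> List[int]:
    """Pack each run of b binary alleles into one integer (zero-padded at the end)."""
    padded = hap + [0] * (-len(hap) % b)
    out = []
    for i in range(0, len(padded), b):
        value = 0
        for allele in padded[i:i + b]:
            value = (value << 1) | allele
        out.append(value)
    return out


def prefix_arrays(panel: List[List[int]]) -> List[List[int]]:
    """For each syllable position k, return the haplotype indices ordered by
    their reversed prefixes ending at k (one stable sort per column)."""
    order = list(range(len(panel)))
    arrays = []
    for k in range(len(panel[0])):
        # sorted() is stable, so ties keep the order from position k - 1.
        order = sorted(order, key=lambda i: panel[i][k])
        arrays.append(order)
    return arrays


def prefix_hashes(syllables: List[int]) -> List[int]:
    """Polynomial rolling hash of every prefix; h[j] covers syllables[0:j]."""
    h = [0]
    for s in syllables:
        h.append((h[-1] * BASE + s + 1) % MOD)  # +1 so a zero syllable still counts
    return h


def powers(n: int) -> List[int]:
    """BASE**j mod MOD for j = 0..n."""
    p = [1]
    for _ in range(n):
        p.append(p[-1] * BASE % MOD)
    return p


def substring_hash(h: List[int], pw: List[int], lo: int, hi: int) -> int:
    """Hash of syllables[lo:hi], computed in O(1) from the prefix hashes."""
    return (h[hi] - h[lo] * pw[hi - lo]) % MOD
```

With precomputed prefix hashes, two haplotypes can be tested for agreement over any syllable range by comparing two integers rather than scanning the range. Equal hashes indicate a match only with high probability, so a real implementation would have to account for collisions.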
Full text: 1
Collections: 01-internacional
Database: MEDLINE
Main subject: Algorithms / Genome
Language: En
Journal: Bioinformatics
Journal subject: MEDICAL INFORMATICS
Publication year: 2023
Document type: Article
Affiliation country: United States