How to optimally sample a sequence for rapid analysis.
Bioinformatics
; 39(2)2023 02 03.
Article
in En
| MEDLINE
| ID: mdl-36702468
MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Full text:
1
Collection:
01-internacional
Database:
MEDLINE
Main subject:
Algorithms
/
Software
Language:
En
Journal:
Bioinformatics
Journal subject:
INFORMATICA MEDICA
Year:
2023
Document type:
Article
Affiliation country:
Japan
Country of publication:
United kingdom