How to optimally sample a sequence for rapid analysis.
Frith, Martin C; Shaw, Jim; Spouge, John L.
Affiliation
  • Frith MC; Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan.
  • Shaw J; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan.
  • Spouge JL; Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan.
Bioinformatics; 39(2), 2023-02-03.
Article in En | MEDLINE | ID: mdl-36702468
MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.

RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.

AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
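
ILLUSTRATION: To make "sampling a subset of positions" concrete, the following is a minimal sketch of window minimizers, one of the earlier heuristic methods named in the abstract. It is not the approach presented in this article and is not taken from the noverlap source code. The function name `minimizer_positions`, the parameters `k` (word length) and `w` (number of consecutive words per window), and the use of plain lexicographic order to rank words are all assumptions made for illustration; practical tools often rank words by a hash instead.

```python
def minimizer_positions(seq, k, w):
    """Return sorted start positions of the k-mers sampled as window minimizers.

    Each window spans w consecutive k-mers. In every window the
    lexicographically smallest k-mer is sampled (leftmost on ties).
    Illustrative sketch only; names and ordering are assumptions.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to hold a single full window
    kmers = [seq[i:i + k] for i in range(n_kmers)]
    chosen = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Rank by (k-mer, position) so ties resolve to the leftmost k-mer.
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add(best)
    return sorted(chosen)


if __name__ == "__main__":
    # Sample 3-mers from a short DNA string using windows of 4 k-mers.
    print(minimizer_positions("ACGTACGTGA", k=3, w=4))
```

By construction, every window of `w` consecutive k-mers contains at least one sampled position, which is the kind of "at least once" guarantee the abstract associates with universal hitting sets.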
Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Algorithms / Software | Language: En | Journal: Bioinformatics | Journal subject: Medical Informatics | Year: 2023 | Document type: Article | Affiliation country: Japan | Country of publication: United Kingdom
