Designing efficient randstrobes for sequence similarity analyses.

Karami, Moein; Soltani Mohammadi, Aryan; Martin, Marcel; Ekim, Baris; Shen, Wei; Guo, Lidong; Xu, Mengyang; Pibiri, Giulio Ermanno; Patro, Rob; Sahlin, Kristoffer

Affiliation

Karami M; Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden.
Soltani Mohammadi A; Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden.
Martin M; Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Solna SE-17121, Sweden.
Ekim B; Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, United States.
Shen W; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.
Guo L; Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China.
Xu M; BGI Research, Qingdao 266555, China.
Pibiri GE; BGI Research, Shenzhen 518083, China.
Patro R; Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University of Venice, Venice 30172, Italy.
Sahlin K; ISTI-CNR, Pisa 56124, Italy.

Bioinformatics; 40(4), 2024 Mar 29.

Article in En | MEDLINE | ID: mdl-38579261

ABSTRACT

MOTIVATION:

Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has led to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy of different operators.
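To make the idea of a randstrobe concrete, the following is a minimal sketch of order-2 randstrobe construction. It is an illustration under stated assumptions, not the construction evaluated in the paper: the function names (`kmer_hash`, `randstrobes`), the parameters (`k`, `w_min`, `w_max`), the BLAKE2b-based hash, and the XOR link operator are all hypothetical choices made for this example. The first strobe is the k-mer at position i; the second strobe is the k-mer in a downstream window that minimizes the link value.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash function is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def randstrobes(seq: str, k: int, w_min: int, w_max: int):
    """Return (pos1, pos2, seed) tuples for order-2 randstrobes.

    For each start position i, the second strobe is picked from the window
    [i + w_min, i + w_max] as the k-mer minimizing hash(s1) XOR hash(s2).
    The XOR link is one illustrative operator among several possible ones.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    seeds = []
    for i in range(len(hashes)):
        lo = i + w_min
        hi = min(i + w_max, len(hashes) - 1)
        if lo > hi:
            break  # no room left for a second strobe
        # Naive linear scan of the window; ties are broken by position.
        j = min(range(lo, hi + 1), key=lambda p: (hashes[i] ^ hashes[p], p))
        seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return seeds


if __name__ == "__main__":
    for pos1, pos2, seed in randstrobes("ACGTTGCATGCCGATAGCTAGGCTA", 3, 4, 8):
        print(pos1, pos2, seed)
```

This sketch scans every window linearly, so its cost per seed grows with the window size. Because the two strobes are separated by a gap of variable length, a mutation that falls between them leaves the seed unchanged, which is why such seeds can match across small differences that break a contiguous k-mer.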

RESULTS:

In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

AVAILABILITY AND IMPLEMENTATION:

All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

Full text: 1 Database: MEDLINE Language: En Year: 2024 Type: Article