Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis.
Yi, Huiguang; Lin, Yanling; Lin, Chengqi; Jin, Wenfei.
Affiliation
  • Yi H; Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, 518055, Guangdong, China.
  • Lin Y; Institute of Life Sciences, Southeast University, Nanjing, 210096, Jiangsu, China.
  • Lin C; Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, 518055, Guangdong, China.
  • Jin W; Institute of Life Sciences, Southeast University, Nanjing, 210096, Jiangsu, China.
Genome Biol; 22(1): 84, 2021 03 16.
Article in En | MEDLINE | ID: mdl-33726811
ABSTRACT
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Software / Computational Biology / Metagenomics
Language: En
Journal: Genome Biol
Journal subject: Molecular Biology / Genetics
Year: 2021
Document type: Article
Affiliation country: China