RESUMO
With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. pathMap iteratively searches the longest path in the remaining nodes; more candidate chains with high quality can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experiment results on simulated and real-life datasets demonstrate that pathMap obtains the number of mapped chains at least 11.50% more than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Sequenciamento por Nanoporos , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Genoma , Software , AlgoritmosRESUMO
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is powerful to study and probe the large-scale sequence data. A number of available clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared and the evaluation metrics heavily rely on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseq2, Linclust, edClust) are assessed; ii) two alignment-free methods (e.g., LZW-Kernel and Mash) are included to compare with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analyzers in choosing one reasonable clustering algorithm for processing their collected sequences, and furthermore, motivate algorithm designers to develop more efficient sequence clustering approaches.
Assuntos
Algoritmos , Genômica , Análise por Conglomerados , Sequenciamento de Nucleotídeos em Larga EscalaRESUMO
With the rapid development of single molecular sequencing (SMS) technologies such as PacBio single-molecule real-time and Oxford Nanopore sequencing, the output read length is continuously increasing, which has dramatical potentials on cutting-edge genomic applications. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. However, these long reads contain higher sequencing errors and could more frequently span the breakpoints of structural variants (SVs) than those of shorter reads, leading to many unaligned reads or reads that are partially aligned for most state-of-the-art mappers. As a result, these methods usually focus on producing local mapping results for the query read rather than obtaining the whole end-to-end alignment. We introduce kngMap, a novel k-mer neighborhood graph-based mapper that is specifically designed to align long noisy SMS reads to a reference sequence. By benchmarking exhaustive experiments on both simulated and real-life SMS datasets to assess the performance of kngMap with ten other popular SMS mapping tools (e.g., BLASR, BWA-MEM, and minimap2), we demonstrated that kngMap has higher sensitivity that can align more reads and bases to the reference genome; meanwhile, kngMap can produce consecutive alignments for the whole read and span different categories of SVs in the reads. kngMap is implemented in C++ and supports multi-threading; the source code of kngMap can be downloaded for free at: https://github.com/zhang134/kngMap for academic usage.