ABSTRACT
With the rapid development of single-molecule sequencing (SMS) technologies, output read lengths are continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern, since a highly sensitive mapper detects more aligned regions on the reference and obtains more aligned bases, both of which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper specifically designed for mapping SMS reads with high sensitivity. By viewing an alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in a directed graph. Because pathMap iteratively searches for the longest path among the remaining nodes, more high-quality candidate chains can be effectively detected and aligned. Compared with other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experimental results on simulated and real-life datasets demonstrate that pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequencing errors and more sensitive in species- and strain-specific identification of pathogens using MinION reads.
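The idea of chaining as repeated longest-path selection can be illustrated with a minimal sketch. It is not pathMap's implementation: the anchor representation `(read_pos, ref_pos)`, the edge rule (an anchor may follow another only if it lies strictly later on both the read and the reference), and the `min_anchors` cutoff are all assumptions made for illustration, and a real mapper would also score gaps and strands.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # hypothetical anchor: (read_pos, ref_pos) of a matched k-mer


def longest_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Longest path in the DAG where a -> b iff b is strictly after a
    on both the read and the reference (assumed edge rule)."""
    if not anchors:
        return []
    nodes = sorted(anchors)
    best = [1] * len(nodes)   # best[i]: number of anchors on the longest path ending at i
    prev = [-1] * len(nodes)  # back-pointers for path recovery
    for j, (rj, fj) in enumerate(nodes):
        for i in range(j):
            ri, fi = nodes[i]
            if ri < rj and fi < fj and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i
    end = max(range(len(nodes)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return path[::-1]


def iterative_chains(anchors: List[Anchor], min_anchors: int = 2) -> List[List[Anchor]]:
    """Repeatedly extract the longest path from the anchors not yet used."""
    remaining = set(anchors)
    chains = []
    while remaining:
        chain = longest_chain(list(remaining))
        if len(chain) < min_anchors:  # assumed stopping rule
            break
        chains.append(chain)
        remaining -= set(chain)
    return chains


anchors = [(0, 100), (5, 500), (10, 110), (15, 510), (20, 120), (30, 130)]
for chain in iterative_chains(anchors):
    print(chain)
```

In this toy input the first pass recovers the four collinear anchors near reference position 100, and the second pass recovers the two anchors near position 500 that a single-chain approach would discard.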
Subjects
High-Throughput Nucleotide Sequencing , Nanopore Sequencing , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Genome , Software , Algorithms
ABSTRACT
MOTIVATION: Longer reads produced by PacBio or Oxford Nanopore sequencers span the breakpoints of structural variations (SVs) more frequently than shorter reads. As a result, existing long-read mapping methods often generate wrong alignments and variant calls. Compared with deletions and insertions, inversion events are more difficult to detect, since the anchors in inversion regions are not collinear with those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm named invMap. RESULTS: For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages. The experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION: The invMap software is available at https://github.com/zhang134/invMap.git.
Subjects
Genomics , High-Throughput Nucleotide Sequencing , Humans , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Human Genome , Chromosome Inversion , DNA Sequence Analysis/methods
ABSTRACT
BACKGROUND: Single-molecule sequencing (SMS) technology produces longer reads, but with a higher sequencing error rate. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. Most existing mapping tools adopt the traditional seed-and-extend strategy, and the candidate aligned regions for each query read are selected either by counting the number of matched seeds or by chaining a group of seeds. However, for all the existing mapping tools, the coverage ratio of the alignment region to the query read is low, and the read alignment quality and efficiency need to be improved. Here, we introduce smsMap, a novel mapping tool specifically designed to map long SMS reads to a reference genome. RESULTS: smsMap was compared with seven existing SMS mapping tools (e.g., BLASR, minimap2, and BWA-MEM) on both simulated and real-life SMS datasets. The experimental results show that smsMap efficiently achieves a higher aligned read coverage ratio and has higher sensitivity, aligning more sequences and bases to the reference genome. Additionally, smsMap is more robust to sequencing errors. CONCLUSIONS: smsMap is computationally efficient for aligning SMS reads, especially for larger reference genomes (e.g., the H. sapiens genome with over 3 billion base pairs). The source code of smsMap can be freely downloaded from https://github.com/NWPU-903PR/smsMap .
Subjects
Sequence Alignment , DNA Sequence Analysis/methods , Software , Algorithms , Computer Simulation , Genetic Databases , Escherichia coli/genetics , Humans , Time Factors
ABSTRACT
BACKGROUND: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Because of its extremely wide range of application areas, fast sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of subsequent analysis tools. Although several simulators (e.g., PBSIM, SimLoRD and FASTQSim) target the generation of PacBio libraries, the error rate of their simulated sequences is not well matched to the quality values of raw PacBio datasets, especially for PacBio's continuous long reads (CLR). RESULTS: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator (called NPBSS) for producing CLR reads. NPBSS first samples the read sequences according to a log-normal read length distribution and chooses different base quality values with different proportions. Then, NPBSS computes the overall error probability of each base in the read sequence with an empirical model, and derives the deletion, substitution and insertion probabilities from the overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of PacBio CLR reads better than PBSIM and FASTQSim. In addition, the assembly results show that the simulated sequences of NPBSS are more like real PacBio CLR data. CONCLUSION: NPBSS is convenient to use, with efficient computation and flexible parameter settings. The PacBio CLR reads it generates are more like real PacBio datasets.
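The simulation steps described above (log-normal read lengths, quality values drawn with fixed proportions, a per-base error probability split into deletion, substitution and insertion) can be sketched as follows. This is only an illustration under stated assumptions: the standard Phred relation is swapped in for NPBSS's empirical error model, which the abstract does not specify, and the log-normal parameters, quality proportions and error-type ratios are hypothetical placeholders.

```python
import random


def sample_read_length(rng, mu=8.5, sigma=0.6):
    """Draw a read length from a log-normal distribution (mu, sigma are hypothetical)."""
    return max(1, int(rng.lognormvariate(mu, sigma)))


def sample_qualities(rng, length, values=(5, 10, 15), weights=(0.2, 0.5, 0.3)):
    """Choose base quality values with fixed proportions (values and weights are hypothetical)."""
    return rng.choices(values, weights=weights, k=length)


def phred_error_prob(quality):
    """Standard Phred relation, used here as a stand-in for the empirical model."""
    return 10 ** (-quality / 10)


def split_error(p_err, ratios=(0.3, 0.1, 0.6)):
    """Split the overall error into (deletion, substitution, insertion); ratios are hypothetical."""
    total = sum(ratios)
    return tuple(p_err * r / total for r in ratios)


def simulate_read(template, quals, rng):
    """Apply at most one error event per template base."""
    out = []
    for base, q in zip(template, quals):
        p_del, p_sub, p_ins = split_error(phred_error_prob(q))
        u = rng.random()
        if u < p_del:
            continue  # deletion: drop the base
        if u < p_del + p_sub:
            out.append(rng.choice([b for b in "ACGT" if b != base]))  # substitution
            continue
        out.append(base)
        if u < p_del + p_sub + p_ins:
            out.append(rng.choice("ACGT"))  # insertion after the base
    return "".join(out)


rng = random.Random(0)
template = "ACGT" * 25
read = simulate_read(template, sample_qualities(rng, len(template)), rng)
print(len(template), len(read))
```

Drawing a single uniform number per base keeps the three error types mutually exclusive, so the total per-base error rate equals the modeled error probability.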
Subjects
Genome/genetics , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Humans
ABSTRACT
The recent sequencing revolution driven by high-throughput technologies has led to rapid accumulation of 16S rRNA sequences for microbial communities. Clustering short sequences into operational taxonomic units (OTUs) is a crucial initial step in analyzing metagenomic data. Although many heuristic methods with low computational complexity have been proposed for OTU inference, they select just one sequence as the seed for each cluster, so the results are sensitive to the sequences selected to represent the clusters. To address this issue, we present a de Bruijn graph-based heuristic clustering method (DBH) for clustering massive 16S rRNA sequences into OTUs, which introduces a novel seed selection strategy and a greedy clustering approach. Compared with existing widely used methods on several simulated and real-life metagenomic datasets, DBH shows higher clustering performance and low memory usage, mitigating the overestimation of the number of OTUs. DBH is also more effective in handling large-scale metagenomic datasets. The DBH software can be freely downloaded from https://github.com/nwpu134/DBH.git for academic users.
Subjects
Heuristics , Metagenomics/methods , 16S Ribosomal RNA/genetics , Algorithms , Cluster Analysis , Computational Biology/methods , Gastrointestinal Microbiome/genetics , Humans , Phylogeny , Bacterial RNA/genetics , DNA Sequence Analysis/methods
ABSTRACT
The inference of gene regulatory networks (GRNs) from expression data can uncover direct regulations among genes and provide deep insights into biological processes at the network level. Over the past decades, numerous computational approaches have been introduced for inferring GRNs. However, many of them still suffer from various problems: Bayesian network (BN) methods cannot handle large-scale networks because of their high computational complexity, while information theory-based methods cannot identify the directions of regulatory interactions and also suffer from false positive and false negative problems. To overcome these limitations, we present a novel algorithm, the local Bayesian network (LBN), which infers GRNs from gene expression data using a network decomposition strategy and a false-positive edge elimination scheme. Specifically, the LBN algorithm first uses conditional mutual information (CMI) to construct an initial network, which is decomposed into a number of local networks. Then, the BN method is employed to generate a series of local BNs by selecting the k-nearest neighbors of each gene as its candidate regulatory genes, which significantly reduces the exponential search space of all possible GRN structures. These local BNs are integrated into a tentative network by performing CMI, which reduces redundant regulations in the GRN and thus alleviates the false positive problem. The final network is obtained by iteratively performing CMI and local BN on the tentative network; in this iterative process, false or redundant regulations are gradually removed. When tested on the benchmark GRN datasets from the DREAM challenge as well as the SOS DNA repair network in E. coli, LBN significantly outperforms other state-of-the-art methods (ARACNE, GENIE3 and NARROMI), with more accurate and robust performance.
In particular, the decomposition strategy with local Bayesian networks not only effectively reduces the computational cost of BN, owing to the much smaller sizes of the local GRNs, but also identifies the directions of the regulations.
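To make the role of CMI in removing redundant regulations concrete, the following is a minimal sketch under stated assumptions: expression values are already discretized, CMI is estimated with a simple plug-in (count-based) estimator, and only first-order conditioning on a single third gene is used. The estimator LBN actually uses is not given in the abstract, and the function names and the `threshold` value are hypothetical.

```python
from collections import Counter
from math import log


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in nats from three equally long discrete sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    return sum(
        c / n * log(n_z[zi] * c / (n_xz[xi, zi] * n_yz[yi, zi]))
        for (xi, yi, zi), c in n_xyz.items()
    )


def prune_indirect_edges(edges, data, threshold=0.05):
    """Keep edge (a, b) only if no single third gene c explains it away,
    i.e. I(a;b|c) stays at or above the (hypothetical) threshold for every c."""
    kept = []
    for a, b in edges:
        others = [g for g in data if g not in (a, b)]
        if all(
            conditional_mutual_information(data[a], data[b], data[c]) >= threshold
            for c in others
        ):
            kept.append((a, b))
    return kept


# Toy data: g1 and g3 are both functions of g2, so g1-g3 is an indirect association.
g2 = [0, 1, 2, 3, 0, 1, 2, 3]
data = {
    "g1": [v // 2 for v in g2],
    "g2": g2,
    "g3": [1 if v >= 1 else 0 for v in g2],
}
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3")]
print(prune_indirect_edges(edges, data))
```

In the toy data the g1-g3 association vanishes once g2 is conditioned on, so that edge is dropped while the two direct edges survive. CMI alone leaves the remaining edges undirected, which is the gap the local Bayesian networks are described as filling.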
Subjects
Computational Biology/methods , Gene Regulatory Networks/genetics , Genetic Models , Algorithms , Bayes Theorem , Escherichia coli/genetics , Escherichia coli/metabolism , Gene Expression Profiling
ABSTRACT
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This advancement has resulted in a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data, and a number of clustering methods have been developed in the last decade. Although numerous comparison studies have been published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study of sequence clustering methods. Specifically, i) alignment-based clustering algorithms, including classical methods (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust), are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included for comparison with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and on the input data itself (unsupervised metrics) are applied to quantify the clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.
Subjects
Algorithms , Genomics , Cluster Analysis , High-Throughput Nucleotide Sequencing
ABSTRACT
The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of the number of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) that groups similar sequences using Edlib, a C/C++ library for fast, exact semi-global sequence alignment. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
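The general shape of greedy, seed-based heuristic clustering with an edit-distance criterion can be sketched as below. This is not edClust itself: a plain global Levenshtein distance written in pure Python stands in for Edlib's semi-global alignment, and the longest-first ordering, the first-matching-seed rule, and the `max_dist_frac` parameter are assumptions borrowed from the usual CD-HIT-style scheme.

```python
def edit_distance(a, b):
    """Global Levenshtein distance (a stand-in for Edlib's alignment)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def greedy_cluster(seqs, max_dist_frac=0.03):
    """Assign each sequence to the first seed within the distance cutoff,
    or make it a new seed. Sequences are processed longest first (assumed)."""
    clusters = {}  # seed sequence -> list of member sequences
    for s in sorted(seqs, key=len, reverse=True):
        for seed in clusters:
            if edit_distance(s, seed) <= max_dist_frac * len(seed):
                clusters[seed].append(s)
                break
        else:
            clusters[s] = [s]  # no seed was close enough
    return clusters


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
for seed, members in greedy_cluster(seqs, max_dist_frac=0.1).items():
    print(seed, members)
```

Each sequence is compared only against the current seeds rather than against every other sequence, which is what keeps heuristic clustering cheap. The cost is that the result depends on which sequences happen to become seeds.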
Subjects
Heuristics , Software , Algorithms , Cluster Analysis , Sequence Alignment
ABSTRACT
With the rapid development of single-molecule sequencing (SMS) technologies such as PacBio single-molecule real-time and Oxford Nanopore sequencing, the output read length is continuously increasing, which offers great potential for cutting-edge genomic applications. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. However, these long reads contain more sequencing errors and span the breakpoints of structural variants (SVs) more frequently than shorter reads, so most state-of-the-art mappers leave many reads unaligned or only partially aligned. As a result, these methods usually focus on producing local mapping results for the query read rather than a whole end-to-end alignment. We introduce kngMap, a novel k-mer neighborhood graph-based mapper specifically designed to align long noisy SMS reads to a reference sequence. Through exhaustive benchmarking experiments on both simulated and real-life SMS datasets against ten other popular SMS mapping tools (e.g., BLASR, BWA-MEM, and minimap2), we demonstrate that kngMap has higher sensitivity and aligns more reads and bases to the reference genome; moreover, kngMap produces consecutive alignments for the whole read and spans different categories of SVs in the reads. kngMap is implemented in C++ and supports multi-threading; the source code can be downloaded for free for academic use at https://github.com/zhang134/kngMap.
ABSTRACT
With the advent of next-generation sequencing technology, it has become convenient and cost-efficient to thoroughly characterize the microbial diversity and taxonomic composition of various environmental samples. Millions of sequences can be generated, and how to utilize this enormous sequence resource has become a critical concern for microbial ecologists. One particular challenge is picking OTUs (operational taxonomic units) in 16S rRNA sequence analysis. Fortunately, this challenge can be directly addressed by sequence clustering, which attempts to group similar sequences. Numerous clustering methods have therefore been proposed to cluster 16S rRNA sequences into OTUs. However, each method has its own clustering mechanism, and different methods produce diverse outputs. Even a slight parameter change for the same method can generate distinct results, so choosing an appropriate method has become a challenge for inexperienced users. A lot of time and resources can be wasted in selecting clustering tools and analyzing the clustering results. In this study, we review recent advances in clustering methods for OTU picking, focusing on three aspects: (i) the principles of existing clustering algorithms, (ii) benchmark dataset construction for OTU picking and evaluation metrics, and (iii) the performance of different methods with various distance thresholds on benchmark datasets. This paper aims to assist biological researchers in selecting reasonable clustering methods for analyzing their collected sequences and to help algorithm developers design more efficient sequence clustering methods.
ABSTRACT
Next-generation sequencing (NGS)-based 16S rRNA sequencing, which combines PCR amplification with NGS technology, is a cost-effective technique that has been successfully used to study the phylogeny and taxonomy of samples from complex microbiomes or environments. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is often the first step for many downstream analyses. Heuristic clustering is one of the most widely employed approaches for generating OTUs. However, most heuristic OTU clustering methods select just one seed sequence to represent each cluster, so their outcomes suffer from either overestimation of the number of OTUs or sensitivity to sequencing errors. In this paper, we present a novel dynamic multi-seeds clustering method (namely DMSC) to pick OTUs. DMSC first heuristically generates clusters according to the distance threshold. When the size of a cluster reaches the pre-defined minimum size, DMSC selects the multi-core sequences (MCS) as the seeds, defined as n-core sequences (n ≥ 3) in which the distance between any two sequences is less than the distance threshold. A new sequence is assigned to the corresponding cluster depending on its average distance to the MCS and the standard deviation of distances within the MCS. Whenever a new sequence is added to the cluster, the MCS is dynamically updated, until no more sequences are merged into the cluster. DMSC was tested on several simulated and real-life sequence datasets and compared with traditional heuristic methods such as CD-HIT, UCLUST, and DBH. Experimental results in terms of the inferred number of OTUs, normalized mutual information (NMI) and Matthews correlation coefficient (MCC) demonstrate that DMSC produces higher-quality clusters with low memory usage and reduces OTU overestimation. Additionally, DMSC is robust to sequencing errors. The DMSC software can be freely downloaded from https://github.com/NWPU-903PR/DMSC.
ABSTRACT
Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial step in analyzing metagenomic data. Although many methods have been developed, obtaining an appropriate balance between clustering accuracy and computational efficiency is still a major challenge. This paper proposes a novel density-based modularity clustering method, called DMclust, to bin 16S rRNA sequences into OTUs with high clustering accuracy. The DMclust algorithm consists of four main phases. First, it searches for sequence dense groups, each defined as an n-sequence community in which the distance between any two sequences is less than a threshold. Second, these dense groups are used to construct a weighted network, where dense groups are viewed as nodes, each pair of dense groups is connected by an edge, and the distance between the two groups is the weight of the edge. Third, a modularity-based community detection method is employed to generate the preclusters. Finally, the remaining sequences are assigned to their nearest preclusters to form OTUs. Compared with existing widely used methods, the experimental results on several metagenomic datasets show that DMclust achieves more accurate clustering with acceptable memory usage.
Subjects
Algorithms , Computational Biology/methods , 16S Ribosomal RNA/genetics , DNA Sequence Analysis/methods , Cluster Analysis , Humans , Nucleic Acid Conformation
ABSTRACT
Protein-protein interactions (PPIs) play a key role in many cellular processes. Uncovering PPIs and their function within the cell is a challenge of post-genomic biology; it will improve our understanding of disease and help in the development of novel methods for disease diagnosis and forensics. The experimental methods currently used to identify PPIs are both time-consuming and expensive, and high-throughput experimental results have shown high rates of both false positives and false negatives for protein interactions. These obstacles could be overcome by developing computational approaches to predict PPIs and validate the obtained experimental results. In this work, we describe recent advances in predicting protein-protein interactions from the following aspects: i) benchmark dataset construction, ii) sequence representation approaches, iii) common machine learning algorithms, and iv) cross-validation test methods and assessment metrics.
Subjects
Machine Learning , Proteins/chemistry , Algorithms , High-Throughput Screening Assays , Protein Binding , Proteins/genetics
ABSTRACT
The recent sequencing revolution driven by high-throughput technologies has led to rapid accumulation of 16S rRNA sequences for microbial communities. Clustering short sequences into operational taxonomic units (OTUs) is a crucial initial step in analyzing metagenomic data. Although many methods have been proposed for OTU inference, a major challenge is the balance between inference accuracy and computational efficiency. To address this challenge, we present a novel motif-based hierarchical method (namely MtHc) for clustering massive 16S rRNA sequences into OTUs with high clustering accuracy and low memory usage. Suppose all the 16S rRNA sequences are used to construct a complete weighted network, where sequences are viewed as nodes, each pair of sequences is connected by an imaginary edge, and the distance between a pair of sequences is the weight of the edge. MtHc consists of three main phases. First, it heuristically searches for motifs, each defined as an n-node sub-graph (in the present study, n = 3, 4, 5) in which the distance between any two nodes is less than a threshold. Second, it uses each motif as a seed to form a candidate cluster by computing the distances of the other sequences to the motif. Finally, it hierarchically merges the candidate clusters to generate the OTUs, calculating only the distances between the motifs of two clusters. Compared with existing methods on several simulated and real-life metagenomic datasets, we demonstrate that MtHc has higher clustering performance, lower memory usage and robustness to parameter settings, and that it handles large-scale metagenomic datasets more effectively. The MtHc software can be freely downloaded by academic users.
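The first two phases described above (finding a motif, then using it as a seed) can be sketched as follows. The sketch rests on several assumptions: the per-position mismatch fraction over equal-length sequences is a hypothetical stand-in for whatever sequence distance MtHc uses, the motif search is a simple greedy scan, and assignment uses the average distance to the motif members. The hierarchical merging phase is not shown.

```python
def mismatch_fraction(a, b):
    """Fraction of differing positions; assumes equal-length sequences (hypothetical distance)."""
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)


def find_motif(seqs, dist, threshold, n=3):
    """Greedily grow a group of n sequence indices whose pairwise distances
    are all below the threshold; return None if no such group is found."""
    for start in range(len(seqs)):
        motif = [start]
        for cand in range(len(seqs)):
            if cand in motif:
                continue
            if all(dist(seqs[cand], seqs[m]) < threshold for m in motif):
                motif.append(cand)
                if len(motif) == n:
                    return motif
    return None


def assign_to_motif(seqs, motif, dist, threshold):
    """Indices of sequences whose average distance to the motif is below the threshold."""
    members = []
    for i, s in enumerate(seqs):
        avg = sum(dist(s, seqs[m]) for m in motif) / len(motif)
        if avg < threshold:
            members.append(i)
    return members


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATA", "AAAAAAATAA", "CCCCCCCCCC"]
motif = find_motif(seqs, mismatch_fraction, threshold=0.25)
print(motif, assign_to_motif(seqs, motif, mismatch_fraction, threshold=0.25))
```

Comparing a new sequence against several mutually close seed sequences, rather than a single seed, makes the assignment less dependent on any one representative.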
Subjects
Gastrointestinal Microbiome/genetics , Molecular Typing/methods , 16S Ribosomal RNA/genetics , Algorithms , Cluster Analysis , Humans , Bacterial RNA/genetics , RNA Sequence Analysis
ABSTRACT
With the development of high-throughput, low-cost sequencing technology, a large number of marine microbial sequences have been generated. The association patterns between marine microbial species and environmental factors are hidden in this large volume of sequences, and mining these patterns is beneficial for exploiting marine resources. However, very few marine microbial association patterns have been well investigated. The present study reports the development of a novel method, called HC-sNMF, to detect marine microbial association patterns. The results show that the marine microbial association networks of the four seasons have the characteristics of complex networks, that the same environmental factor influences different species in different seasons, and that, in the communities detected for all four seasons, correlations among OTUs (taxa) are stronger than correlations between OTUs and environmental factors.