ABSTRACT
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Subject(s)
Metagenomics, Metagenomics/methods, Computational Biology/methods, Metagenome, Algorithms, Genomics/methods
ABSTRACT
SUMMARY: With recent advances in sequencing technologies, it is now possible to obtain near-perfect complete bacterial chromosome assemblies cheaply and efficiently by combining a long-read-first assembly approach with short-read polishing. However, existing methods for assembling bacterial plasmids from long-read-first assemblies often misassemble or even miss bacterial plasmids entirely and accordingly require manual curation. Plassembler was developed to provide a tool that automatically assembles and outputs bacterial plasmids using a hybrid assembly approach. It achieves increased accuracy and computational efficiency compared to the existing gold standard tool Unicycler by removing chromosomal reads from the input read sets using a mapping approach. AVAILABILITY AND IMPLEMENTATION: Plassembler is implemented in Python and is installable as a bioconda package using 'conda install -c bioconda plassembler'. The source code is available on GitHub at https://github.com/gbouras13/plassembler. The full benchmarking pipeline can be found at https://github.com/gbouras13/plassembler_simulation_benchmarking, while the benchmarking input FASTQ and output files can be found at https://doi.org/10.5281/zenodo.7996690.
Subject(s)
High-Throughput Nucleotide Sequencing, Software, Sequence Analysis, DNA/methods, High-Throughput Nucleotide Sequencing/methods, Plasmids/genetics, Benchmarking
ABSTRACT
MOTIVATION: Microbial communities have a profound impact on both human health and various environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of challenges in viral assembly, fragmentation of genomes can occur, and existing tools may recover incomplete genome fragments. Therefore, the identification and characterization of novel phage genomes remain a challenge, leading to the need for improved approaches for phage genome recovery. RESULTS: We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results on viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. AVAILABILITY AND IMPLEMENTATION: Phables is available on GitHub at https://github.com/Vini2/phables.
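As a rough illustration of what "flow decomposition" means in the abstract above, the sketch below splits a conserved flow on a small acyclic graph into weighted source-to-sink paths. It is a generic greedy decomposition, not Phables' actual model: the function name `decompose_flow`, the node names, and the coverage values are all made up for the example, and real phage components (which can be circular) need more care.

```python
from collections import defaultdict


def decompose_flow(edges, source, sink):
    """Greedily split a conserved flow on a DAG into weighted paths.

    edges maps (u, v) -> non-negative flow on that edge.
    Returns a list of (path, weight) pairs.
    """
    residual = dict(edges)
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)

    paths = []
    while True:
        # Walk from the source along edges that still carry flow.
        path, node = [source], source
        while node != sink:
            nxt = next(
                (v for v in successors[node] if residual[(node, v)] > 0), None
            )
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            break  # no flow left to route from the source

        # The path carries as much flow as its weakest edge allows.
        steps = list(zip(path, path[1:]))
        weight = min(residual[step] for step in steps)
        for step in steps:
            residual[step] -= weight
        paths.append((path, weight))
    return paths


# Hypothetical component: two variants share the segments a and d.
toy_edges = {
    ("s", "a"): 10,
    ("a", "b"): 6,
    ("a", "c"): 4,
    ("b", "d"): 6,
    ("c", "d"): 4,
    ("d", "t"): 10,
}
toy_paths = decompose_flow(toy_edges, "s", "t")
```

Here the two recovered paths differ only in the middle segment (`b` versus `c`), which is the kind of variant structure the abstract says single-sequence tools cannot separate.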
Subject(s)
Bacteriophages, Humans, Bacteriophages/genetics, Genome, Viral, Genomics, Metagenome, Metagenomics/methods, Bacteria/genetics
ABSTRACT
MOTIVATION: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. RESULTS: We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ~13% improvement in F1-score and ~30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
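The "oligonucleotide composition" feature mentioned above can be pictured as a normalized k-mer frequency vector per read. The sketch below is a minimal version under stated assumptions: k = 3, reverse-complement k-mers merged into one canonical form, and k-mers containing ambiguous bases skipped. The function names are hypothetical and this is not MetaBCC-LR's code.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def composition_vector(seq, k=3):
    """Normalized canonical k-mer frequencies of seq, in sorted k-mer order."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers with N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


read = "ACGTTGCAAGGCTTACGATCGGA"  # made-up read fragment
vector = composition_vector(read)
```

Because strands are merged, a read and its reverse complement map to the same vector, so reads from either strand of one genome can cluster together.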
Subject(s)
Algorithms, Metagenomics, Metagenome, Sequence Analysis, DNA, Software
ABSTRACT
MOTIVATION: The field of metagenomics has provided valuable insights into the structure, diversity and ecology within microbial communities. One key step in metagenomics analysis is to assemble reads into longer contigs which are then binned into groups of contigs that belong to different species present in the metagenomic sample. Binning of contigs plays an important role in metagenomics and most available binning algorithms bin contigs using genomic features such as oligonucleotide/k-mer composition and contig coverage. As metagenomic contigs are derived from the assembly process, they are output from the underlying assembly graph which contains valuable connectivity information between contigs that can be used for binning. RESULTS: We propose GraphBin, a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools. We show that GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach. Moreover, we demonstrate improved experimental results from GraphBin in terms of identifying mis-binned contigs and binning of contigs discarded by existing binning tools. To the best of our knowledge, this is the first time that the information from the assembly graph has been used in a tool for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION: The source code of GraphBin is available at https://github.com/Vini2/GraphBin. CONTACT: vijini.mallawaarachchi@anu.edu.au or yu.lin@anu.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
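To make the label propagation idea concrete, here is a small sketch that spreads bin labels from already-binned contigs to unbinned neighbours in an assembly graph. It is a simplified majority-vote scheme with synchronous rounds, not the algorithm implemented in GraphBin; `propagate_labels`, the contig names, and the toy graph are assumptions for illustration.

```python
from collections import Counter


def propagate_labels(adjacency, labels, max_rounds=100):
    """Assign unlabeled nodes the strict-majority label of their labeled neighbours.

    adjacency maps node -> list of neighbouring nodes.
    labels maps node -> bin label for the nodes binned so far.
    Nodes whose neighbours are tied stay unlabeled.
    """
    labels = dict(labels)
    for _ in range(max_rounds):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if not votes:
                continue
            ranked = votes.most_common(2)
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                updates[node] = ranked[0][0]
        if not updates:
            break  # nothing changed in this round, so we are done
        labels.update(updates)
    return labels


# Toy assembly graph: a chain c1..c5 with c6 hanging off c5.
toy_graph = {
    "c1": ["c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2", "c4"],
    "c4": ["c3", "c5"],
    "c5": ["c4", "c6"],
    "c6": ["c5"],
}
initial_bins = {"c1": "A", "c5": "B"}
refined_bins = propagate_labels(toy_graph, initial_bins)
```

In this toy case c2 joins bin A and c4 and c6 join bin B, while c3 sits between the two bins with tied votes and is left unbinned rather than guessed.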
Subject(s)
Metagenome, Microbiota, Algorithms, Metagenomics, Sequence Analysis, DNA, Software
ABSTRACT
BACKGROUND: Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation has not been reported, which, however, is extremely important for accurate metaproteomic analysis. RESULTS: Herein, we proposed an accurate taxonomic annotation pipeline for genes from metagenomic data, namely contigs directed gene annotation (ConDiGA), and used the method to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA or MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes were directly annotated against the whole bacterial genome database; in MD2, contigs were annotated against the whole bacterial genome database and the taxonomic information of contigs was assigned to the genes; in MD3, the most confident species from the contigs annotation results were taken as reference to annotate genes. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. Based on a synthetic microbial community of 12 species, it was found that Kaiju with the MD3 pipeline outperformed the others in the construction of protein sequence database from metagenomic data. Similar performance was also observed with a fecal sample, as well as in silico mixed datasets of the simulated microbial community and the fecal sample. CONCLUSIONS: Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. 
Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence databases and can promote in-depth metaproteomic analysis of microbiomes. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The code of ConDiGA is open access on GitHub for the analysis of microbiota samples.
Subject(s)
Microbiota, Humans, Databases, Protein, Molecular Sequence Annotation, Reproducibility of Results, Microbiota/genetics, Metagenome/genetics, Bacteria/genetics, Metagenomics/methods
ABSTRACT
Improvements in the accuracy and availability of long-read sequencing mean that complete bacterial genomes are now routinely reconstructed using hybrid (i.e. short- and long-reads) assembly approaches. Complete genomes allow a deeper understanding of bacterial evolution and genomic variation beyond single nucleotide variants. They are also crucial for identifying plasmids, which often carry medically significant antimicrobial resistance genes. However, small plasmids are often missed or misassembled by long-read assembly algorithms. Here, we present Hybracter which allows for the fast, automatic and scalable recovery of near-perfect complete bacterial genomes using a long-read first assembly approach. Hybracter can be run either as a hybrid assembler or as a long-read only assembler. We compared Hybracter to existing automated hybrid and long-read only assembly tools using a diverse panel of samples of varying levels of long-read accuracy with manually curated ground truth reference genomes. We demonstrate that Hybracter as a hybrid assembler is more accurate and faster than the existing gold standard automated hybrid assembler Unicycler. We also show that Hybracter with long-reads only is the most accurate long-read only assembler and is comparable to hybrid methods in accurately recovering small plasmids.
Subject(s)
Algorithms, Genome, Bacterial, Software, Plasmids/genetics, Sequence Analysis, DNA/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Bacteria/genetics, Bacteria/classification
ABSTRACT
Improvements in the accuracy and availability of long-read sequencing mean that complete bacterial genomes are now routinely reconstructed using hybrid (i.e. short- and long-reads) assembly approaches. Complete genomes allow a deeper understanding of bacterial evolution and genomic variation beyond single nucleotide variants (SNVs). They are also crucial for identifying plasmids, which often carry medically significant antimicrobial resistance (AMR) genes. However, small plasmids are often missed or misassembled by long-read assembly algorithms. Here, we present Hybracter which allows for the fast, automatic, and scalable recovery of near-perfect complete bacterial genomes using a long-read first assembly approach. Hybracter can be run either as a hybrid assembler or as a long-read only assembler. We compared Hybracter to existing automated hybrid and long-read only assembly tools using a diverse panel of samples of varying levels of long-read accuracy with manually curated ground truth reference genomes. We demonstrate that Hybracter as a hybrid assembler is more accurate and faster than the existing gold standard automated hybrid assembler Unicycler. We also show that Hybracter with long-reads only is the most accurate long-read only assembler and is comparable to hybrid methods in accurately recovering small plasmids.
ABSTRACT
Marine host-associated microbiomes are affected by a combination of species-specific (e.g., host ancestry, genotype) and habitat-specific features (e.g., environmental physiochemistry and microbial biogeography). The stingray epidermis provides a gradient of characteristics from high dermal denticle coverage with low mucus to reduced dermal denticle coverage with high levels of mucus. Here we investigate the effects of host phylogeny and habitat by comparing the epidermal microbiomes of Myliobatis californica (bat rays), with a mucus-rich epidermis, and Urobatis halleri (round rays), with a mucus-reduced epidermis, from two locations, Los Angeles and San Diego, California (a 150 km distance). We found that host microbiomes are species-specific and distinct from the water column; however, the composition of M. californica microbiomes showed more variability between individuals compared to U. halleri. The variability in the microbiome of M. californica caused the microbial taxa to be similar across locations, while U. halleri microbiomes were distinct across locations. Despite taxonomic differences, Shannon diversity is the same across the two locations in U. halleri microbiomes, suggesting the taxonomic composition is locally adapted, but diversity is maintained by the host. Myliobatis californica and U. halleri microbiomes maintain functional similarity across Los Angeles and San Diego, and each ray showed several unique functional genes. Myliobatis californica has a greater relative abundance of RNA Polymerase III-like genes in the microbiome than U. halleri, suggesting specific adaptations to a heavy mucus environment. Construction of Metagenome Assembled Genomes (MAGs) identified novel microbial species within Rhodobacteraceae, Moraxellaceae, Caulobacteraceae, Alcanivoracaceae and Gammaproteobacteria. 
All MAGs had a high abundance of active RNA processing genes and of heavy metal and antibiotic resistance genes, suggesting the stingray mucus supports high microbial growth rates, which may drive high levels of competition within the microbiomes, increasing the antimicrobial properties of the microbes.
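The Shannon diversity reported above is the standard index H = -sum(p_i * ln p_i) over taxon proportions p_i. The short sketch below computes it and shows, with invented counts that are not the study's data, how two communities with different taxa can still have the same diversity.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Invented read counts per taxon at two sites: different taxa,
# same abundance profile, hence the same Shannon diversity.
site_one = {"Rhodobacteraceae": 50, "Moraxellaceae": 30, "Caulobacteraceae": 20}
site_two = {"Alcanivoracaceae": 20, "Gammaproteobacteria": 50, "Moraxellaceae": 30}

h_one = shannon_diversity(site_one.values())
h_two = shannon_diversity(site_two.values())
```

The index depends only on the shape of the abundance distribution, not on which taxa fill it, which is why equal diversity and distinct composition can coexist.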
ABSTRACT
Microbial communities influence both human health and different environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies, and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of the challenges in viral assembly, fragmentation of genomes can occur, leading to the need for new approaches in viral identification. Therefore, the identification and characterisation of novel phages remain a challenge. We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results of viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. Phables is available on GitHub at https://github.com/Vini2/phables.
ABSTRACT
Introduction: The special flavor and fragrance of Chinese liquor are closely related to microorganisms in the fermentation starter Daqu. Changes in the microbial community can affect the stability of liquor yield and quality. Methods: In this study, we used data-independent acquisition mass spectrometry (DIA-MS) for a cohort study of the microbial communities of a total of 42 Daqu samples in six production cycles at different times of a year. The DIA-MS data were searched against a protein database constructed by metagenomic sequencing. Results: The microbial composition and its changes across production cycles were revealed. Functional analysis of the differential proteins was carried out and the metabolic pathways related to the differential proteins were explored. These metabolic pathways were related to the saccharification process in liquor fermentation and the synthesis of secondary metabolites that form the unique flavor and aroma of Chinese liquor. Discussion: We expect that the metaproteome profiling of Daqu from different production cycles will serve as a guide for the control of the fermentation process of Chinese liquor in the future.
ABSTRACT
The gut virome is an incredibly complex part of the gut ecosystem. Gut viruses play a role in many disease states, but it is unknown to what extent the gut virome impacts everyday human health. New experimental and bioinformatic approaches are required to address this knowledge gap. Gut virome colonization begins at birth and is considered unique and stable in adulthood. The stable virome is highly specific to each individual and is modulated by varying factors such as age, diet, disease state, and use of antibiotics. The gut virome primarily comprises bacteriophages, predominantly order Crassvirales, also referred to as crAss-like phages, in industrialized populations and other Caudoviricetes (formerly Caudovirales). The stability of the virome's regular constituents is disrupted by disease. Transferring the fecal microbiome, including its viruses, from a healthy individual can restore the functionality of the gut. It can alleviate symptoms of chronic illnesses such as colitis caused by Clostridioides difficile. Investigation of the virome is a relatively novel field, with new genetic sequences being published at an increasing rate. A large percentage of unknown sequences, termed 'viral dark matter', is one of the significant challenges facing virologists and bioinformaticians. To address this challenge, strategies include mining publicly available viral datasets, untargeted metagenomic approaches, and utilizing cutting-edge bioinformatic tools to quantify and classify viral species. Here, we review the literature surrounding the gut virome, its establishment, its impact on human health, the methods used to investigate it, and the viral dark matter veiling our understanding of the gut virome.
ABSTRACT
Bacteroides, the prominent bacteria in the human gut, play a crucial role in degrading complex polysaccharides. Their abundance is influenced by phages belonging to the Crassvirales order. Despite identifying over 600 Crassvirales genomes computationally, only a few have been successfully isolated. Continued efforts in isolation of more Crassvirales genomes can provide insights into phage-host evolution and infection mechanisms. We focused on wastewater samples as potential sources of phages infecting various Bacteroides hosts. Sequencing, assembly, and characterization of isolated phages revealed 14 complete genomes belonging to three novel Crassvirales species infecting Bacteroides cellulosilyticus WH2. These species, Kehishuvirus sp. 'tikkala' strain Bc01, Kolpuevirus sp. 'frurule' strain Bc03, and 'Rudgehvirus jaberico' strain Bc11, spanned two families and three genera, displaying a broad range of virion productions. Upon testing all successfully cultured Crassvirales species and their respective bacterial hosts, we discovered that they do not exhibit co-evolutionary patterns with their bacterial hosts. Furthermore, we observed variations in gene similarity, with greater shared similarity observed within genera. However, despite belonging to different genera, the three novel species shared a unique structural gene that encodes the tail spike protein. When investigating the relationship between this gene and host interaction, we discovered evidence of purifying selection, indicating its functional importance. Moreover, our analysis demonstrated that this tail spike protein binds to the TonB-dependent receptors present on the bacterial host surface. Combining these observations, our findings provide insights into phage-host interactions and present three Crassvirales species as an ideal system for controlled infectivity experiments on one of the most dominant members of the human enteric virome. 
Impact statement: Bacteriophages play a crucial role in shaping microbial communities within the human gut. Among the most dominant bacteriophages in the human gut microbiome are Crassvirales phages, which infect Bacteroides. Despite being widely distributed, only a few Crassvirales genomes have been isolated, leading to a limited understanding of their biology, ecology, and evolution. This study isolated and characterized three novel Crassvirales genomes belonging to two different families and three genera, but infecting one bacterial host, Bacteroides cellulosilyticus WH2. Notably, the observations confirmed that the phages are not co-evolving with their bacterial hosts, but rather share an ability to exploit similar features in their bacterial host. Additionally, the identification of a critical viral protein undergoing purifying selection and interacting with the bacterial receptors opens doors to targeted therapies against bacterial infections. Given the role of Bacteroides in polysaccharide degradation in the human gut, our findings advance our understanding of phage-host interactions and could have important implications for the development of phage-based therapies. These discoveries may hold implications for improving gut health and metabolism to support overall well-being. Data summary: The genomes used in this research are available on Sequence Read Archive (SRA) within the project PRJNA737576. Bacteroides cellulosilyticus WH2, Kehishuvirus sp. 'tikkala' strain Bc01, Kolpuevirus sp. 'frurule' strain Bc03, and 'Rudgehvirus jaberico' strain Bc11 are all available on GenBank with accessions NZ_CP072251.1 (B. cellulosilyticus WH2), QQ198717 (Bc01), QQ198718 (Bc03), and QQ198719 (Bc11), and we are working on making the strains available through ATCC. The 3D protein structures for the three Crassvirales genomes are available to download at doi.org/10.25451/flinders.21946034.
ABSTRACT
Bacteroides, the prominent bacteria in the human gut, play a crucial role in degrading complex polysaccharides. Their abundance is influenced by phages belonging to the Crassvirales order. Despite identifying over 600 Crassvirales genomes computationally, only a few have been successfully isolated. Continued efforts in isolation of more Crassvirales genomes can provide insights into phage-host evolution and infection mechanisms. We focused on wastewater samples as potential sources of phages infecting various Bacteroides hosts. Sequencing, assembly, and characterization of isolated phages revealed 14 complete genomes belonging to three novel Crassvirales species infecting Bacteroides cellulosilyticus WH2. These species, Kehishuvirus sp. 'tikkala' strain Bc01, Kolpuevirus sp. 'frurule' strain Bc03, and 'Rudgehvirus jaberico' strain Bc11, spanned two families and three genera, displaying a broad range of virion productions. Upon testing all successfully cultured Crassvirales species and their respective bacterial hosts, we discovered that they do not exhibit co-evolutionary patterns with their bacterial hosts. Furthermore, we observed variations in gene similarity, with greater shared similarity observed within genera. However, despite belonging to different genera, the three novel species shared a unique structural gene that encodes the tail spike protein. When investigating the relationship between this gene and host interaction, we discovered evidence of purifying selection, indicating its functional importance. Moreover, our analysis demonstrated that this tail spike protein binds to the TonB-dependent receptors present on the bacterial host surface. Combining these observations, our findings provide insights into phage-host interactions and present three Crassvirales species as an ideal system for controlled infectivity experiments on one of the most dominant members of the human enteric virome.
Subject(s)
Bacteriophages, Spike Glycoprotein, Coronavirus, Humans, Bacteria, Bacteriophages/genetics, Bacteroides/genetics
ABSTRACT
Metagenomics enables the recovery of various genetic materials from different species, thus providing valuable insights into microbial communities. Metagenomic binning groups sequences that belong to different organisms, which is an important step in the early stages of metagenomic analysis pipelines. The classic pipeline followed in metagenomic binning is to assemble short reads into longer contigs and then bin these resulting contigs into groups representing different taxonomic groups in the metagenomic sample. Most of the currently available binning tools are designed to bin metagenomic contigs, but they do not make use of the assembly graphs that produce such assemblies. In this study, we propose MetaCoAG, a metagenomic binning tool that uses assembly graphs with the composition and coverage information of contigs. MetaCoAG estimates the number of initial bins using single-copy marker genes, assigns contigs into bins iteratively, and adjusts the number of bins dynamically throughout the binning process. We show that MetaCoAG significantly outperforms state-of-the-art binning tools by producing similar or more high-quality bins than the second-best binning tool on both simulated and real datasets. To the best of our knowledge, MetaCoAG is the first stand-alone contig-binning tool that directly makes use of the assembly graph information along with other features of the contigs.
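One way to picture how single-copy marker genes can suggest an initial number of bins: if a gene occurs once per genome, then finding it on several different contigs implies at least that many genomes. The sketch below applies this reasoning; the abstract does not spell out the estimator, so treating the maximum count as the estimate is an assumption, and `estimate_initial_bins`, the marker names, and the hits are hypothetical.

```python
from collections import defaultdict


def estimate_initial_bins(marker_hits):
    """Estimate a bin count from (contig, marker_gene) hits.

    Each single-copy marker should occur once per genome, so the marker
    seen on the most distinct contigs gives a lower bound on genomes.
    """
    contigs_per_marker = defaultdict(set)
    for contig, marker in marker_hits:
        contigs_per_marker[marker].add(contig)
    return max((len(c) for c in contigs_per_marker.values()), default=0)


# Hypothetical hits, e.g. parsed from a marker-gene search of the contigs.
toy_hits = [
    ("contig_1", "rpoB"),
    ("contig_4", "rpoB"),
    ("contig_9", "rpoB"),
    ("contig_1", "recA"),
    ("contig_7", "recA"),
    ("contig_4", "rpoB"),  # a repeated hit on the same contig counts once
]
initial_bins = estimate_initial_bins(toy_hits)
```

An estimate like this is only a starting point, which fits the abstract's note that the number of bins is adjusted dynamically during binning.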
Subject(s)
Metagenomics, Microbiota, Metagenomics/methods, Metagenome/genetics, Microbiota/genetics, Algorithms, Sequence Analysis, DNA/methods
ABSTRACT
Metagenomics has enabled culture-independent analysis of micro-organisms present in environmental samples. Metagenomic binning, which involves the grouping of contigs into bins that represent different taxonomic groups, is an important step of a typical metagenomic workflow performed after assembly. The majority of metagenomic binning tools represent the composition and coverage information of contigs as feature vectors consisting of a large number of dimensions. However, these tools use traditional Euclidean distance or Manhattan distance metrics, which become unreliable in high-dimensional space. We propose CH-Bin, a binning approach that leverages the benefits of using convex hull distance for binning contigs represented by high-dimensional feature vectors. We demonstrate using experimental evidence on simulated and real datasets that the use of high-dimensional feature vectors to represent contigs can preserve additional information and result in improved binning results. We further demonstrate that the convex hull distance based binning approach can be effectively utilized in binning such high-dimensional data. To the best of our knowledge, this is the first time that composition information from oligonucleotides of multiple sizes has been used in representing the composition information of contigs and a convex hull distance based binning algorithm has been used to bin metagenomic contigs. The source code of CH-Bin is available at https://github.com/kdsuneraavinash/CH-Bin.
Subject(s)
Metagenome, Metagenomics, Algorithms, Metagenomics/methods, Sequence Analysis, DNA/methods, Software
ABSTRACT
Introduction: Daqu, the Chinese liquor fermentation starter, contains complex microbial communities that are important for the yield, quality, and unique flavor of the liquor produced. However, the composition and metabolism of microbial communities in the different types of high-temperature Daqu (i.e., white, yellow, and black Daqu) are not well understood. Methods: Herein, we used quantitative metaproteomics based on data-independent acquisition (DIA) mass spectrometry to analyze a total of 90 samples of white, yellow, and black Daqu collected in spring, summer, and autumn, revealing the taxonomic and metabolic profiles of different types of Daqu across seasons. Results: Taxonomic composition differences were explored across types of Daqu and seasons, where the under-fermented white Daqu showed higher microbial diversity and seasonal stability. It was demonstrated that yellow Daqu had a higher abundance of saccharifying enzymes for raw material degradation. In addition, considerable seasonal variation of microbial protein abundance was discovered in the over-fermented black Daqu, suggesting elevated carbohydrate and amino acid metabolism in autumn black Daqu. Discussion: We expect that this study will facilitate the understanding of the key microbes and their metabolism in the traditional fermentation process of Chinese liquor production.
ABSTRACT
BACKGROUND: Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). RESULTS: In this paper, we introduce GraphBin2, which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports assigning contigs to multiple bins. CONCLUSION: GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets. GraphBin2 is freely available at https://github.com/Vini2/GraphBin2.
ABSTRACT
Bioinformatics research continues to advance at an increasing scale with the help of techniques such as next-generation sequencing and the availability of tool support to automate bioinformatics processes. With this growth, biological data accumulates at an unprecedented rate, demanding high-performance and high-throughput computing technologies for processing such datasets. The use of hardware accelerators, such as graphics processing units (GPUs), and of distributed computing accelerates the processing of big data in high-performance computing environments. They enable higher degrees of parallelism to be achieved, thereby increasing the throughput. In this paper, we introduce BioWorkflow, an interactive workflow management system to automate bioinformatics analyses with the capability of scheduling parallel tasks using GPU-accelerated and distributed computing. This paper describes a case study carried out to evaluate the performance of a complex workflow with branching executed by BioWorkflow. The results indicate a speed-up of $\times 2.89$ from utilizing GPUs and an average speed-up of $\times 2.832$ (over $n = 5$ scenarios) from parallel execution of graph nodes during multiple sequence alignment calculations. A combined speed-up of $\times 1.71$ is achieved for complex workflows. This confirms that parallelism through GPU acceleration and concurrent execution of workflow nodes yields the expected higher speed-ups compared with mainstream sequential workflow execution. The tool also provides a comprehensive user interface with better interactivity for managing complex workflows; a System Usability Scale score of 82.9 confirms the high usability of the system.