ABSTRACT
BACKGROUND: Strain-level RNA virus characterization is essential for developing prevention and treatment strategies. Viral metagenomic data, which can contain sequences of both known and novel viruses, provide new opportunities for characterizing RNA viruses. Although a number of pipelines exist for analyzing viruses in metagenomic data, they have limitations. First, viruses that lack closely related reference genomes cannot be detected with high sensitivity. Second, strain-level analysis is usually missing. RESULTS: In this study, we developed a hybrid pipeline named TAR-VIR that reconstructs viral strains without relying on complete or high-quality reference genomes. It is optimized for identifying RNA viruses from metagenomic data by combining an effective read classification method with our in-house strain-level de novo assembly tool. TAR-VIR was tested on both simulated and real viral metagenomic data sets. The results demonstrated that TAR-VIR competes favorably with the other tested tools. CONCLUSION: TAR-VIR can be used standalone for viral strain reconstruction from metagenomic data. Alternatively, its read-recruiting stage can be combined with other de novo assembly tools for superior viral functional and taxonomic analyses. The source code and documentation of TAR-VIR are available at https://github.com/chjiao/TAR-VIR.
Subjects
RNA Viruses/genetics; Software; Humans; Metagenomics/methods; RNA Viruses/classification; Sequence Analysis, RNA
ABSTRACT
Current technologies allow microbial communities to be sequenced directly from the environment without prior culturing. A major problem when analyzing a microbial sample is taxonomically annotating its reads to identify the species it contains. Most currently available methods classify reads using a set of reference genomes and their k-mers. While these methods have reached near-perfect precision, their sensitivity (the fraction of reads that are actually classified) is often poor. One reason is that the reads in a sample can differ substantially from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, we propose ClassGraph, a new taxonomic classification method that makes use of the read overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets with several taxonomic classification tools, and the results showed improved sensitivity and F-measure while maintaining high precision. ClassGraph can improve classification accuracy, especially in difficult cases such as viral and real datasets, where traditional tools can classify <40% of reads.
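The idea of refining an existing classifier's output by propagating labels over a read overlap graph can be illustrated with a minimal sketch. This is not ClassGraph's actual algorithm, which the abstract does not specify. It is a simple synchronous majority-vote propagation, and the function name `propagate_labels`, the adjacency-dictionary graph, and the toy read identifiers are all assumptions made for illustration.

```python
from collections import Counter


def propagate_labels(overlaps, labels, max_iters=10):
    """Label unclassified reads by majority vote over labelled neighbours.

    overlaps: dict mapping a read id to the set of read ids it overlaps
    labels:   dict mapping a read id to the taxon given by an upstream tool
    Hypothetical helper; not the ClassGraph implementation.
    """
    labels = dict(labels)  # work on a copy, leave the caller's dict untouched
    for _ in range(max_iters):
        updates = {}
        for read, neighbours in overlaps.items():
            if read in labels:
                continue  # labels from the upstream classifier are kept
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if not votes:
                continue  # no labelled neighbour yet
            ranked = votes.most_common(2)
            # Only accept a strict majority; tied reads stay unclassified
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                updates[read] = ranked[0][0]
        if not updates:
            break  # converged: nothing new was labelled in this round
        labels.update(updates)
    return labels


# Toy overlap graph: a chain r1-r2-r3-r4 plus an isolated read r5
overlaps = {
    "r1": {"r2"},
    "r2": {"r1", "r3"},
    "r3": {"r2", "r4"},
    "r4": {"r3"},
    "r5": set(),
}
seed = {"r1": "virus_A"}  # only r1 was classified by the upstream tool
result = propagate_labels(overlaps, seed)
```

In this toy case the label spreads along the chain one hop per iteration, while the isolated read `r5` has no overlaps and remains unclassified. This shows how such a scheme can raise sensitivity without forcing a label onto every read.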
Subjects
Algorithms; Microbiota; Sequence Analysis, DNA; Metagenome; High-Throughput Nucleotide Sequencing/methods; Metagenomics/methods
ABSTRACT
Background: Ongoing research on the mosquito microbiome aims to uncover novel strategies to reduce pathogen transmission. However, sequencing costs, especially for metagenomics, remain significant. A resource that is increasingly used to gain insights into host-associated microbiomes is the large amount of publicly available genomic data based on whole organisms such as mosquitoes. These data include sequencing reads from the host-associated microbes and provide the opportunity to gain additional value from these initially host-focused sequencing projects. Methods: To analyse non-host reads from existing genomic data, we developed a Snakemake workflow called MINUUR (Microbial INsights Using Unmapped Reads). Within MINUUR, reads derived from the host-associated microbiome were extracted and characterised using taxonomic classification and metagenome assembly, followed by binning and quality assessment. We applied this pipeline to five publicly available Aedes aegypti genomic datasets, consisting of 62 samples with a broad range of sequencing depths. Results: We demonstrate that MINUUR recovers previously identified phyla and genera and is able to extract bacterial metagenome-assembled genomes (MAGs) associated with the microbiome. Of these MAGs, 42 are high-quality representatives with >90% completeness and <5% contamination. These MAGs improve the genomic representation of the mosquito microbiome and can be used to facilitate genomic investigation of key genes of interest. Furthermore, we show that samples with a high number of Kraken2-assigned reads produce more MAGs. Conclusions: Our metagenomics workflow, MINUUR, was applied to a range of Aedes aegypti genomic samples to characterise microbiome-associated reads. We confirm the presence of key mosquito-associated symbionts that have previously been identified in other studies and recovered high-quality bacterial MAGs. In addition, MINUUR and its associated documentation are freely available on GitHub and provide researchers with a convenient workflow to investigate microbiome data included in the sequencing data for any applicable host genome of interest.
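The high-quality criterion quoted above (>90% completeness and <5% contamination) amounts to a simple filter over per-bin quality estimates. The sketch below is not part of MINUUR. The function name `high_quality_mags`, the bin names, and the quality values are made up for illustration, and the quality estimates are assumed to be available as percentages.

```python
def high_quality_mags(bins, min_completeness=90.0, max_contamination=5.0):
    """Return the names of bins passing both quality thresholds.

    bins: dict mapping a bin name to (completeness %, contamination %)
    Thresholds are strict, mirroring ">90%" and "<5%" in the abstract.
    """
    return [
        name
        for name, (completeness, contamination) in bins.items()
        if completeness > min_completeness and contamination < max_contamination
    ]


# Hypothetical quality estimates for four bins
bins = {
    "bin_1": (98.2, 1.1),  # passes both thresholds
    "bin_2": (91.0, 6.3),  # too contaminated
    "bin_3": (74.5, 0.4),  # too incomplete
    "bin_4": (90.0, 2.0),  # exactly 90% does not satisfy ">90%"
}
selected = high_quality_mags(bins)
```

Only `bin_1` is selected here. The boundary case `bin_4` shows why it matters whether the thresholds are treated as strict or inclusive.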
ABSTRACT
Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the proliferation of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied to model species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than the 0.93 achieved by Kraken2 and the 0.85 achieved by Centrifuge on the same test set. Application of DL-TODA to human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and was less biased toward a single taxon.
Subjects
Deep Learning; Microbiota; Humans; Neural Networks, Computer; Bacteria/genetics; Metagenome; Microbiota/genetics; Algorithms
ABSTRACT
Microbiome analysis is quickly moving towards high-throughput methods such as metagenomic sequencing. Accurate taxonomic classification of metagenomic data relies on reference sequence databases and their associated taxonomy. However, for understudied environments such as the rumen microbiome, many sequences will be derived from novel or uncultured microbes that are not present in reference databases. As a result, taxonomic classification of metagenomic data from understudied environments may be inaccurate. To assess the accuracy of taxonomic read classification, this study classified metagenomic data that had been simulated from cultured rumen microbial genomes from the Hungate collection. To assess the impact of reference databases on the accuracy of taxonomic classification, the data were classified with Kraken 2 using several reference databases. We found that the choice and composition of the reference database significantly affected taxonomic classification results and accuracy. In particular, NCBI RefSeq proved to be a poor choice of database. Our results indicate that inaccurate read classification is likely to be a significant problem, affecting all studies that use insufficient reference databases. We observed that adding cultured reference genomes from the rumen to the reference database greatly improved classification rate and accuracy. We also demonstrated that metagenome-assembled genomes (MAGs) have the potential to further enhance classification accuracy by representing uncultivated microbes, sequences of which would otherwise be unclassified or incorrectly classified. However, classification accuracy was strongly dependent on the taxonomic labels assigned to these MAGs. We therefore highlight the importance of accurate reference taxonomic information and suggest that, with formal taxonomic lineages, MAGs have the potential to improve classification rate and accuracy, particularly in environments such as the rumen that are understudied or contain many novel genomes.
ABSTRACT
BACKGROUND: Widespread bioinformatic resource development generates a constantly evolving and abundant landscape of workflows and software. For analysis of the microbiome, workflows typically begin with taxonomic classification of the microorganisms that are present in a given environment. Additional investigation is then required to uncover the functionality of the microbial community, in order to characterize its currently or potentially active biological processes. Such functional analysis of metagenomic data can be computationally demanding for high-throughput sequencing experiments. Instead, we can directly compare sequencing reads to a functionally annotated database. However, since reads frequently match multiple sequences equally well, analyses benefit from a hierarchical annotation tree, e.g. for taxonomic classification where reads are assigned to the lowest taxonomic unit. RESULTS: To facilitate functional microbiome analysis, we re-purpose well-known taxonomic classification tools to allow us to perform direct functional sequencing read classification with the added benefit of a functional hierarchy. To enable this, we develop and present a tree-shaped functional hierarchy representing the molecular function subset of the Gene Ontology annotation structure. We use this functional hierarchy to replace the standard phylogenetic taxonomy used by the classification tools and assign query sequences accurately to the lowest possible molecular function in the tree. We demonstrate this with simulated and experimental datasets, where we reveal new biological insights. CONCLUSIONS: We demonstrate that improved functional classification of metagenomic sequencing reads is possible by re-purposing a range of taxonomic classification tools that are already well-established, in conjunction with either protein or nucleotide reference databases. 
We leverage the advances in speed, accuracy and efficiency that have been made for taxonomic classification and translate these benefits for the rapid functional classification of microbiomes. While we focus on a specific set of commonly used methods, the functional annotation approach has broad applicability across other sequence classification tools. We hope that re-purposing becomes a routine consideration during bioinformatic resource development. Video abstract.
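The hierarchical assignment described above, where a read that matches several reference sequences equally well is placed at the lowest node shared by all of them, can be sketched as a lowest-common-ancestor lookup in a tree. This is an illustration only, not the authors' code or the internals of any particular classification tool. The `parent` mapping is a toy tree whose node names merely resemble Gene Ontology molecular function terms, and both function names are hypothetical.

```python
def lineage(node, parent):
    """Return the path from a node up to the root, node first."""
    path = [node]
    while node in parent:  # the root has no entry in the parent mapping
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Return the deepest node that is an ancestor of (or equal to) every input node."""
    if not nodes:
        return None
    first = lineage(nodes[0], parent)
    shared = set(first)
    for node in nodes[1:]:
        shared &= set(lineage(node, parent))
    # 'first' runs from leaf to root, so the first shared entry is the lowest
    for node in first:
        if node in shared:
            return node
    return None


# Toy functional hierarchy (child -> parent); illustrative, not the real GO graph
parent = {
    "catalytic activity": "molecular_function",
    "binding": "molecular_function",
    "hydrolase activity": "catalytic activity",
    "transferase activity": "catalytic activity",
    "kinase activity": "transferase activity",
}

# A read hitting a kinase and a hydrolase equally well is assigned to their shared parent
ambiguous = lowest_common_ancestor(["kinase activity", "hydrolase activity"], parent)
unambiguous = lowest_common_ancestor(["kinase activity"], parent)
distant = lowest_common_ancestor(["kinase activity", "binding"], parent)
```

A read with a single best hit keeps its most specific function, while conflicting hits push the assignment towards the root. This is the same behaviour taxonomic classifiers show when hits span several taxa.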
Subjects
Classification/methods; Computational Biology/methods; High-Throughput Nucleotide Sequencing; Metagenome/genetics; Metagenomics/methods; Microbiota/genetics; Software; Phylogeny
ABSTRACT
Specified Certainty Classification (SCC) is a paradigm for employing classifiers whose outputs carry uncertainties, typically in the form of Bayesian posterior probabilities. By allowing the classifier output to be less precise than one of a set of atomic decisions, SCC allows all decisions to achieve a specified level of certainty, and it provides insights into classifier behavior by examining all decisions that are possible. Our primary illustration is read classification for reference-guided genome assembly, but we demonstrate the breadth of SCC by also analyzing COVID-19 vaccination data.
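One simplified way to read the idea of trading precision for certainty: rather than always committing to a single atomic class, return the smallest set of classes whose combined posterior probability reaches the requested certainty. The sketch below is an assumption-laden illustration, not the SCC procedure itself, whose decision structure the abstract does not describe. The function name `certain_decision` and the candidate locus labels are hypothetical.

```python
def certain_decision(posterior, certainty):
    """Return the smallest set of classes whose total posterior meets the certainty.

    posterior: dict mapping an atomic class to its posterior probability
    certainty: required total probability, between 0 and 1
    Simplified greedy illustration; not the published SCC method.
    """
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    chosen, total = set(), 0.0
    for label, probability in ranked:
        chosen.add(label)
        total += probability
        if total >= certainty:
            break  # the decision is now certain enough, stop widening it
    return chosen, total


# Hypothetical posterior for one read over four candidate mapping loci
posterior = {"locus_A": 0.5, "locus_B": 0.25, "locus_C": 0.125, "locus_D": 0.125}

precise, precise_mass = certain_decision(posterior, certainty=0.5)
coarse, coarse_mass = certain_decision(posterior, certainty=0.7)
```

With a modest certainty requirement the decision is a single locus. Raising the requirement widens the decision to two loci, which is less precise but meets the specified certainty.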
ABSTRACT
There is still a lack of fast and accurate classification tools to identify the taxonomies of noisy long reads, which is a bottleneck to the use of promising long-read metagenomic sequencing technologies. Herein, we propose the de Bruijn graph-based Sparse Approximate Match Block Analyzer (deSAMBA), a tailored long-read classification approach that uses a novel pseudo-alignment algorithm based on sparse approximate match blocks (SAMB). Benchmarks on real sequencing datasets demonstrate that deSAMBA achieves high yields and fast speed simultaneously, outperforming state-of-the-art tools and showing strong potential for cutting-edge metagenomics studies.