1.
Genome Res ; 2024 Jul 19.
Article in English | MEDLINE | ID: mdl-39029947

ABSTRACT

Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
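
The two-stage idea in the abstract above (a confident heuristic labels a few nodes, then labels spread to the rest of the graph) can be illustrated with a toy sketch. This is not the GraSSRep implementation: neighbour-majority voting is swapped in here as a stand-in for the GNN embedding and random forest, and the coverage thresholds and the names pseudo_label and propagate are hypothetical.

```python
from collections import Counter


def pseudo_label(coverage, repeat_min=3.0, unique_max=1.2):
    """Label only the nodes the heuristic is confident about.

    `coverage` maps node id -> relative read coverage. The thresholds are
    hypothetical values chosen for illustration, not taken from the paper.
    """
    labels = {}
    for node, cov in coverage.items():
        if cov >= repeat_min:
            labels[node] = 1      # confidently repetitive
        elif cov <= unique_max:
            labels[node] = 0      # confidently non-repetitive
        else:
            labels[node] = None   # left for the propagation step
    return labels


def propagate(labels, edges, rounds=10):
    """Spread labels to unlabelled nodes by neighbour-majority vote.

    Stand-in for the learned classifier; assumes every edge endpoint is a
    key of `labels`.
    """
    neighbours = {node: set() for node in labels}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    labels = dict(labels)
    for _ in range(rounds):
        updates = {}
        for node, label in labels.items():
            if label is not None:
                continue
            votes = Counter(
                labels[m] for m in neighbours[node] if labels[m] is not None
            )
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels
```

Calling propagate(pseudo_label(coverage), edges) returns a label for every node reachable from a pseudo-labelled one; isolated unlabelled nodes stay None.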

2.
Nat Methods ; 21(6): 954-966, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38689099

ABSTRACT

Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements stem not only from improvements in sequencing accuracy but also from rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.


Subject(s)
High-Throughput Nucleotide Sequencing , Metagenome , Metagenomics , Microbiota , Metagenomics/methods , Metagenome/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Humans , Sequence Analysis, DNA/methods , Computational Biology/methods
3.
Nat Methods ; 19(7): 845-853, 2022 07.
Article in English | MEDLINE | ID: mdl-35773532

ABSTRACT

16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation-maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu.
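
As a rough illustration of how an expectation-maximization algorithm can turn ambiguous read assignments into an abundance profile, the sketch below iterates between sharing each read among taxa (E-step) and re-estimating abundances from those shares (M-step). It is a generic EM mixture sketch, not Emu's code; the function name em_abundance and the likelihood-matrix input are assumptions made for the example.

```python
def em_abundance(likelihoods, n_taxa, iterations=100):
    """Estimate relative abundances with a plain EM loop.

    likelihoods[r][t] is an assumed, precomputed P(read r | taxon t),
    for example derived from alignment scores.
    """
    abundance = [1.0 / n_taxa] * n_taxa   # start from a uniform profile
    for _ in range(iterations):
        totals = [0.0] * n_taxa
        for row in likelihoods:
            # E-step: share the read among taxa, weighted by current abundance.
            weights = [abundance[t] * row[t] for t in range(n_taxa)]
            norm = sum(weights)
            if norm == 0.0:
                continue                  # read is compatible with no taxon
            for t in range(n_taxa):
                totals[t] += weights[t] / norm
        assigned = sum(totals)
        if assigned == 0.0:
            break
        # M-step: abundances become the normalized expected read counts.
        abundance = [x / assigned for x in totals]
    return abundance
```

With three reads that match only the first taxon and one read that matches both taxa equally, the ambiguous read is progressively attributed to the first taxon, which is how this kind of procedure suppresses spurious low-abundance calls.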


Subject(s)
Dromaiidae , Microbiota , Nanopore Sequencing , Animals , Bacteria/genetics , Dromaiidae/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Phylogeny , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA/methods
4.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38724243

ABSTRACT

MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.


Subject(s)
Genome, Bacterial , Sequence Alignment , Software , Sequence Alignment/methods , Genomics/methods , Algorithms
5.
Bioinformatics ; 40(Supplement_1): i58-i67, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940156

ABSTRACT

MOTIVATION: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. RESULTS: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. AVAILABILITY AND IMPLEMENTATION: rhea is open source and available at: https://github.com/treangenlab/rhea.
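
The coverage comparison described in the abstract above can be written down in a few lines. The sketch below is only an illustration of a log fold change between successive samples; the base-2 logarithm, the pseudocount, and the names log_fold_change and coverage_changes are assumptions, not details taken from rhea.

```python
import math


def log_fold_change(before, after, pseudocount=1.0):
    """Log2 ratio of coverage in two successive samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a graph node has no coverage in one sample.
    """
    return math.log2((after + pseudocount) / (before + pseudocount))


def coverage_changes(series, pseudocount=1.0):
    """Map each graph node to its log fold changes across a sample series.

    `series` maps node id -> list of coverage values, one per sample,
    ordered by time (or whatever metric defines the series).
    """
    return {
        node: [
            log_fold_change(a, b, pseudocount)
            for a, b in zip(cov, cov[1:])
        ]
        for node, cov in series.items()
    }
```

Positive values mark nodes whose coverage is rising between successive samples, negative values mark nodes that are declining; any threshold for calling a variant would be a further modelling choice.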


Subject(s)
Genome, Bacterial , Metagenome , Microbiota , Microbiota/genetics , Metagenomics/methods , Gene Transfer, Horizontal , Bacteria/genetics , Algorithms
6.
Genome Res ; 31(4): 635-644, 2021 04.
Article in English | MEDLINE | ID: mdl-33602693

ABSTRACT

The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, there are a relatively small number of single nucleotide polymorphisms (SNPs) that differentiate the main SARS-CoV-2 lineages that have spread throughout the world. In this study, we investigated 129 RNA-seq data sets and 6928 consensus genomes to contrast the intra-host and inter-host diversity of SARS-CoV-2. Our analyses yielded three major observations. First, the mutational profile of SARS-CoV-2 highlights intra-host single nucleotide variant (iSNV) and SNP similarity, albeit with differences in C > U changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to MERS-CoV than SARS-CoV-1. Third, a significant fraction of insertions and deletions contribute to the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.


Subject(s)
COVID-19/diagnosis , COVID-19/transmission , Genetic Variation , Genome, Viral , Real-Time Polymerase Chain Reaction/methods , SARS-CoV-2/genetics , COVID-19/virology , Host-Pathogen Interactions , Humans , Polymorphism, Single Nucleotide
7.
Bioinformatics ; 39(39 Suppl 1): i47-i56, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387148

ABSTRACT

MOTIVATION: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, low-level knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully decipher and control microbial communities. RESULTS: We present a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. On extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them to the disease samples, we can restore the gut microbiome in recurrent Clostridioides difficile infection (rCDI) patients to a healthy state. We also applied Bakdrive to two real datasets, rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions. AVAILABILITY AND IMPLEMENTATION: Bakdrive is open-source and available at: https://gitlab.com/treangenlab/bakdrive.


Subject(s)
Crohn Disease , Gastrointestinal Microbiome , Microbiota , Humans , Metagenome , Bacteria/genetics
8.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37603771

ABSTRACT

MOTIVATION: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap.
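
To make the terms in the abstract above concrete, the sketch below computes the exact Jaccard similarity of two k-mer sets and a simple minimizer winnowing of a sequence. It is a teaching sketch, not MashMap code: minimizers are chosen here by lexicographic order (an assumption made for readability, since practical tools typically order k-mers by a hash), and the function names are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minimizers(seq, k, w):
    """Smallest k-mer in each window of w consecutive k-mers.

    "Smallest" means lexicographically smallest in this sketch.
    """
    ks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(ks[i:i + w]) for i in range(len(ks) - w + 1)}
```

Comparing jaccard(kmers(a, k), kmers(b, k)) with jaccard(minimizers(a, k, w), minimizers(b, k, w)) on a pair of sequences shows that the winnowed sets can give a different value from the full k-mer sets, which is the kind of discrepancy the abstract refers to.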


Subject(s)
Computational Biology , Genomics
9.
Infect Immun ; 90(5): e0033421, 2022 05 19.
Article in English | MEDLINE | ID: mdl-34780277

ABSTRACT

To identify sequences with a role in microbial pathogenesis, we assessed the adequacy of their annotation by existing controlled vocabularies and sequence databases. Our goal was to regularize descriptions of microbial pathogenesis for improved integration with bioinformatic applications. Here, we review the challenges of annotating sequences for pathogenic activity. We relate the categorization of more than 2,750 sequences of pathogenic microbes through a controlled vocabulary called Functions of Sequences of Concern (FunSoCs). These allow for an ease of description by both humans and machines. We provide a subset of 220 fully annotated sequences in the supplemental material as examples. The use of this compact (∼30 terms), controlled vocabulary has potential benefits for research in microbial genomics, public health, biosecurity, biosurveillance, and the characterization of new and emerging pathogens.


Subject(s)
Computational Biology , Vocabulary, Controlled , Humans
10.
Nucleic Acids Res ; 48(10): 5217-5234, 2020 06 04.
Article in English | MEDLINE | ID: mdl-32338745

ABSTRACT

As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
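
A minimal sketch of the MinHash idea mentioned in the abstract above: keep only the s smallest hash values of a sequence's k-mers, then estimate Jaccard similarity from the sketches alone. This is the bottom-s variant written for illustration; the choice of BLAKE2b as the hash and the names kmer_hash, sketch, and estimate_jaccard are assumptions, not taken from any reviewed tool.

```python
import hashlib


def kmer_hash(kmer):
    """64-bit hash of a k-mer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hashes of seq."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def estimate_jaccard(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 1.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for x in merged if x in shared) / len(merged)
```

The sketch size s trades memory for accuracy: each sequence is reduced to at most s integers regardless of its length, and when s is at least the number of distinct k-mers the estimate coincides with the exact Jaccard similarity.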


Subject(s)
Algorithms , Metagenomics/methods , Probability , Signal Processing, Computer-Assisted , Humans , Metagenome/genetics
11.
Int J Mol Sci ; 23(9)2022 Apr 19.
Article in English | MEDLINE | ID: mdl-35562867

ABSTRACT

Traumatic brain injury (TBI) causes neuroinflammation and neurodegeneration, both of which increase the risk and accelerate the progression of Alzheimer's disease (AD). The gut microbiome is an essential modulator of the immune system, impacting the brain. AD has been associated with reduced diversity and alterations in the community composition of the gut microbiota. This study aimed to determine whether the gut microbiota from AD mice exacerbates neurological deficits after TBI in control mice. We prepared fecal microbiota transplants from 18- to 24-month-old 3×Tg-AD (FMT-AD) and from healthy control (FMT-young) mice. FMTs were administered orally to young control C57BL/6 (wild-type, WT) mice after they underwent controlled cortical impact (CCI) injury, as a model of TBI. Then, we characterized the microbiota composition of the fecal samples by full-length 16S rRNA gene sequencing analysis. We collected the blood, brain, and gut tissues for protein and immunohistochemical analysis. Our results showed that FMT-AD administration stimulates a higher relative abundance of the genus Muribaculum and a decrease in Lactobacillus johnsonii compared to FMT-young in WT mice. Furthermore, WT mice exhibited larger lesions, increased activated microglia/macrophages, and reduced motor recovery after FMT-AD compared to FMT-young one day after TBI. In summary, we observed gut microbiota from AD mice to have a detrimental effect and aggravate the neuroinflammatory response and neurological outcomes after TBI in young WT mice.


Subject(s)
Alzheimer Disease , Brain Injuries, Traumatic , Alzheimer Disease/pathology , Alzheimer Disease/therapy , Animals , Brain Injuries, Traumatic/therapy , Fecal Microbiota Transplantation/methods , Mice , Mice, Inbred C57BL , RNA, Ribosomal, 16S/genetics
12.
Brief Bioinform ; 20(4): 1140-1150, 2019 07 19.
Article in English | MEDLINE | ID: mdl-28968737

ABSTRACT

Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.


Subject(s)
Metagenome , Metagenomics/methods , Microbiota/genetics , Algorithms , Computational Biology , Databases, Genetic/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Metagenomics/trends , Software
13.
Nat Rev Genet ; 13(1): 36-46, 2011 Nov 29.
Article in English | MEDLINE | ID: mdl-22124482

ABSTRACT

Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.


Subject(s)
Computational Biology/methods , Repetitive Sequences, Nucleic Acid , Sequence Alignment/methods , Sequence Analysis, DNA , Sequence Analysis, RNA , Software , Algorithms , Animals , DNA/genetics , Genome/genetics , Humans , Molecular Sequence Data , Plants , RNA/genetics , Repetitive Sequences, Nucleic Acid/genetics , Reproducibility of Results , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/trends , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/trends
14.
Genome Res ; 22(3): 557-67, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22147368

ABSTRACT

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
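
The abstract above contrasts contiguity with correctness. As an illustration of a contiguity statistic, the sketch below computes N50, a commonly used measure; the abstract does not say which statistics were used, so treating N50 as the example is an assumption, as is the function name n50.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly.

    Returns 0 for an empty assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

Because the statistic depends only on contig lengths, wrongly joining two 10 kb contigs into one 20 kb contig doubles N50 while making the assembly less correct, which is one way contiguity and correctness can come apart.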


Subject(s)
Algorithms , Genomics/methods , Sequence Analysis, DNA , Animals , Computational Biology/methods , Genome , Genome, Bacterial/genetics , Humans , Internet , Reproducibility of Results
16.
BMC Bioinformatics ; 15: 126, 2014 May 03.
Article in English | MEDLINE | ID: mdl-24884846

ABSTRACT

BACKGROUND: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. RESULTS: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. CONCLUSIONS: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.


Subject(s)
Genome, Microbial , Genomics/methods , Software , Genome, Bacterial , Mycobacterium tuberculosis/genetics , Rhodobacter sphaeroides/genetics , Sequence Analysis, DNA
17.
PLoS Genet ; 7(1): e1001284, 2011 Jan 27.
Article in English | MEDLINE | ID: mdl-21298028

ABSTRACT

Gene duplication followed by neo- or sub-functionalization deeply impacts the evolution of protein families and is regarded as the main source of adaptive functional novelty in eukaryotes. While there is ample evidence of adaptive gene duplication in prokaryotes, it is not clear whether duplication outweighs the contribution of horizontal gene transfer in the expansion of protein families. We analyzed closely related prokaryote strains or species with small genomes (Helicobacter, Neisseria, Streptococcus, Sulfolobus), average-sized genomes (Bacillus, Enterobacteriaceae), and large genomes (Pseudomonas, Bradyrhizobiaceae) to untangle the effects of duplication and horizontal transfer. After removing the effects of transposable elements and phages, we show that the vast majority of expansions of protein families are due to transfer, even among large genomes. Transferred genes--xenologs--persist longer in prokaryotic lineages possibly due to a higher/longer adaptive role. On the other hand, duplicated genes--paralogs--are expressed more, and, when persistent, they evolve more slowly. This suggests that gene transfer and gene duplication have very different roles in shaping the evolution of biological systems: transfer allows the acquisition of new functions and duplication leads to higher gene dosage. Accordingly, we show that paralogs share most protein-protein interactions and genetic regulators, whereas xenologs share very few of them. Prokaryotes invented most of life's biochemical diversity. Therefore, the study of the evolution of biological systems should explicitly account for the predominant role of horizontal gene transfer in the diversification of protein families.


Subject(s)
Bacteria/genetics , Evolution, Molecular , Gene Duplication/genetics , Gene Transfer, Horizontal , Genome, Bacterial , Bacillus/genetics , Bradyrhizobiaceae/genetics , Computational Biology , Enterobacteriaceae/genetics , Helicobacter/genetics , Multigene Family/genetics , Neisseria/genetics , Phylogeny , Pseudomonas/genetics , Streptococcus/genetics , Sulfolobus/genetics
18.
Pac Symp Biocomput ; 29: 506-520, 2024.
Article in English | MEDLINE | ID: mdl-38160303

ABSTRACT

The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb.
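
The k-core analysis mentioned in the abstract above can be illustrated with the standard peeling procedure: repeatedly remove the node of lowest remaining degree and record the highest such degree seen so far. This is a simple quadratic-time sketch for small graphs, not KombOver's implementation, and the function name core_numbers is hypothetical.

```python
def core_numbers(edges):
    """Core number of every node in an undirected graph.

    A node has core number k if it belongs to the k-core (the maximal
    subgraph in which every node has degree >= k) but not the (k+1)-core.
    """
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue                      # ignore self-loops
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in adjacency[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core
```

For a triangle with one extra node hanging off it, the three triangle nodes get core number 2 and the hanging node gets 1, so densely connected regions of a graph stand out by their higher core numbers.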


Subject(s)
Gastrointestinal Microbiome , Microbiota , Humans , DNA Copy Number Variations , Computational Biology , Microbiota/genetics , Metagenome , Metagenomics/methods
19.
J Clin Invest ; 134(2)2024 Jan 16.
Article in English | MEDLINE | ID: mdl-37962956

ABSTRACT

Targeted metagenomic sequencing is an emerging strategy to survey disease-specific microbiome biomarkers for clinical diagnosis and prognosis. However, this approach often yields inconsistent or conflicting results owing to inadequate study power and sequencing bias. We introduce Taxa4Meta, a bioinformatics pipeline explicitly designed to compensate for technical and demographic bias. We designed and validated Taxa4Meta for accurate taxonomic profiling of 16S rRNA amplicon data acquired from different sequencing strategies. Taxa4Meta offers significant potential in identifying clinical dysbiotic features that can reliably predict human disease, validated comprehensively via reanalysis of individual patient 16S data sets. We leveraged the power of Taxa4Meta's pan-microbiome profiling to generate 16S-based classifiers that exhibited excellent utility for stratification of diarrheal patients with Clostridioides difficile infection, irritable bowel syndrome, or inflammatory bowel diseases, which represent common misdiagnoses and pose significant challenges for clinical management. We believe that Taxa4Meta represents a new "best practices" approach to individual microbiome surveys that can be used to define gut dysbiosis at a population-scale level.


Subject(s)
Gastrointestinal Microbiome , Microbiota , Humans , Dysbiosis , RNA, Ribosomal, 16S/genetics , Diarrhea/genetics
20.
bioRxiv ; 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38352342

ABSTRACT

Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4x and reduce runtime by over 2x, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability: Parsnp is available at https://github.com/marbl/parsnp.
