Abstract
Spaceflight is known to impose changes on human physiology with unknown molecular etiologies. To reveal these causes, we used a multi-omics, systems biology analytical approach using biomedical profiles from fifty-nine astronauts and data from NASA's GeneLab derived from hundreds of samples flown in space to determine transcriptomic, proteomic, metabolomic, and epigenetic responses to spaceflight. Overall pathway analyses on the multi-omics datasets showed significant enrichment for mitochondrial processes, as well as innate immunity, chronic inflammation, cell cycle, circadian rhythm, and olfactory functions. Importantly, NASA's Twin Study provided a platform to confirm several of our principal findings. Evidence of altered mitochondrial function and DNA damage was also found in the urine and blood metabolic data compiled from the astronaut cohort and NASA Twin Study data, indicating mitochondrial stress as a consistent phenotype of spaceflight.
Subject(s)
Genomics , Mitochondria/pathology , Space Flight , Physiological Stress , Animals , Circadian Rhythm , Extracellular Matrix/metabolism , Humans , Innate Immunity , Lipid Metabolism , Metabolic Flux Analysis , Inbred BALB C Mice , Inbred C57BL Mice , Muscles/immunology , Organ Specificity , Smell/physiology
Abstract
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, in which genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and nonrepetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudolabels for a small proportion of the nodes. We then use those pseudolabels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic data sets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, experiments with synthetic metagenomic data sets reveal that incorporating the graph structure and the GNN enhances the detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
Subject(s)
Metagenomics , Supervised Machine Learning , Metagenomics/methods , Repetitive Nucleic Acid Sequences , Computer Neural Networks , DNA Sequence Analysis/methods , Algorithms , Metagenome
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements stem not only from improvements in sequencing accuracy but also from rapidly evolving analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
Subject(s)
High-Throughput Nucleotide Sequencing , Metagenome , Metagenomics , Microbiota , Metagenomics/methods , Metagenome/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Humans , DNA Sequence Analysis/methods , Computational Biology/methods
Abstract
16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation-maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu.
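The expectation-maximization idea behind this kind of abundance estimation can be sketched as follows. This is a minimal illustration under stated assumptions, not Emu's implementation: the function name `em_abundance`, the input format (one dict of per-taxon likelihoods per read), and the stopping rule are all hypothetical choices made for the example.

```python
def em_abundance(read_likelihoods, n_iter=100, tol=1e-9):
    """Estimate relative taxon abundances with a basic EM loop.

    read_likelihoods: list of dicts, one per read, mapping a candidate
    taxon to P(read | taxon). This input format is an assumption of the
    sketch; a real tool would derive these values from alignments.
    """
    taxa = sorted({t for read in read_likelihoods for t in read})
    if not taxa:
        return {}
    # Start from a uniform abundance profile.
    abundance = {t: 1.0 / len(taxa) for t in taxa}
    for _ in range(n_iter):
        totals = {t: 0.0 for t in taxa}
        # E-step: split each read across taxa in proportion to
        # (current abundance) x (likelihood of the read under the taxon).
        for read in read_likelihoods:
            weights = {t: abundance[t] * p for t, p in read.items()}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # read cannot be explained by any taxon
            for t, w in weights.items():
                totals[t] += w / norm
        assigned = sum(totals.values())
        if assigned == 0.0:
            break
        # M-step: new abundance is the share of fractional read counts.
        updated = {t: totals[t] / assigned for t in taxa}
        change = max(abs(updated[t] - abundance[t]) for t in taxa)
        abundance = updated
        if change < tol:
            break
    return abundance


if __name__ == "__main__":
    reads = [{"A": 1.0}] * 3 + [{"B": 1.0}] + [{"A": 0.5, "B": 0.5}] * 2
    print(em_abundance(reads))
```

In the usage example, three reads match only taxon A, one matches only taxon B, and two are equally compatible with both. The loop shares the ambiguous reads according to the current profile, so the estimate settles near 0.75 for A and 0.25 for B.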
Subject(s)
Dromaiidae , Microbiota , Nanopore Sequencing , Animals , Bacteria/genetics , Dromaiidae/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Phylogeny , 16S Ribosomal RNA/genetics , DNA Sequence Analysis/methods
Abstract
MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.
Subject(s)
Bacterial Genome , Sequence Alignment , Software , Sequence Alignment/methods , Genomics/methods , Algorithms
Abstract
MOTIVATION: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. RESULTS: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. AVAILABILITY AND IMPLEMENTATION: rhea is open source and available at: https://github.com/treangenlab/rhea.
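The log fold change calculation described in this abstract can be illustrated with a short sketch. It is not taken from the rhea code base: the function names, the pseudocount of 1, the base-2 logarithm, and the threshold of 1.0 are assumptions chosen for the example, and the sketch ignores the graph structure that rhea also uses.

```python
import math


def log_fold_changes(coverage, pseudocount=1.0):
    """Log2 fold change in coverage between successive samples.

    coverage: dict mapping a graph node (or edge) identifier to a list
    of coverage values, one per sample, in series order. The pseudocount
    (an assumption of this sketch) avoids division by zero.
    """
    return {
        node: [
            math.log2((later + pseudocount) / (earlier + pseudocount))
            for earlier, later in zip(series, series[1:])
        ]
        for node, series in coverage.items()
    }


def label_changes(lfc, threshold=1.0):
    """Label each successive-sample step as thriving, declining, or stable.

    The threshold is hypothetical; a value of 1.0 corresponds to a
    two-fold change in pseudocount-adjusted coverage.
    """
    def label(value):
        if value >= threshold:
            return "thriving"
        if value <= -threshold:
            return "declining"
        return "stable"

    return {node: [label(v) for v in values] for node, values in lfc.items()}


if __name__ == "__main__":
    cov = {"node_1": [3, 7, 15], "node_2": [7, 3, 3]}
    lfc = log_fold_changes(cov)
    print(lfc)
    print(label_changes(lfc))
```

Working per step rather than per series keeps the sketch aligned with the "successive samples" wording: a node can thrive between the first two samples and stay stable afterwards.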
Subject(s)
Bacterial Genome , Metagenome , Microbiota , Microbiota/genetics , Metagenomics/methods , Horizontal Gene Transfer , Bacteria/genetics , Algorithms
Abstract
The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, there are a relatively small number of single nucleotide polymorphisms (SNPs) that differentiate the main SARS-CoV-2 lineages that have spread throughout the world. In this study, we investigated 129 RNA-seq data sets and 6928 consensus genomes to contrast the intra-host and inter-host diversity of SARS-CoV-2. Our analyses yielded three major observations. First, the mutational profile of SARS-CoV-2 highlights intra-host single nucleotide variant (iSNV) and SNP similarity, albeit with differences in C > U changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to MERS-CoV than SARS-CoV-1. Third, a significant fraction of insertions and deletions contribute to the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.
Subject(s)
COVID-19/diagnosis , COVID-19/transmission , Genetic Variation , Viral Genome , Real-Time Polymerase Chain Reaction/methods , SARS-CoV-2/genetics , COVID-19/virology , Host-Pathogen Interactions , Humans , Single Nucleotide Polymorphism
Abstract
MOTIVATION: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, low-level knowledge of the bacteria driving microbial interactions within microbiomes remains limited, restricting our ability to fully decipher and control microbial communities. RESULTS: We present a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. In extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them to the disease samples, we can restore the gut microbiome in recurrent Clostridioides difficile infection (rCDI) patients to a healthy state. We also applied Bakdrive to two real datasets, rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions. AVAILABILITY AND IMPLEMENTATION: Bakdrive is open-source and available at: https://gitlab.com/treangenlab/bakdrive.
Subject(s)
Crohn Disease , Gastrointestinal Microbiome , Microbiota , Humans , Metagenome , Bacteria/genetics
Abstract
MOTIVATION: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap.
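The two ingredients named in this abstract, Jaccard similarity on k-mer sets and window-based sampling, can be illustrated with a naive sketch. It is not MashMap's algorithm: the function names are hypothetical, the hash (BLAKE2b truncated to 64 bits) is an arbitrary choice, ties and repeated k-mers are handled in the simplest possible way, and each window is re-sorted from scratch rather than maintained with a rolling minhash. Setting `s=1` gives minimizer-style sampling; `s>1` samples several k-mers per window, in the spirit of minmers.

```python
import hashlib


def kmers(seq, k):
    """All overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def kmer_hash(kmer):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sample_windows(seq, k, w, s=1):
    """Positions of the s smallest-hash k-mers in every window of w k-mers.

    Naive O(n * w log w) version written for clarity only.
    """
    hashed = [(kmer_hash(km), pos) for pos, km in enumerate(kmers(seq, k))]
    sampled = set()
    for start in range(len(hashed) - w + 1):
        window = sorted(hashed[start:start + w])
        sampled.update(pos for _, pos in window[:s])
    return sampled


if __name__ == "__main__":
    print(jaccard(kmers("ACGTACGT", 3), kmers("ACGTTCGT", 3)))
    seq = "ACGTTGCAAGGCTTACGA"
    print(sorted(sample_windows(seq, k=4, w=5, s=1)))
    print(sorted(sample_windows(seq, k=4, w=5, s=2)))
```

The first call compares the 3-mer sets {ACG, CGT, GTA, TAC} and {ACG, CGT, GTT, TTC, TCG}, which share two of seven distinct k-mers.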
Subject(s)
Computational Biology , Genomics
Abstract
To identify sequences with a role in microbial pathogenesis, we assessed the adequacy of their annotation by existing controlled vocabularies and sequence databases. Our goal was to regularize descriptions of microbial pathogenesis for improved integration with bioinformatic applications. Here, we review the challenges of annotating sequences for pathogenic activity. We relate the categorization of more than 2,750 sequences of pathogenic microbes through a controlled vocabulary called Functions of Sequences of Concern (FunSoCs). These allow for an ease of description by both humans and machines. We provide a subset of 220 fully annotated sequences in the supplemental material as examples. The use of this compact (~30 terms), controlled vocabulary has potential benefits for research in microbial genomics, public health, biosecurity, biosurveillance, and the characterization of new and emerging pathogens.
Subject(s)
Computational Biology , Controlled Vocabulary , Humans
Abstract
BACKGROUND: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epidemiology implicates airborne transmission; aerosol infectiousness and impacts of masks and variants on aerosol shedding are not well understood. METHODS: We recruited coronavirus disease 2019 (COVID-19) cases to give blood, saliva, mid-turbinate and fomite (phone) swabs, and 30-minute breath samples while vocalizing into a Gesundheit-II, with and without masks at up to 2 visits 2 days apart. We quantified and sequenced viral RNA, cultured virus, and assayed serum samples for anti-spike and anti-receptor binding domain antibodies. RESULTS: We enrolled 49 seronegative cases (mean days post onset 3.8 ± 2.1), May 2020 through April 2021. We detected SARS-CoV-2 RNA in 36% of fine (≤5 µm), 26% of coarse (>5 µm) aerosols, and 52% of fomite samples overall and in all samples from 4 alpha variant cases. Masks reduced viral RNA by 48% (95% confidence interval [CI], 3 to 72%) in fine and by 77% (95% CI, 51 to 89%) in coarse aerosols; cloth and surgical masks were not significantly different. The alpha variant was associated with a 43-fold (95% CI, 6.6- to 280-fold) increase in fine aerosol viral RNA, compared with earlier viruses, that remained a significant 18-fold (95% CI, 3.4- to 92-fold) increase adjusting for viral RNA in saliva, swabs, and other potential confounders. Two fine aerosol samples, collected while participants wore masks, were culture-positive. CONCLUSIONS: SARS-CoV-2 is evolving toward more efficient aerosol generation and loose-fitting masks provide significant but only modest source control. Therefore, until vaccination rates are very high, continued layered controls and tight-fitting masks and respirators will be necessary.
Subject(s)
COVID-19 , SARS-CoV-2 , COVID-19/prevention & control , Humans , Masks , Viral RNA , Respiratory Aerosols and Droplets
Abstract
As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, developing data analysis approaches that keep pace with the growth of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
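As an illustration of the sketching idea mentioned above, the following is a minimal bottom-s MinHash estimator of Jaccard similarity. It is a teaching sketch, not code from any reviewed tool: the function names, the 64-bit BLAKE2b hash, and the sketch sizes are assumptions, and production implementations work on canonical k-mers with much faster hash functions.

```python
import hashlib


def hash64(item):
    """Deterministic 64-bit hash of a string."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(items, size):
    """Keep only the `size` smallest distinct hash values of a set."""
    return sorted({hash64(x) for x in items})[:size]


def estimate_jaccard(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches.

    The `size` smallest hashes of the merged sketches are a sample of
    the union; the fraction present in both sketches estimates
    |A intersect B| / |A union B|.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = {f"kmer{i}" for i in range(0, 1000)}
    b = {f"kmer{i}" for i in range(500, 1500)}
    size = 200
    est = estimate_jaccard(bottom_sketch(a, size), bottom_sketch(b, size), size)
    print(f"estimated Jaccard: {est:.3f} (true value is 1/3)")
```

Each set is reduced to at most `size` integers regardless of how many items it holds, which is where the memory and time savings come from. The price is sampling error in the estimate, which shrinks as the sketch size grows.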
Subject(s)
Algorithms , Metagenomics/methods , Probability , Computer-Assisted Signal Processing , Humans , Metagenome/genetics
Abstract
Traumatic brain injury (TBI) causes neuroinflammation and neurodegeneration, both of which increase the risk and accelerate the progression of Alzheimer's disease (AD). The gut microbiome is an essential modulator of the immune system, impacting the brain. AD has been associated with reduced diversity and alterations in the community composition of the gut microbiota. This study aimed to determine whether the gut microbiota from AD mice exacerbates neurological deficits after TBI in control mice. We prepared fecal microbiota transplants from 18- to 24-month-old 3×Tg-AD (FMT-AD) and from healthy control (FMT-young) mice. FMTs were administered orally to young control C57BL/6 (wild-type, WT) mice after they underwent controlled cortical impact (CCI) injury, as a model of TBI. Then, we characterized the microbiota composition of the fecal samples by full-length 16S rRNA gene sequencing analysis. We collected the blood, brain, and gut tissues for protein and immunohistochemical analysis. Our results showed that FMT-AD administration stimulates a higher relative abundance of the genus Muribaculum and a decrease in Lactobacillus johnsonii compared to FMT-young in WT mice. Furthermore, WT mice exhibited larger lesions, increased activated microglia/macrophages, and reduced motor recovery after FMT-AD compared to FMT-young one day after TBI. In summary, we observed gut microbiota from AD mice to have a detrimental effect and aggravate the neuroinflammatory response and neurological outcomes after TBI in young WT mice.
Subject(s)
Alzheimer Disease , Traumatic Brain Injuries , Alzheimer Disease/pathology , Alzheimer Disease/therapy , Animals , Traumatic Brain Injuries/therapy , Fecal Microbiota Transplantation/methods , Mice , Inbred C57BL Mice , 16S Ribosomal RNA/genetics
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture, making their DNA sequence our only clue to their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
Subject(s)
Metagenome , Metagenomics/methods , Microbiota/genetics , Algorithms , Computational Biology , Genetic Databases/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Metagenomics/trends , Software
Abstract
Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
Subject(s)
Computational Biology/methods , Repetitive Nucleic Acid Sequences , Sequence Alignment/methods , DNA Sequence Analysis , RNA Sequence Analysis , Software , Algorithms , Animals , DNA/genetics , Genome/genetics , Humans , Molecular Sequence Data , Plants , RNA/genetics , Repetitive Nucleic Acid Sequences/genetics , Reproducibility of Results , DNA Sequence Analysis/methods , DNA Sequence Analysis/trends , RNA Sequence Analysis/methods , RNA Sequence Analysis/trends
Abstract
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
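The abstract refers to statistics on contiguity without naming them. One widely used contiguity statistic is N50, sketched below as an assumption about the kind of measure meant rather than a statement of which statistics the study reports. The function name and the convention of returning 0 for an empty assembly are choices made for this example.

```python
def n50(lengths):
    """N50 of a set of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    # Total is 20 bases; the two longest contigs (6 + 5 = 11) cover half.
    print(n50([2, 3, 4, 5, 6]))  # prints 5
```

A high N50 says nothing about whether the contigs are joined correctly, which is consistent with the abstract's point that correctness is not well correlated with contiguity.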
Subject(s)
Algorithms , Genomics/methods , DNA Sequence Analysis , Animals , Computational Biology/methods , Genome , Bacterial Genome/genetics , Humans , Internet , Reproducibility of Results
Subject(s)
Bacteria/isolation & purification , Fungi/isolation & purification , High-Throughput Nucleotide Sequencing/methods , Viruses/isolation & purification , Bacteria/genetics , Databases as Topic , Disease Outbreaks , Fungi/genetics , Humans , Molecular Diagnostic Techniques , Viruses/genetics
Abstract
BACKGROUND: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. RESULTS: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. CONCLUSIONS: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. 
Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.