ABSTRACT
Genomic islands are hotspots for horizontal gene transfer (HGT) in bacteria, but, for Prochlorococcus, an abundant marine cyanobacterium, how these islands form has puzzled scientists. With the discovery of tycheposons, a new family of transposons, Hackl et al. provide evidence for elegant new mechanisms of gene rearrangement and transfer among Prochlorococcus and bacteria more broadly.
Subjects
Bacteriophages, Cyanobacteria, Bacteriophages/genetics, Horizontal Gene Transfer/genetics, Cyanobacteria/genetics, Transfer RNA/genetics, Genomic Islands
ABSTRACT
Phytoplankton are limited by iron (Fe) in ~40% of the world's oceans including high-nutrient low-chlorophyll (HNLC) regions. While low-Fe adaptation has been well-studied in large eukaryotic diatoms, less is known for small, prokaryotic marine picocyanobacteria. This study reveals key physiological and genomic differences underlying Fe adaptation in marine picocyanobacteria. HNLC ecotype CRD1 strains have greater physiological tolerance to low Fe congruent with their expanded repertoire of Fe transporter, storage and regulatory genes compared to other ecotypes. From metagenomic analysis, genes encoding ferritin, flavodoxin, Fe transporters and siderophore uptake genes were more abundant in low-Fe waters, mirroring paradigms of low-Fe adaptation in diatoms. Distinct Fe-related gene repertories of HNLC ecotypes CRD1 and CRD2 also highlight how coexisting ecotypes have evolved independent approaches to life in low-Fe habitats. Synechococcus and Prochlorococcus HNLC ecotypes likewise exhibit independent, genome-wide reductions of predicted Fe-requiring genes. HNLC ecotype CRD1 interestingly was most similar to coastal ecotype I in Fe physiology and Fe-related gene content, suggesting populations from these different biomes experience similar Fe-selective conditions. This work supports an improved perspective that phytoplankton are shaped by more nuanced Fe niches in the oceans than previously implied from mostly binary comparisons of low- versus high-Fe habitats and populations.
Subjects
Bacterial Genome/genetics, Mosaicism, Prochlorococcus/genetics, Prochlorococcus/physiology, Synechococcus/genetics, Synechococcus/physiology, Acclimatization/genetics, Physiological Adaptation/genetics, Diatoms/genetics, Ecosystem, Ecotype, Iron/metabolism, Metagenomics, Oceans and Seas, Phytoplankton, Seawater/microbiology
ABSTRACT
Currently defined ecotypes in marine cyanobacteria Prochlorococcus and Synechococcus likely contain subpopulations that themselves are ecologically distinct. We developed and applied high-throughput sequencing for the 16S-23S rRNA internally transcribed spacer (ITS) to examine ecotype and fine-scale genotypic community dynamics for monthly surface water samples spanning 5 years at the San Pedro Ocean Time-series site. Ecotype-level structure displayed regular seasonal patterns including succession, consistent with strong forcing by seasonally varying abiotic parameters (e.g. temperature, nutrients, light). We identified tens to thousands of amplicon sequence variants (ASVs) within ecotypes, many of which exhibited distinct patterns over time, suggesting ecologically distinct populations within ecotypes. Community structure within some ecotypes exhibited regular, seasonal patterns, but not for others, indicating other more irregular processes such as phage interactions are important. Network analysis including T4-like phage genotypic data revealed distinct viral variants correlated with different groups of cyanobacterial ASVs including time-lagged predator-prey relationships. Variation partitioning analysis indicated that phage community structure more strongly explains cyanobacterial community structure at the ASV level than the abiotic environmental factors. These results support a hierarchical model whereby abiotic environmental factors more strongly shape niche partitioning at the broader ecotype level while phage interactions are more important in shaping community structure of fine-scale variants within ecotypes.
Subjects
Bacteriophages/physiology, Prochlorococcus/virology, Seawater/microbiology, Synechococcus/virology, Bacteriophages/genetics, Ecosystem, Ecotype, Phylogeny, Prochlorococcus/genetics, 16S Ribosomal RNA/genetics, 23S Ribosomal RNA/genetics, Synechococcus/genetics, Water Microbiology
ABSTRACT
Synechococcus, a genus of unicellular cyanobacteria found throughout the global surface ocean, is a large driver of Earth's carbon cycle. Developing a better understanding of its diversity and distributions is an ongoing effort in biological oceanography. Here, we introduce 12 new draft genomes of marine Synechococcus isolates spanning five clades and utilize ~100 environmental metagenomes largely sourced from the TARA Oceans project to assess the global distributions of the genomic lineages they and other reference genomes represent. We show that five newly provided clade-II isolates are by far the most representative of the recovered in situ populations (most 'abundant') and have biogeographic distributions distinct from previously available clade-II references. Additionally, these isolates form a subclade possessing the smallest genomes yet identified of the genus (2.14 ± 0.05Mbps; mean ± 1SD) while concurrently hosting some of the highest GC contents (60.67 ± 0.16%). This is in direct opposition to the pattern in Synechococcus's nearest relative, Prochlorococcus - wherein decreasing genome size has coincided with a strong decrease in GC content - suggesting this new subclade of Synechococcus appears to have convergently undergone genomic reduction relative to the rest of the genus, but along a fundamentally different evolutionary trajectory.
Subjects
Molecular Evolution, Bacterial Genome, Seawater/microbiology, Synechococcus/genetics, Base Composition, Genomics, Metagenome, Oceans and Seas, Phylogeny, Prochlorococcus/genetics, Synechococcus/classification, Synechococcus/isolation & purification, Synechococcus/metabolism
ABSTRACT
Viruses and their host genomes often share similar oligonucleotide frequency (ONF) patterns, which can be used to predict the host of a given virus by finding the host with the greatest ONF similarity. We comprehensively compared 11 ONF metrics using several k-mer lengths for predicting host taxonomy from among ~32 000 prokaryotic genomes for 1427 virus isolate genomes whose true hosts are known. The background-subtracting measure [Formula: see text] at k = 6 gave the highest host prediction accuracy (33%, genus level) with reasonable computational times. Requiring a maximum dissimilarity score for making predictions (thresholding) and taking the consensus of the 30 most similar hosts further improved accuracy. Using a previous dataset of 820 bacteriophage and 2699 bacterial genomes, [Formula: see text] host prediction accuracies with thresholding and consensus methods (genus-level: 64%) exceeded previous Euclidean distance ONF (32%) or homology-based (22-62%) methods. When applied to metagenomically-assembled marine SUP05 viruses and the human gut virus crAssphage, [Formula: see text]-based predictions overlapped (i.e. some same, some different) with the previously inferred hosts of these viruses. The extent of overlap improved when only using host genomes or metagenomic contigs from the same habitat or samples as the query viruses. The [Formula: see text] ONF method will greatly improve the characterization of novel, metagenomic viruses.
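As an illustration of ONF-based host prediction in general, the sketch below assigns a virus to the candidate host whose k-mer frequency vector is closest. It is a minimal example under stated assumptions: it uses plain Euclidean distance (the simpler baseline mentioned in the abstract), not the background-subtracting measure the study found most accurate, and the function names `kmer_freqs` and `predict_host` and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_freqs(seq, k=2):
    """Relative frequency of every DNA k-mer in seq (uppercase ACGT assumed)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_host(virus_seq, hosts, k=2):
    """Return the name of the host whose k-mer profile is nearest to the virus.

    hosts: dict mapping a host name to its genome sequence (toy data here).
    """
    v = kmer_freqs(virus_seq, k)
    return min(hosts, key=lambda name: euclidean(v, kmer_freqs(hosts[name], k)))


hosts = {
    "AT_rich_host": "ATATATATTTAAATAT",
    "GC_rich_host": "GCGCGGCCGCGCCGGC",
}
print(predict_host("ATTATAATAT", hosts))
```

Real genomes would call for a longer k (the abstract reports k = 6) and a thresholding or consensus step, which this toy version omits.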
Subjects
Bacteria/genetics, Bacteriophages/genetics, Metagenomics, Oligonucleotides/chemistry, Phylogeny, Bacteria/classification, Bacteria/virology, Bacteriophages/classification, Base Sequence, Gastrointestinal Tract/metabolism, Gastrointestinal Tract/virology, Bacterial Genome, Human Genome, Viral Genome, Humans, Oligonucleotides/genetics, Nucleic Acid Sequence Homology
ABSTRACT
BACKGROUND: The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations. METHODS: We investigate four different representations of word frequencies of viral sequences including the relative word frequency and three normalized word frequencies by subtracting the number of expected from the observed word counts. We also study five machine learning methods including logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses. Area under the receiver operating characteristic curve (AUC) is used to compare the performance of different machine learning methods and feature combinations. We then evaluate the performance of the best method for the identification of the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of true infectious viruses for a given host in viral tagging experiments. RESULTS: Based on nine bacterial host genera with at least 45 infectious viruses, we show that random forest together with the relative word frequency vector performs the best in identifying viruses infecting particular hosts.
For all the nine host genera, the AUC is over 0.85 and for five of them, the AUC is higher than 0.98 when the word size is 6, indicating the high accuracy of using machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1kbps in metagenomic studies with high accuracy. The random forest together with the word frequency vector outperforms currently available methods based on Manhattan and [Formula: see text] dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in a viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus. CONCLUSIONS: The random forest machine learning method together with the relative word frequencies as features of viruses can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of true infectious associated viruses in viral tagging experiments.
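Since the comparison above rests on AUC, a small worked example may help readers unfamiliar with the metric. The sketch computes AUC as the probability that a randomly chosen infecting virus receives a higher classifier score than a randomly chosen non-infecting one (ties count one half). The function name `auc` and the toy scores are hypothetical and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier outputs, higher meaning "more likely to infect the host".
    labels: 1 for viruses known to infect the host, 0 otherwise.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect separation
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # one of four pairs misordered
```

An AUC of 0.5 corresponds to random ranking, so the reported values above 0.85 indicate that most infecting/non-infecting pairs are ordered correctly.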
Subjects
Bacteria/virology, Viral DNA/isolation & purification, Viral Genome, Host-Pathogen Interactions, Support Vector Machine, Viruses/genetics, Bayes Theorem, Viral DNA/genetics, Likelihood Functions, Logistic Models, Metagenomics, Theoretical Models, ROC Curve, Reproducibility of Results, DNA Sequence Analysis, Viruses/metabolism
ABSTRACT
Marine Thaumarchaeota are abundant ammonia-oxidizers but have few representative laboratory-cultured strains. We report the cultivation of Candidatus Nitrosomarinus catalina SPOT01, a novel strain that is less warm-temperature tolerant than other cultivated Thaumarchaeota. Using metagenomic recruitment, strain SPOT01 comprises a major portion of Thaumarchaeota (4-54%) in temperate Pacific waters. Its complete 1.36 Mbp genome possesses several distinguishing features: putative phosphorothioation (PT) DNA modification genes; a region containing probable viral genes; and putative urea utilization genes. The PT modification genes and an adjacent putative restriction enzyme (RE) operon likely form a restriction modification (RM) system for defence from foreign DNA. PacBio sequencing showed >98% methylation at two motifs, and inferred PT guanine modification of 19% of possible TGCA sites. Metagenomic recruitment also reveals the putative virus region and PT modification and RE genes are present in 18-26%, 9-14% and <1.5% of natural populations at 150 m with ≥85% identity to strain SPOT01. The presence of multiple probable RM systems in a highly streamlined genome suggests a surprising importance for defence from foreign DNA for dilute populations that infrequently encounter viruses or other cells. This new strain provides new insights into the ecology, including viral interactions, of this important group of marine microbes.
Subjects
Archaea, Archaeal DNA/genetics, Archaeal Genome/genetics, Viruses/genetics, Aquatic Organisms/genetics, Archaea/classification, Archaea/genetics, Archaea/virology, Base Sequence, Metagenomics, 16S Ribosomal RNA/genetics, DNA Sequence Analysis
ABSTRACT
Many Proteobacteria possess LuxI-LuxR-type quorum-sensing systems that produce and detect fatty acyl-homoserine lactone (HSL) signals. The photoheterotroph Rhodopseudomonas palustris is unusual in that it produces and detects an aryl-HSL, p-coumaroyl-HSL, and signal production requires an exogenous source of p-coumarate. A photosynthetic stem-nodulating member of the genus Bradyrhizobium produces a small molecule signal that elicits an R. palustris quorum-sensing response. Here, we show that this signal is cinnamoyl-HSL and that cinnamoyl-HSL is produced by the LuxI homolog BraI and detected by BraR. Cinnamoyl-HSL reaches concentrations on the order of 50 nM in cultures of stem-nodulating bradyrhizobia grown in the presence or absence of cinnamate. Acyl-HSLs often reach concentrations of 0.1-30 µM in bacterial cultures, and generally, LuxR-type receptors respond to signals in a concentration range from 5 to a few hundred nanomolar. Our stem-nodulating Bradyrhizobium strain responds to picomolar concentrations of cinnamoyl-HSL and thus, produces cinnamoyl-HSL in excess of the levels required for a signal response without an exogenous source of cinnamate. The ability of Bradyrhizobium to produce and respond to cinnamoyl-HSL shows that aryl-HSL production is not unique to R. palustris, that the aromatic acid substrate for aryl-HSL synthesis does not have to be supplied exogenously, and that some acyl-HSL quorum-sensing systems may function at very low signal production and response levels.
Subjects
Bacterial Proteins/metabolism, Bradyrhizobium/metabolism, Lactones/pharmacology, Quorum Sensing/physiology, Rhodopseudomonas/metabolism, Bradyrhizobium/cytology, Quorum Sensing/drug effects, Rhodopseudomonas/cytology
ABSTRACT
Nucleocytoplasmic Large DNA Viruses (NCLDVs, also called giant viruses) are widespread in marine systems and infect a broad range of microbial eukaryotes (protists). Recent biogeographic work has provided global snapshots of NCLDV diversity and community composition across the world's oceans, yet little information exists about the guiding 'rules' underpinning their community dynamics over time. We leveraged a five-year monthly metagenomic time-series to quantify the community composition of NCLDVs off the coast of Southern California and characterize these populations' temporal dynamics. NCLDVs were dominated by Algavirales (Phycodnaviruses, 59%) and Imitervirales (Mimiviruses, 36%). We identified clusters of NCLDVs with distinct classes of seasonal and non-seasonal temporal dynamics. Overall, NCLDV population abundances were often highly dynamic with a strong seasonal signal. The Imitervirales group had highest relative abundance in the more oligotrophic late summer and fall, while Algavirales did so in winter. Generally, closely related strains had similar temporal dynamics, suggesting that evolutionary history is a key driver of the temporal niche of marine NCLDVs. However, a few closely-related strains had drastically different seasonal dynamics, suggesting that while phylogenetic proximity often indicates ecological similarity, occasionally phenology can shift rapidly, possibly due to host-switching. Finally, we identified distinct functional content and possible host interactions of two major NCLDV orders-including connections of Imitervirales with primary producers like the diatom Chaetoceros and widespread marine grazers like Paraphysomonas and Spirotrichea ciliates. Together, our results reveal key insights on season-specific effect of phylogenetically distinct giant virus communities on marine protist metabolism, biogeochemical fluxes and carbon cycling.
ABSTRACT
Cyanophages exert important top-down controls on their cyanobacteria hosts; however, concurrent analysis of both phage and host populations is needed to better assess phage-host interaction models. We analyzed picocyanobacteria Prochlorococcus and Synechococcus and T4-like cyanophage communities in Pacific Ocean surface waters using five years of monthly viral and cellular fraction metagenomes. Cyanophage communities contained thousands of mostly low-abundance (<2% relative abundance) species with varying temporal dynamics, categorized as seasonally recurring or non-seasonal and occurring persistently, occasionally, or sporadically (detected in ≥85%, 15-85%, or <15% of samples, respectively). Viromes contained mostly seasonal and persistent phages (~40% each), while cellular fraction metagenomes had mostly sporadic species (~50%), reflecting that these sample sets capture different steps of the infection cycle-virions from prior infections or within currently infected cells, respectively. Two groups of seasonal phages correlated to Synechococcus or Prochlorococcus were abundant in spring/summer or fall/winter, respectively. Cyanophages likely have a strong influence on the host community structure, as their communities explained up to 32% of host community variation. These results support how both seasonally recurrent and apparent stochastic processes, likely determined by host availability and different host-range strategies among phages, are critical to phage-host interactions and dynamics, consistent with both the Kill-the-Winner and the Bank models.
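The occurrence categories defined above (persistent, occasional, sporadic) can be expressed as a simple rule on the fraction of samples in which a phage is detected. This is a minimal sketch, not the authors' pipeline: the function name `occurrence_class` is hypothetical, and assigning exactly 15% to "occasional" is an assumption, since the abstract does not state how that boundary is handled.

```python
def occurrence_class(detected):
    """Classify a phage by the fraction of samples in which it was detected.

    detected: sequence of booleans, one per monthly sample.
    Thresholds follow the abstract: >=85% persistent, 15-85% occasional,
    <15% sporadic. Treating exactly 15% as "occasional" is an assumption.
    """
    if not detected:
        raise ValueError("need at least one sample")
    frac = sum(detected) / len(detected)
    if frac >= 0.85:
        return "persistent"
    if frac >= 0.15:
        return "occasional"
    return "sporadic"


# Toy presence/absence series over 60 monthly samples (five years).
print(occurrence_class([True] * 60))                 # persistent
print(occurrence_class([True] * 30 + [False] * 30))  # occasional
print(occurrence_class([True] * 3 + [False] * 57))   # sporadic
```

Seasonality is a separate axis in the study and would need a periodicity test on the abundance series, which this presence/absence rule does not address.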
Subjects
Bacteriophages, Synechococcus, Bacteriophages/genetics, Host Specificity, Metagenome, Pacific Ocean, Seasons
ABSTRACT
People of different racial/ethnic backgrounds, demographics, health, and socioeconomic characteristics have experienced disproportionate rates of infection and death due to COVID-19. This study tests if and how county-level rates of infection and death have changed in relation to societal county characteristics through time as the pandemic progressed. This longitudinal study sampled monthly county-level COVID-19 case and death data per 100,000 residents from April 2020 to March 2022, and studied the relationships of these variables with racial/ethnic, demographic, health, and socioeconomic characteristics for 3125 or 97.0% of U.S. counties, accounting for 96.4% of the U.S. population. The association of all county-level characteristics with COVID-19 case and death rates changed significantly through time, and showed different patterns. For example, counties with higher population proportions of Black, Native American, foreign-born non-citizen, elderly residents, households in poverty, or higher income inequality suffered disproportionately higher COVID-19 case and death rates at the beginning of the pandemic, followed by reversed, attenuated or fluctuating patterns, depending on the variable. Patterns for counties with higher White versus Black population proportions showed somewhat inverse patterns. Counties with higher female population proportions initially had lower case rates but higher death rates, and case and death rates become more coupled and fluctuated later in the pandemic. Counties with higher population densities had fluctuating case and death rates, with peaks coinciding with new variants of COVID-19. Counties with a greater proportion of university-educated residents had lower case and death rates throughout the pandemic, although the strength of this relationship fluctuated through time. This research clearly shows that how different segments of society are affected by a pandemic changes through time. 
Therefore, targeted policies and interventions that change as a pandemic unfolds are necessary to mitigate its disproportionate effects on vulnerable populations, particularly during the first six months of a pandemic.
ABSTRACT
Marine Synechococcus spp. are unicellular cyanobacteria widely distributed in the world's oceans. We report the complete genome sequence of Synechococcus sp. strain NB0720_010, isolated from Narragansett Bay, Rhode Island. NB0720_010 has several large (>3,000-amino acid) protein-coding genes that may be important in its interactions with other cells, including grazers in estuarine habitats.
ABSTRACT
The marine unicellular cyanobacterium Prochlorococcus is the smallest-known oxygen-evolving autotroph. It numerically dominates the phytoplankton in the tropical and subtropical oceans, and is responsible for a significant fraction of global photosynthesis. Here we compare the genomes of two Prochlorococcus strains that span the largest evolutionary distance within the Prochlorococcus lineage and that have different minimum, maximum and optimal light intensities for growth. The high-light-adapted ecotype has the smallest genome (1,657,990 base pairs, 1,716 genes) of any known oxygenic phototroph, whereas the genome of its low-light-adapted counterpart is significantly larger, at 2,410,873 base pairs (2,275 genes). The comparative architectures of these two strains reveal dynamic genomes that are constantly changing in response to myriad selection pressures. Although the two strains have 1,350 genes in common, a significant number are not shared, and these have been differentially retained from the common ancestor, or acquired through duplication or lateral transfer. Some of these genes have obvious roles in determining the relative fitness of the ecotypes in response to key environmental variables, and hence in regulating their distribution and abundance in the oceans.
Subjects
Biological Evolution, Cyanobacteria/classification, Cyanobacteria/genetics, Environment, Bacterial Genome, Physiological Adaptation/radiation effects, Cyanobacteria/radiation effects, Bacterial Genes/genetics, Light, Molecular Sequence Data, Oceans and Seas, Phylogeny
ABSTRACT
Viruses that infect microorganisms dominate marine microbial communities numerically, with impacts ranging from host evolution to global biogeochemical cycles1,2. However, virus community dynamics, necessary for conceptual and mechanistic model development, remains difficult to assess. Here, we describe the long-term stability of a viral community by analysing the metagenomes of near-surface 0.02-0.2 µm samples from the San Pedro Ocean Time-series3 that were sampled monthly over 5 years. Of 19,907 assembled viral contigs (>5 kb, mean 15 kb), 97% were found in each sample (by >98% ID metagenomic read recruitment) to have relative abundances that ranged over seven orders of magnitude, with limited temporal reordering of rank abundances along with little change in richness. Seasonal variations in viral community composition were superimposed on the overall stability; maximum community similarity occurred at 12-month intervals. Despite the stability of viral genotypic clusters that had 98% sequence identity, viral sequences showed transient variations in single-nucleotide polymorphisms (SNPs) and constant turnover of minor population variants, each rising and falling over a few months, reminiscent of Red Queen dynamics4. The rise and fall of variants within populations, interpreted through the perspective of known virus-host interactions5, is consistent with the hypothesis that fluctuating selection acts on a microdiverse cloud of strains, and this succession is associated with ever-shifting virus-host defences and counterdefences. This results in long-term virus-host coexistence that is facilitated by perpetually changing minor variants.
Subjects
Aquatic Organisms/virology, Seawater/virology, Viruses/genetics, Water Microbiology, Aquatic Organisms/classification, Aquatic Organisms/genetics, Viral DNA/genetics, Ecosystem, Viral Genome, Host Microbial Interactions/genetics, Metagenome, Microbiota, Pacific Ocean, Single Nucleotide Polymorphism, Species Specificity
ABSTRACT
BACKGROUND: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data. METHODS: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning. RESULTS: Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC. CONCLUSIONS: Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
ABSTRACT
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences; however, it remains challenging to identify the host(s) of these new viruses. We developed VirHostMatcher-Net, a flexible, network-based, Markov random field framework for predicting virus-prokaryote interactions using multiple, integrated features: CRISPR sequences and alignment-free similarity measures ([Formula: see text] and WIsH). Evaluation of this method on a benchmark set of 1462 known virus-prokaryote pairs yielded host prediction accuracy of 59% and 86% at the genus and phylum levels, representing 16-27% and 6-10% improvement, respectively, over previous single-feature prediction approaches. We applied our host prediction tool to crAssphage, a human gut phage, and two metagenomic virus datasets: marine viruses and viral contigs recovered from globally distributed, diverse habitats. Host predictions were frequently consistent with those of previous studies, but more importantly, this new tool made many more confident predictions than previous tools, up to nearly 3-fold more (n > 27 000), greatly expanding the diversity of known virus-host interactions.
ABSTRACT
Synechococcus bacteria are unicellular cyanobacteria that contribute significantly to global marine primary production. We report the nearly complete genome sequence of Synechococcus sp. strain MIT S9220, which lacks the nitrate utilization genes present in most marine Synechococcus genomes. Assembly also produced the complete genome sequence of a cyanophage present in the MIT S9220 culture.
ABSTRACT
Much of the diversity of prokaryotic viruses has yet to be described. In particular, there are no viral isolates that infect abundant, globally significant marine archaea including the phylum Thaumarchaeota. This phylum oxidizes ammonia, fixes inorganic carbon, and thus contributes to globally significant nitrogen and carbon cycles in the oceans. Metagenomics provides an alternative to culture-dependent means for identifying and characterizing viral diversity. Some viruses carry auxiliary metabolic genes (AMGs) that are acquired via horizontal gene transfer from their host(s), allowing inference of what host a virus infects. Here we present the discovery of 15 new genomically and ecologically distinct Thaumarchaeota virus populations, identified as contigs that encode viral capsid and thaumarchaeal ammonia monooxygenase genes (amoC). These viruses exhibit depth and latitude partitioning and are distributed globally in various marine habitats including pelagic waters, estuarine habitats, and hydrothermal plume water and sediments. We found evidence of viral amoC expression and that viral amoC AMGs sometimes comprise up to half of total amoC DNA copies in cellular fraction metagenomes, highlighting the potential impact of these viruses on N cycling in the oceans. Phylogenetics suggest they are potentially tailed viruses and share a common ancestor with related marine Euryarchaeota viruses. This work significantly expands our view of viruses of globally important marine Thaumarchaeota.
Subjects
Archaea/virology, Metagenome, Oxidoreductases/genetics, Viruses/genetics, Ammonia/metabolism, Carbon Cycle, Horizontal Gene Transfer, Marine Biology, Metagenomics, Nitrification, Nitrogen Cycle, Oceans and Seas, Phylogeny, Viral Proteins/genetics, Viruses/enzymology, Viruses/isolation & purification
ABSTRACT
BACKGROUND: Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes. METHODS: Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes. RESULTS: Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. 
We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥10 kb by 10 to 100-fold for low input metagenomes. CONCLUSIONS: PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.
ABSTRACT
BACKGROUND: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses. METHODS: We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder's performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014. RESULTS: VirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively), thus VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder's potential advantage in identifying novel viral sequences. 
Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients reveals higher viral diversity in healthy individuals than cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a putative Veillonella genus prophage associated with cirrhosis patients. CONCLUSIONS: This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.
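The comparison above reports true positive rates at matched false positive rates. The sketch below shows one way such a matched comparison could be computed from classifier scores: find the most permissive score threshold whose false positive rate on host contigs stays within a cap, then report the true positive rate on viral contigs at that threshold. It is an assumption-laden illustration, not the evaluation code of either tool; `tpr_at_fpr` and the toy scores are hypothetical.

```python
def tpr_at_fpr(viral_scores, host_scores, max_fpr):
    """True positive rate at the most permissive threshold with FPR <= max_fpr.

    viral_scores: scores for contigs known to be viral (positives).
    host_scores: scores for contigs known to be host (negatives).
    A contig is called viral when its score is >= the threshold.
    """
    # Try thresholds from strictest to most permissive; FPR and TPR can only
    # grow as the threshold drops, so stop once the FPR cap is exceeded.
    thresholds = sorted(set(viral_scores) | set(host_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in host_scores) / len(host_scores)
        if fpr > max_fpr:
            break
        best_tpr = sum(s >= t for s in viral_scores) / len(viral_scores)
    return best_tpr


viral = [0.9, 0.8, 0.6, 0.2]
host = [0.7, 0.3, 0.1, 0.05]
print(tpr_at_fpr(viral, host, 0.0))   # no host contig may be miscalled
print(tpr_at_fpr(viral, host, 0.25))  # one of four host contigs may be miscalled
```

Fixing the false positive rate first makes fold differences in true positive rate comparable between tools, which is the form of comparison the abstract reports.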