ABSTRACT
Maximal growth rate is a basic parameter of microbial lifestyle that varies over several orders of magnitude, with doubling times ranging from a matter of minutes to multiple days. Growth rates are typically measured using laboratory culture experiments. Yet, we lack sufficient understanding of the physiology of most microbes to design appropriate culture conditions for them, severely limiting our ability to assess the global diversity of microbial growth rates. Genomic estimators of maximal growth rate provide a practical solution to survey the distribution of microbial growth potential, regardless of cultivation status. We developed an improved maximal growth rate estimator and predicted maximal growth rates from over 200,000 genomes, metagenome-assembled genomes, and single-cell amplified genomes to survey growth potential across the range of prokaryotic diversity; extensions allow estimates from 16S rRNA sequences alone as well as weighted community estimates from metagenomes. We compared the growth rates of cultivated and uncultivated organisms to illustrate how culture collections are strongly biased toward organisms capable of rapid growth. Finally, we found that organisms naturally group into two growth classes and observed a bias in growth predictions for extremely slow-growing organisms. These observations ultimately led us to suggest evolutionary definitions of oligotrophy and copiotrophy based on the selective regime an organism occupies. We found that these growth classes are associated with distinct selective regimes and genomic functional potentials.
Subjects
Codon Usage , Metagenome , Metagenomics , Microbiological Phenomena/genetics , Single-Cell Analysis , Databases, Genetic , Evolution, Molecular , Metagenomics/methods , Prokaryotic Cells/physiology , Single-Cell Analysis/methods
ABSTRACT
Marine Group I (MGI) Thaumarchaeota were originally described as chemoautotrophic nitrifiers, but molecular and isotopic evidence suggests heterotrophic and/or mixotrophic capabilities. Here, we investigated the quantity and composition of organic matter assimilated by individual, uncultured MGI cells from the Pacific Ocean to constrain their potential for mixotrophy and heterotrophy. We observed that most MGI cells did not assimilate carbon from any organic substrate provided (glucose, pyruvate, oxaloacetate, protein, urea, and amino acids). The minority of MGI cells that did assimilate it did so exclusively from nitrogenous substrates (urea, 15% of MGI and amino acids, 36% of MGI), and only as an auxiliary carbon source (<20% of that subset's total cellular carbon was derived from those substrates). At the population level, MGI assimilation of organic carbon comprised just 0.5%-11% of total biomass carbon. We observed extensive assimilation of inorganic carbon and urea- and amino acid-derived nitrogen (equal to that from ammonium), consistent with metagenomic and metatranscriptomic analyses performed here and previously showing a widespread potential for MGI to perform autotrophy and transport and degrade organic nitrogen. Our results constrain the quantity and composition of organic matter used by MGI and suggest they use it primarily to meet nitrogen demands for anabolism and nitrification.
Subjects
Archaea , Carbon , Archaea/metabolism , Carbon/metabolism , Amino Acids/metabolism , Urea/metabolism , Nitrogen/metabolism
ABSTRACT
MOTIVATION: Phage-host associations play important roles in microbial communities. In natural communities, however, phages are discovered and characterized metagenomically rather than through culture-based lab studies, so their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but metagenomics-based studies rarely yield whole genomes and must instead rely on contigs that are sometimes only hundreds of bp long. Therefore, we need programs that predict the hosts of phage contigs from these short sequences. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage-host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH. RESULTS: On the validation set, ContigNet achieves 72-85% area under the receiver operating characteristic curve (AUROC) scores, compared to a maximum of 68% for VHM or WIsH, for contigs of lengths between 200 bp and 50 kbp. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve 60-70% AUROC scores, compared to 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. AVAILABILITY AND IMPLEMENTATION: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.
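For readers less familiar with the AUROC metric quoted above, the following is a minimal sketch of how it can be computed from match scores. It uses the standard pairwise (Mann-Whitney) definition and is not taken from the ContigNet code; the function name `auroc` and the example scores and labels are hypothetical.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six phage-host contig pairs (1 = true match).
example_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = auroc(example_scores, example_labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the 52-85% figures above should be read.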
Subjects
Bacteriophages , Bacteria/genetics , Bacteriophages/genetics , Metagenome , Metagenomics , Neural Networks, Computer
ABSTRACT
Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth's microbial diversity.
Subjects
Biodiversity , Earth, Planet , Microbiota/genetics , Animals , Archaea/genetics , Archaea/isolation & purification , Bacteria/genetics , Bacteria/isolation & purification , Ecology/methods , Gene Dosage , Geographic Mapping , Humans , Plants/microbiology , RNA, Ribosomal, 16S/analysis , RNA, Ribosomal, 16S/genetics
ABSTRACT
In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for follow-up analyses in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers choosing which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, simulated datasets may not reflect the real complexities of metagenomic samples, and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate read assemblers for metagenomic studies. In this paper, we benchmark metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected four advanced metagenome assemblers, MEGAHIT, MetaSPAdes, IDBA-UD and Faucet, for evaluation. We showed the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy for different variables, including the genetic difference between the real genomes and the genome sequences in the real metagenomic datasets, and the sequencing depth of the simulated datasets. Overall, MetaSPAdes performs best in terms of integrity and contiguity at the species level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of the worst integrity and contiguity, especially at low sequencing depth. MEGAHIT has the highest genome fractions at the strain level and MetaSPAdes has the overall best performance at the strain level. MEGAHIT is the most efficient in our experiments. Availability: The source code is available at https://github.com/ziyewang/MetaAssemblyEval.
Subjects
Computational Biology/methods , Metagenomics , Algorithms , Datasets as Topic , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics
ABSTRACT
Universal primers for SSU rRNA genes allow profiling of natural communities by simultaneously amplifying templates from Bacteria, Archaea, and Eukaryota in a single PCR reaction. Despite the potential to show relative abundance for all rRNA genes, universal primers are rarely used, due to various concerns including amplicon length variation and its effect on bioinformatic pipelines. We thus developed 16S and 18S rRNA mock communities and a bioinformatic pipeline to validate this approach. Using these mocks, we show that universal primers (515Y/926R) outperformed eukaryote-specific V4 primers in observed versus expected abundance correlations (slope = 0.88 vs. 0.67-0.79), and mock community members with single mismatches to the primer were strongly underestimated (threefold to eightfold). Using field samples, both primers yielded similar 18S beta-diversity patterns (Mantel test, p < 0.001) but differences in relative proportions of many rarer taxa. To test for length biases, we mixed mock communities (16S + 18S) before PCR and found a twofold underestimation of 18S sequences due to sequencing bias. Correcting for the twofold underestimation, we estimate that, in Southern California field samples (1.2-80 µm), there were averages of 35% 18S, 28% chloroplast 16S, and 37% prokaryote 16S rRNA genes. These data demonstrate the potential for universal primers to generate comprehensive microbiome profiles.
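The twofold correction described above amounts to scaling the 18S counts and renormalizing. A minimal sketch of that arithmetic follows, assuming a single multiplicative factor per gene category; the function name `correct_length_bias` and the read counts are hypothetical (the counts were chosen so that the corrected shares match the averages reported above, and they are not the study's data).

```python
from typing import Dict


def correct_length_bias(counts: Dict[str, float],
                        factors: Dict[str, float]) -> Dict[str, float]:
    """Scale each category by its correction factor, then renormalize to shares."""
    adjusted = {name: n * factors.get(name, 1.0) for name, n in counts.items()}
    total = sum(adjusted.values())
    if total <= 0:
        raise ValueError("total corrected count must be positive")
    return {name: n / total for name, n in adjusted.items()}


# Hypothetical observed read counts before correction.
observed = {"18S": 175.0, "chloroplast 16S": 280.0, "prokaryote 16S": 370.0}
# Twofold underestimation of 18S: multiply the 18S count by 2.
shares = correct_length_bias(observed, {"18S": 2.0})
```

With these illustrative counts the uncorrected 18S share would be about 21%, which rises to 35% after correction, so the correction changes the apparent community composition substantially.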
Subjects
High-Throughput Nucleotide Sequencing , Bias , Polymerase Chain Reaction , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 18S/genetics , Sequence Analysis, DNA
ABSTRACT
Clustered regularly interspaced short palindromic repeat (CRISPR)-Cas adaptive immune systems enable bacteria and archaea to efficiently respond to viral pathogens by creating a genomic record of previous encounters. These systems are broadly distributed across prokaryotic taxa, yet are surprisingly absent in a majority of organisms, suggesting that the benefits of adaptive immunity frequently do not outweigh the costs. Here, combining experiments and models, we show that a delayed immune response that allows viruses to transiently redirect cellular resources to reproduction, which we call 'immune lag', is extremely costly during viral outbreaks, even to completely immune hosts. Critically, the costs of lag are only revealed by examining the early, transient dynamics of a host-virus system occurring immediately after viral challenge. Lag is a basic parameter of microbial defence, relevant to all intracellular, post-infection antiviral defence systems, that has to date been largely ignored by theoretical and experimental treatments of host-phage systems.
Subjects
Bacteriophages , Viruses , Archaea , Bacteria/genetics , CRISPR-Cas Systems , Disease Outbreaks
ABSTRACT
Western boundary currents (WBCs) redistribute heat and oligotrophic seawater from the tropics to temperate latitudes, with several displaying substantial climate change-driven intensification over the last century. Strengthening WBCs have been implicated in the poleward range expansion of marine macroflora and fauna; however, the impacts on the structure and function of temperate microbial communities are largely unknown. Here we show that the major subtropical WBC of the South Pacific Ocean, the East Australian Current (EAC), transports microbial assemblages that maintain tropical and oligotrophic (k-strategist) signatures and seasonally displace more copiotrophic (r-strategist) temperate microbial populations within temperate latitudes of the Tasman Sea. We identified specific characteristics of EAC microbial assemblages compared with non-EAC assemblages, including strain transitions within the SAR11 clade, enrichment of Prochlorococcus, predicted smaller genome sizes and shifts in the importance of several functional genes, including those associated with cyanobacterial photosynthesis, secondary metabolism and fatty acid and lipid transport. At a temperate time-series site in the Tasman Sea, we observed significant reductions in standing stocks of total carbon and chlorophyll a, and a shift towards smaller phytoplankton and carnivorous copepods, associated with the seasonal impact of the EAC microbial assemblage. In light of the substantial shifts in microbial assemblage structure and function associated with the EAC, we conclude that climate-driven expansions of WBCs will expand the range of tropical oligotrophic microbes, and potentially profoundly impact the trophic status of temperate waters.
Subjects
Prochlorococcus , Seawater , Australia , Chlorophyll A , Pacific Ocean
ABSTRACT
Currently defined ecotypes in the marine cyanobacteria Prochlorococcus and Synechococcus likely contain subpopulations that are themselves ecologically distinct. We developed and applied high-throughput sequencing of the 16S-23S rRNA internal transcribed spacer (ITS) to examine ecotype and fine-scale genotypic community dynamics for monthly surface water samples spanning 5 years at the San Pedro Ocean Time-series site. Ecotype-level structure displayed regular seasonal patterns including succession, consistent with strong forcing by seasonally varying abiotic parameters (e.g. temperature, nutrients, light). We identified tens to thousands of amplicon sequence variants (ASVs) within ecotypes, many of which exhibited distinct patterns over time, suggesting ecologically distinct populations within ecotypes. Community structure within some ecotypes exhibited regular, seasonal patterns, but not within others, indicating that other, more irregular processes such as phage interactions are important. Network analysis including T4-like phage genotypic data revealed distinct viral variants correlated with different groups of cyanobacterial ASVs, including time-lagged predator-prey relationships. Variation partitioning analysis indicated that phage community structure explains cyanobacterial community structure at the ASV level more strongly than abiotic environmental factors do. These results support a hierarchical model whereby abiotic environmental factors more strongly shape niche partitioning at the broader ecotype level, while phage interactions are more important in shaping community structure of fine-scale variants within ecotypes.
Subjects
Bacteriophages/physiology , Prochlorococcus/virology , Seawater/microbiology , Synechococcus/virology , Bacteriophages/genetics , Ecosystem , Ecotype , Phylogeny , Prochlorococcus/genetics , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 23S/genetics , Synechococcus/genetics , Water Microbiology
ABSTRACT
High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel the molecular heterogeneity of biological processes. Nevertheless, how to integrate data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this work, we propose a scalable and tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weighs important features more highly than less important ones in accord with implicitly existing auxiliary knowledge. We demonstrate the effectiveness of Hetero-RP in diverse clustering and classification applications. More importantly, Hetero-RP offers an interpretation of feature importance, shedding light on the driving forces of the underlying biology. In metagenomic contig binning, Hetero-RP automatically weighs abundance and composition profiles according to the varying number of samples, resulting in markedly improved performance of contig binning. In RNA-binding protein (RBP) binding site prediction, Hetero-RP not only improves the prediction performance measured by the area under the receiver operating characteristic curve (AUC), but also uncovers evidence supported by independent studies, including the distribution of the binding sites of IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron-exon boundary of U2AF2 [availability: https://github.com/younglululu/Hetero-RP].
Subjects
Computational Biology/methods , Contig Mapping/methods , Genomics/methods , Heterogeneous-Nuclear Ribonucleoprotein Group C/genetics , RNA-Binding Proteins/genetics , Splicing Factor U2AF/genetics , Algorithms , Binding Sites/genetics , Heterogeneous-Nuclear Ribonucleoprotein Group C/metabolism , High-Throughput Nucleotide Sequencing/methods , Humans , RNA-Binding Proteins/metabolism , ROC Curve , Splicing Factor U2AF/metabolism
ABSTRACT
Viruses and their host genomes often share similar oligonucleotide frequency (ONF) patterns, which can be used to predict the host of a given virus by finding the host with the greatest ONF similarity. We comprehensively compared 11 ONF metrics using several k-mer lengths for predicting host taxonomy from among ~32 000 prokaryotic genomes for 1427 virus isolate genomes whose true hosts are known. The background-subtracting measure [Formula: see text] at k = 6 gave the highest host prediction accuracy (33%, genus level) with reasonable computational times. Requiring a maximum dissimilarity score for making predictions (thresholding) and taking the consensus of the 30 most similar hosts further improved accuracy. Using a previous dataset of 820 bacteriophage and 2699 bacterial genomes, [Formula: see text] host prediction accuracies with thresholding and consensus methods (genus level: 64%) exceeded those of previous Euclidean distance ONF (32%) or homology-based (22-62%) methods. When applied to metagenomically assembled marine SUP05 viruses and the human gut virus crAssphage, [Formula: see text]-based predictions overlapped (i.e. some same, some different) with the previously inferred hosts of these viruses. The extent of overlap improved when only using host genomes or metagenomic contigs from the same habitat or samples as the query viruses. The [Formula: see text] ONF method will greatly improve the characterization of novel, metagenomic viruses.
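The prediction workflow described above (rank hosts by ONF dissimilarity, apply a threshold, take a consensus of the top hits) can be sketched as follows. This is an illustrative outline only: plain Manhattan distance between k-mer frequency vectors stands in for the background-subtracting measure, whose formula is not reproduced here, and the function names, the `toy_hosts` table and all sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Optional, Tuple


def kmer_freqs(seq: str, k: int) -> List[float]:
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def manhattan(a: List[float], b: List[float]) -> float:
    """Stand-in dissimilarity; NOT the measure used in the study."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_host(virus_seq: str, hosts: Dict[str, Tuple[str, str]], k: int = 6,
                 top_n: int = 30,
                 max_dissimilarity: Optional[float] = None) -> Optional[str]:
    """hosts maps a genome id to (genus, sequence). Returns a genus or None."""
    v = kmer_freqs(virus_seq, k)
    ranked = sorted((manhattan(v, kmer_freqs(seq, k)), genus)
                    for genus, seq in hosts.values())
    # Thresholding: decline to predict when even the best host is too dissimilar.
    if max_dissimilarity is not None and ranked[0][0] > max_dissimilarity:
        return None
    # Consensus: majority genus among the top_n most similar hosts.
    votes = Counter(genus for _, genus in ranked[:top_n])
    return votes.most_common(1)[0][0]


# Hypothetical toy genomes; real inputs would be full genomes or contigs.
toy_hosts = {
    "h1": ("Alpha", "ATATATATATATATAT"),
    "h2": ("Alpha", "ATATTATAATATATTA"),
    "h3": ("Beta", "GCGCGCGCGCGCGCGC"),
    "h4": ("Beta", "GGCCGCGCCGGCGCGC"),
}
toy_prediction = predict_host("ATATATTATATAATAT", toy_hosts, k=2, top_n=3)
```

The defaults `k=6` and `top_n=30` mirror the settings reported above; the toy call uses smaller values only because the example sequences are very short.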
Subjects
Bacteria/genetics , Bacteriophages/genetics , Metagenomics , Oligonucleotides/chemistry , Phylogeny , Bacteria/classification , Bacteria/virology , Bacteriophages/classification , Base Sequence , Gastrointestinal Tract/metabolism , Gastrointestinal Tract/virology , Genome, Bacterial , Genome, Human , Genome, Viral , Humans , Oligonucleotides/genetics , Sequence Homology, Nucleic Acid
ABSTRACT
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.
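As a rough illustration of the PHYLIP output mentioned above, the sketch below formats a square pairwise dissimilarity matrix in the classic PHYLIP distance-matrix layout (taxon count on the first line, then one row per taxon with a 10-character name field). It is not CAFE's code, and the exact layout CAFE emits is not given here; the function name `to_phylip`, the genome names and the distances are all hypothetical.

```python
from typing import Sequence


def to_phylip(names: Sequence[str], matrix: Sequence[Sequence[float]]) -> str:
    """Format a square distance matrix as PHYLIP-style text."""
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [str(n)]
    for name, row in zip(names, matrix):
        # Classic PHYLIP truncates or pads taxon names to 10 characters.
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"


# Hypothetical dissimilarities between three genomes.
toy_names = ["genomeA", "genomeB", "genomeC"]
toy_matrix = [
    [0.00, 0.12, 0.40],
    [0.12, 0.00, 0.35],
    [0.40, 0.35, 0.00],
]
toy_phylip = to_phylip(toy_names, toy_matrix)
```

Text in this layout can typically be passed to distance-based tree-building programs, which is the usual reason for choosing PHYLIP as an interchange format.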
Subjects
Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Animals , Genome, Microbial , Internet , Metagenomics , Primates/genetics , Sequence Alignment , Vertebrates/genetics
ABSTRACT
Aquatic environments contain large communities of microorganisms whose synergistic interactions mediate the cycling of major and trace nutrients, including vitamins. B-vitamins are essential coenzymes that many organisms cannot synthesize. Thus, their exchange among de novo synthesizers and auxotrophs is expected to play an important role in microbial consortia and to explain some of the temporal and spatial changes observed in diversity. In this study, we analyzed metatranscriptomes of a natural marine microbial community, with diel sampling conducted quarterly over one year, to identify the potential major B-vitamin synthesizers and consumers. Transcriptomic data showed that the best-represented taxa dominated the expression of synthesis genes for some B-vitamins but lacked transcripts for others. For instance, Rhodobacterales dominated the expression of vitamin-B12 synthesis, but not of vitamin-B7, whose synthesis transcripts were mainly represented by Flavobacteria. In contrast, bacterial groups that constituted less than 4% of the community (e.g., Verrucomicrobia) accounted for most of the vitamin-B1 synthesis transcripts. Furthermore, ambient vitamin-B1 concentrations were higher in samples collected during the day, and were positively correlated with chlorophyll-a concentrations. Our analysis supports the hypothesis that the mosaic of metabolic interdependencies created by B-vitamin synthesis and exchange is a key factor shaping microbial communities in nature.
Subjects
Bacteria/metabolism , Microbial Consortia , Vitamin B Complex/metabolism , Alphaproteobacteria/genetics , Alphaproteobacteria/metabolism , Bacteria/genetics , Coenzymes/biosynthesis , Coenzymes/metabolism , Flavobacteriaceae/genetics , Flavobacteriaceae/metabolism , Transcriptome , Vitamin B Complex/biosynthesis
ABSTRACT
Motivation: The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, also known as contigs, instead of an entire genome, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples. Results: The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is using L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering by sparsity regularization. In addition, the COCACOLA framework seamlessly incorporates customized knowledge to improve binning accuracy. In our study, we have investigated two types of additional knowledge, the co-alignment to reference genomes and linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT. Availability and implementation: The software is available at https://github.com/younglululu/COCACOLA. Contact: fsun@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Subjects
Genome, Bacterial , Metagenomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Cluster Analysis , High-Throughput Nucleotide Sequencing , Humans , Microbiota/genetics
ABSTRACT
Analysis of seasonal patterns of marine bacterial community structure along horizontal and vertical spatial scales can help to predict long-term responses to climate change. Several recent studies have shown predictable seasonal reoccurrence of bacterial assemblages. However, only a few have assessed temporal variability over both horizontal and vertical spatial scales. Here, we simultaneously studied the bacterial community structure at two different locations and depths in shelf waters of a coastal upwelling system during an annual cycle. The most noticeable biogeographic patterns observed were seasonality, horizontal homogeneity, and spatial synchrony in bacterial diversity and community structure related to regional upwelling-downwelling dynamics. Water column mixing eventually disrupted the vertical heterogeneity of bacterial community structure. Our results are consistent with previous temporal studies of marine bacterioplankton in other temperate regions and also suggest a marked influence of regional factors on the bacterial communities inhabiting this coastal upwelling system. Bacterial-mediated carbon fluxes in this productive region appear to be mainly controlled by community structure dynamics in surface waters, and by local environmental factors at the base of the euphotic zone.
Subjects
Bacterial Physiological Phenomena , Climate Change , Phytoplankton/physiology , Water Movements , Atlantic Ocean , Microbiota , Seasons , Spain
ABSTRACT
BACKGROUND: The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations. METHODS: We investigate four different representations of word frequencies of viral sequences, including the relative word frequency and three normalized word frequencies obtained by subtracting the expected from the observed word counts. We also study five machine learning methods, including logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes, for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses. Area under the receiver operating characteristic curve (AUC) is used to compare the performance of different combinations of machine learning methods and features. We then evaluate the performance of the best method for the identification of the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of true infectious viruses for a given host in viral tagging experiments. RESULTS: Based on nine bacterial host genera with at least 45 infectious viruses, we show that random forest together with the relative word frequency vector performs the best in identifying viruses infecting particular hosts. For all nine host genera, the AUC is over 0.85, and for five of them the AUC is higher than 0.98 when the word size is 6, indicating the high accuracy of machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1 kbp in metagenomic studies with high accuracy. The random forest together with the word frequency vector outperforms currently available methods based on Manhattan and [Formula: see text] dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in a viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus. CONCLUSIONS: The random forest machine learning method together with the relative word frequencies as features of viruses can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of true infectious viruses in viral tagging experiments.
Subjects
Bacteria/virology , DNA, Viral/isolation & purification , Genome, Viral , Host-Pathogen Interactions , Support Vector Machine , Viruses/genetics , Bayes Theorem , DNA, Viral/genetics , Likelihood Functions , Logistic Models , Metagenomics , Models, Theoretical , ROC Curve , Reproducibility of Results , Sequence Analysis, DNA , Viruses/metabolism
ABSTRACT
Marine Thaumarchaeota are abundant ammonia-oxidizers but have few representative laboratory-cultured strains. We report the cultivation of Candidatus Nitrosomarinus catalina SPOT01, a novel strain that is less warm-temperature tolerant than other cultivated Thaumarchaeota. Metagenomic recruitment shows that strain SPOT01 comprises a major portion of Thaumarchaeota (4-54%) in temperate Pacific waters. Its complete 1.36 Mbp genome possesses several distinguishing features: putative phosphorothioation (PT) DNA modification genes; a region containing probable viral genes; and putative urea utilization genes. The PT modification genes and an adjacent putative restriction enzyme (RE) operon likely form a restriction modification (RM) system for defence from foreign DNA. PacBio sequencing showed >98% methylation at two motifs, and inferred PT guanine modification of 19% of possible TGCA sites. Metagenomic recruitment also reveals that the putative virus region, PT modification genes and RE genes are present in 18-26%, 9-14% and <1.5%, respectively, of natural populations at 150 m with ≥85% identity to strain SPOT01. The presence of multiple probable RM systems in a highly streamlined genome suggests a surprising importance of defence from foreign DNA for dilute populations that infrequently encounter viruses or other cells. This new strain provides new insights into the ecology, including viral interactions, of this important group of marine microbes.