Results 1 - 20 of 109
1.
Proc Natl Acad Sci U S A ; 118(12), 2021 03 23.
Article in English | MEDLINE | ID: mdl-33723043

ABSTRACT

Maximal growth rate is a basic parameter of microbial lifestyle that varies over several orders of magnitude, with doubling times ranging from a matter of minutes to multiple days. Growth rates are typically measured using laboratory culture experiments. Yet, we lack sufficient understanding of the physiology of most microbes to design appropriate culture conditions for them, severely limiting our ability to assess the global diversity of microbial growth rates. Genomic estimators of maximal growth rate provide a practical solution to survey the distribution of microbial growth potential, regardless of cultivation status. We developed an improved maximal growth rate estimator and predicted maximal growth rates from over 200,000 genomes, metagenome-assembled genomes, and single-cell amplified genomes to survey growth potential across the range of prokaryotic diversity; extensions allow estimates from 16S rRNA sequences alone as well as weighted community estimates from metagenomes. We compared the growth rates of cultivated and uncultivated organisms to illustrate how culture collections are strongly biased toward organisms capable of rapid growth. Finally, we found that organisms naturally group into two growth classes and observed a bias in growth predictions for extremely slow-growing organisms. These observations ultimately led us to suggest evolutionary definitions of oligotrophy and copiotrophy based on the selective regime an organism occupies. We found that these growth classes are associated with distinct selective regimes and genomic functional potentials.


Subject(s)
Codon Usage , Metagenome , Metagenomics , Microbiological Phenomena/genetics , Single-Cell Analysis , Databases, Genetic , Evolution, Molecular , Metagenomics/methods , Prokaryotic Cells/physiology , Single-Cell Analysis/methods
2.
Environ Microbiol ; 25(3): 689-704, 2023 03.
Article in English | MEDLINE | ID: mdl-36478085

ABSTRACT

Marine Group I (MGI) Thaumarchaeota were originally described as chemoautotrophic nitrifiers, but molecular and isotopic evidence suggests heterotrophic and/or mixotrophic capabilities. Here, we investigated the quantity and composition of organic matter assimilated by individual, uncultured MGI cells from the Pacific Ocean to constrain their potential for mixotrophy and heterotrophy. We observed that most MGI cells did not assimilate carbon from any organic substrate provided (glucose, pyruvate, oxaloacetate, protein, urea, and amino acids). The minority of MGI cells that did assimilate it did so exclusively from nitrogenous substrates (urea, 15% of MGI and amino acids, 36% of MGI), and only as an auxiliary carbon source (<20% of that subset's total cellular carbon was derived from those substrates). At the population level, MGI assimilation of organic carbon comprised just 0.5%-11% of total biomass carbon. We observed extensive assimilation of inorganic carbon and urea- and amino acid-derived nitrogen (equal to that from ammonium), consistent with metagenomic and metatranscriptomic analyses performed here and previously showing a widespread potential for MGI to perform autotrophy and transport and degrade organic nitrogen. Our results constrain the quantity and composition of organic matter used by MGI and suggest they use it primarily to meet nitrogen demands for anabolism and nitrification.


Subject(s)
Archaea , Carbon , Archaea/metabolism , Carbon/metabolism , Amino Acids/metabolism , Urea/metabolism , Nitrogen/metabolism
3.
Bioinformatics ; 38(Suppl 1): i45-i52, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758806

ABSTRACT

MOTIVATION: Phage-host associations play important roles in microbial communities. In natural communities, however, where phages are discovered and characterized metagenomically rather than through culture-based lab studies, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but metagenomics-based studies rarely yield whole genomes and must instead rely on contigs that are sometimes only hundreds of bp long. Therefore, we need programs that predict the hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage-host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH. RESULTS: On the validation set, ContigNet achieves 72-85% area under the receiver operating characteristic curve (AUROC) scores, compared to a maximum of 68% by VHM or WIsH, for contigs of lengths between 200 bp and 50 kbp. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve 60-70% AUROC scores, compared to 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. AVAILABILITY AND IMPLEMENTATION: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.


Subject(s)
Bacteriophages , Bacteria/genetics , Bacteriophages/genetics , Metagenome , Metagenomics , Neural Networks, Computer
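
Note on the AUROC metric reported in the abstract above: the score can be read as the probability that a randomly chosen true phage-host pair receives a higher prediction score than a randomly chosen non-pair. The following is a minimal sketch of that definition only, not ContigNet's evaluation code; the function name `auroc` and the example scores are hypothetical.

```python
def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical prediction scores, not data from the study.
true_pairs = [0.9, 0.8, 0.4]   # scores given to real phage-host pairs
non_pairs = [0.7, 0.3, 0.2]    # scores given to mismatched pairs
print(auroc(true_pairs, non_pairs))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 52-85% range quoted above.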
4.
Nature ; 551(7681): 457-463, 2017 11 23.
Article in English | MEDLINE | ID: mdl-29088705

ABSTRACT

Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth's microbial diversity.


Subject(s)
Biodiversity , Earth, Planet , Microbiota/genetics , Animals , Archaea/genetics , Archaea/isolation & purification , Bacteria/genetics , Bacteria/isolation & purification , Ecology/methods , Gene Dosage , Geographic Mapping , Humans , Plants/microbiology , RNA, Ribosomal, 16S/analysis , RNA, Ribosomal, 16S/genetics
5.
Brief Bioinform ; 21(3): 777-790, 2020 05 21.
Article in English | MEDLINE | ID: mdl-30860572

ABSTRACT

In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for the follow-up studies in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers to choose which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, the simulated datasets may not reflect the real complexities of metagenomic samples and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate the various read assemblers for metagenomic studies. In this paper, we benchmark the metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected four advanced metagenome assemblers, MEGAHIT, MetaSPAdes, IDBA-UD and Faucet, for evaluation. We showed the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy for different variables, including the genetic difference of the real genomes with the genome sequences in the real metagenomic datasets and the sequencing depth of the simulated datasets. Overall, MetaSPAdes performs best in terms of integrity and contiguity at the species level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of the worst integrity and contiguity, especially at low sequencing depth. MEGAHIT has the highest genome fractions at the strain level and MetaSPAdes has the overall best performance at the strain level. MEGAHIT is the most efficient in our experiments. Availability: The source code is available at https://github.com/ziyewang/MetaAssemblyEval.


Subject(s)
Computational Biology/methods , Metagenomics , Algorithms , Datasets as Topic , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics
6.
Environ Microbiol ; 23(6): 3240-3250, 2021 06.
Article in English | MEDLINE | ID: mdl-33938123

ABSTRACT

Universal primers for SSU rRNA genes allow profiling of natural communities by simultaneously amplifying templates from Bacteria, Archaea, and Eukaryota in a single PCR reaction. Despite the potential to show relative abundance for all rRNA genes, universal primers are rarely used, due to various concerns including amplicon length variation and its effect on bioinformatic pipelines. We thus developed 16S and 18S rRNA mock communities and a bioinformatic pipeline to validate this approach. Using these mocks, we show that universal primers (515Y/926R) outperformed eukaryote-specific V4 primers in observed versus expected abundance correlations (slope = 0.88 vs. 0.67-0.79), and mock community members with single mismatches to the primer were strongly underestimated (threefold to eightfold). Using field samples, both primers yielded similar 18S beta-diversity patterns (Mantel test, p < 0.001) but differences in relative proportions of many rarer taxa. To test for length biases, we mixed mock communities (16S + 18S) before PCR and found a twofold underestimation of 18S sequences due to sequencing bias. Correcting for the twofold underestimation, we estimate that, in Southern California field samples (1.2-80 µm), there were averages of 35% 18S, 28% chloroplast 16S, and 37% prokaryote 16S rRNA genes. These data demonstrate the potential for universal primers to generate comprehensive microbiome profiles.


Subject(s)
High-Throughput Nucleotide Sequencing , Bias , Polymerase Chain Reaction , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 18S/genetics , Sequence Analysis, DNA
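
Note on the twofold correction described in the abstract above: correcting a known underestimation of one category amounts to scaling that category's count and renormalizing all proportions. The sketch below illustrates only this arithmetic; the function name, the category labels, and the read counts are hypothetical and are not taken from the study.

```python
def correct_underestimation(counts, biased_key, factor):
    """Multiply one category's count by `factor`, then renormalize to proportions."""
    adjusted = dict(counts)
    adjusted[biased_key] = adjusted[biased_key] * factor
    total = sum(adjusted.values())
    return {key: value / total for key, value in adjusted.items()}


# Hypothetical read counts, not data from the study.
observed = {"18S": 100, "chloroplast_16S": 150, "prokaryote_16S": 250}
corrected = correct_underestimation(observed, "18S", 2.0)
for name, share in corrected.items():
    print(f"{name}: {share:.1%}")
```

Because the proportions are renormalized, raising the 18S share necessarily lowers the shares of both 16S categories.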
7.
Proc Biol Sci ; 288(1961): 20211555, 2021 10 27.
Article in English | MEDLINE | ID: mdl-34666523

ABSTRACT

Clustered regularly interspaced short palindromic repeat (CRISPR)-Cas adaptive immune systems enable bacteria and archaea to efficiently respond to viral pathogens by creating a genomic record of previous encounters. These systems are broadly distributed across prokaryotic taxa, yet are surprisingly absent in a majority of organisms, suggesting that the benefits of adaptive immunity frequently do not outweigh the costs. Here, combining experiments and models, we show that a delayed immune response which allows viruses to transiently redirect cellular resources to reproduction, which we call 'immune lag', is extremely costly during viral outbreaks, even to completely immune hosts. Critically, the costs of lag are only revealed by examining the early, transient dynamics of a host-virus system occurring immediately after viral challenge. Lag is a basic parameter of microbial defence, relevant to all intracellular, post-infection antiviral defence systems, that has to-date been largely ignored by theoretical and experimental treatments of host-phage systems.


Subject(s)
Bacteriophages , Viruses , Archaea , Bacteria/genetics , CRISPR-Cas Systems , Disease Outbreaks
8.
Glob Chang Biol ; 26(10): 5613-5629, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32715608

ABSTRACT

Western boundary currents (WBCs) redistribute heat and oligotrophic seawater from the tropics to temperate latitudes, with several displaying substantial climate change-driven intensification over the last century. Strengthening WBCs have been implicated in the poleward range expansion of marine macroflora and fauna; however, the impacts on the structure and function of temperate microbial communities are largely unknown. Here we show that the major subtropical WBC of the South Pacific Ocean, the East Australian Current (EAC), transports microbial assemblages that maintain tropical and oligotrophic (k-strategist) signatures, to seasonally displace more copiotrophic (r-strategist) temperate microbial populations within temperate latitudes of the Tasman Sea. We identified specific characteristics of EAC microbial assemblages compared with non-EAC assemblages, including strain transitions within the SAR11 clade, enrichment of Prochlorococcus, predicted smaller genome sizes and shifts in the importance of several functional genes, including those associated with cyanobacterial photosynthesis, secondary metabolism and fatty acid and lipid transport. At a temperate time-series site in the Tasman Sea, we observed significant reductions in standing stocks of total carbon and chlorophyll a, and a shift towards smaller phytoplankton and carnivorous copepods, associated with the seasonal impact of the EAC microbial assemblage. In light of the substantial shifts in microbial assemblage structure and function associated with the EAC, we conclude that climate-driven expansions of WBCs will expand the range of tropical oligotrophic microbes, and potentially profoundly impact the trophic status of temperate waters.


Subject(s)
Prochlorococcus , Seawater , Australia , Chlorophyll A , Pacific Ocean
9.
Environ Microbiol ; 21(8): 2948-2963, 2019 08.
Article in English | MEDLINE | ID: mdl-31106939

ABSTRACT

Currently defined ecotypes in marine cyanobacteria Prochlorococcus and Synechococcus likely contain subpopulations that themselves are ecologically distinct. We developed and applied high-throughput sequencing for the 16S-23S rRNA internally transcribed spacer (ITS) to examine ecotype and fine-scale genotypic community dynamics for monthly surface water samples spanning 5 years at the San Pedro Ocean Time-series site. Ecotype-level structure displayed regular seasonal patterns including succession, consistent with strong forcing by seasonally varying abiotic parameters (e.g. temperature, nutrients, light). We identified tens to thousands of amplicon sequence variants (ASVs) within ecotypes, many of which exhibited distinct patterns over time, suggesting ecologically distinct populations within ecotypes. Community structure within some ecotypes exhibited regular, seasonal patterns, but not for others, indicating other more irregular processes such as phage interactions are important. Network analysis including T4-like phage genotypic data revealed distinct viral variants correlated with different groups of cyanobacterial ASVs including time-lagged predator-prey relationships. Variation partitioning analysis indicated that phage community structure more strongly explains cyanobacterial community structure at the ASV level than the abiotic environmental factors. These results support a hierarchical model whereby abiotic environmental factors more strongly shape niche partitioning at the broader ecotype level while phage interactions are more important in shaping community structure of fine-scale variants within ecotypes.


Subject(s)
Bacteriophages/physiology , Prochlorococcus/virology , Seawater/microbiology , Synechococcus/virology , Bacteriophages/genetics , Ecosystem , Ecotype , Phylogeny , Prochlorococcus/genetics , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 23S/genetics , Synechococcus/genetics , Water Microbiology
10.
Nature ; 554(7690): 38-39, 2018 Feb.
Article in English | MEDLINE | ID: mdl-32094830
11.
Nature ; 554(7690): 38-39, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29388959
12.
Nucleic Acids Res ; 45(20): e169, 2017 Nov 16.
Article in English | MEDLINE | ID: mdl-28977511

ABSTRACT

High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel molecular heterogeneity of biological processes. Nevertheless, how to jointly explore data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this work, we propose a scalable and tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weighs important features more highly than less important ones in accord with implicitly existing auxiliary knowledge. Finally, we demonstrate effectiveness of Hetero-RP in diverse clustering and classification applications. More importantly, Hetero-RP offers an interpretation of feature importance, shedding light on the driving forces of the underlying biology. In metagenomic contig binning, Hetero-RP automatically weighs abundance and composition profiles according to the varying number of samples, resulting in markedly improved performance of contig binning. In RNA-binding protein (RBP) binding site prediction, Hetero-RP not only improves the prediction performance measured by the area under the receiver operating characteristic curves (AUC), but also uncovers the evidence supported by independent studies, including the distribution of the binding sites of IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron-exon boundary of U2AF2 [availability: https://github.com/younglululu/Hetero-RP].


Subject(s)
Computational Biology/methods , Contig Mapping/methods , Genomics/methods , Heterogeneous-Nuclear Ribonucleoprotein Group C/genetics , RNA-Binding Proteins/genetics , Splicing Factor U2AF/genetics , Algorithms , Binding Sites/genetics , Heterogeneous-Nuclear Ribonucleoprotein Group C/metabolism , High-Throughput Nucleotide Sequencing/methods , Humans , RNA-Binding Proteins/metabolism , ROC Curve , Splicing Factor U2AF/metabolism
13.
Nucleic Acids Res ; 45(W1): W554-W559, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28472388

ABSTRACT

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.


Subject(s)
Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Animals , Genome, Microbial , Internet , Metagenomics , Primates/genetics , Sequence Alignment , Vertebrates/genetics
14.
Nucleic Acids Res ; 45(1): 39-53, 2017 01 09.
Article in English | MEDLINE | ID: mdl-27899557

ABSTRACT

Viruses and their host genomes often share similar oligonucleotide frequency (ONF) patterns, which can be used to predict the host of a given virus by finding the host with the greatest ONF similarity. We comprehensively compared 11 ONF metrics using several k-mer lengths for predicting host taxonomy from among ∼32 000 prokaryotic genomes for 1427 virus isolate genomes whose true hosts are known. The background-subtracting measure [Formula: see text] at k = 6 gave the highest host prediction accuracy (33%, genus level) with reasonable computational times. Requiring a maximum dissimilarity score for making predictions (thresholding) and taking the consensus of the 30 most similar hosts further improved accuracy. Using a previous dataset of 820 bacteriophage and 2699 bacterial genomes, [Formula: see text] host prediction accuracies with thresholding and consensus methods (genus-level: 64%) exceeded previous Euclidean distance ONF (32%) or homology-based (22-62%) methods. When applied to metagenomically assembled marine SUP05 viruses and the human gut virus crAssphage, [Formula: see text]-based predictions overlapped (i.e. some same, some different) with the previously inferred hosts of these viruses. The extent of overlap improved when only using host genomes or metagenomic contigs from the same habitat or samples as the query viruses. The [Formula: see text] ONF method will greatly improve the characterization of novel, metagenomic viruses.


Subject(s)
Bacteria/genetics , Bacteriophages/genetics , Metagenomics , Oligonucleotides/chemistry , Phylogeny , Bacteria/classification , Bacteria/virology , Bacteriophages/classification , Base Sequence , Gastrointestinal Tract/metabolism , Gastrointestinal Tract/virology , Genome, Bacterial , Genome, Human , Genome, Viral , Humans , Oligonucleotides/genetics , Sequence Homology, Nucleic Acid
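
Note on the ONF approach described in the abstract above: the core idea is to turn each genome into a vector of k-mer frequencies and to assign a virus to the host whose vector is least dissimilar. The sketch below illustrates that idea under simplifying assumptions: it uses plain Manhattan distance rather than the background-subtracting measure evaluated in the study, k = 2 instead of k = 6, and toy sequences. All function names and sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k):
    """Relative frequency of each DNA k-mer in seq, in a fixed alphabetical order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing letters other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def predict_host(virus_seq, host_seqs, k=2):
    """Return the name of the host whose k-mer profile is closest to the virus."""
    virus_profile = kmer_freqs(virus_seq, k)
    return min(
        host_seqs,
        key=lambda name: manhattan(virus_profile, kmer_freqs(host_seqs[name], k)),
    )


# Toy sequences for illustration only.
hosts = {
    "host_AT_rich": "ATATATTTAAATATATAT",
    "host_GC_rich": "GCGCGGCCGCGCGGGCCC",
}
print(predict_host("ATTATAATATTA", hosts))
```

The thresholding and consensus steps mentioned in the abstract would sit on top of this: a prediction is withheld when the smallest dissimilarity exceeds a cutoff, and the final call is taken from the most common taxon among the closest hosts.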
15.
Environ Microbiol ; 20(8): 2809-2823, 2018 08.
Article in English | MEDLINE | ID: mdl-29659156

ABSTRACT

Aquatic environments contain large communities of microorganisms whose synergistic interactions mediate the cycling of major and trace nutrients, including vitamins. B-vitamins are essential coenzymes that many organisms cannot synthesize. Thus, their exchange among de novo synthesizers and auxotrophs is expected to play an important role in the microbial consortia and explain some of the temporal and spatial changes observed in diversity. In this study, we analyzed metatranscriptomes of a natural marine microbial community, diel sampled quarterly over one year to try to identify the potential major B-vitamin synthesizers and consumers. Transcriptomic data showed that the best-represented taxa dominated the expression of synthesis genes for some B-vitamins but lacked transcripts for others. For instance, Rhodobacterales dominated the expression of vitamin-B12 synthesis, but not of vitamin-B7, whose synthesis transcripts were mainly represented by Flavobacteria. In contrast, bacterial groups that constituted less than 4% of the community (e.g., Verrucomicrobia) accounted for most of the vitamin-B1 synthesis transcripts. Furthermore, ambient vitamin-B1 concentrations were higher in samples collected during the day, and were positively correlated with chlorophyll-a concentrations. Our analysis supports the hypothesis that the mosaic of metabolic interdependencies through B-vitamin synthesis and exchange are key processes that contribute to shaping microbial communities in nature.


Subject(s)
Bacteria/metabolism , Microbial Consortia , Vitamin B Complex/metabolism , Alphaproteobacteria/genetics , Alphaproteobacteria/metabolism , Bacteria/genetics , Coenzymes/biosynthesis , Coenzymes/metabolism , Flavobacteriaceae/genetics , Flavobacteriaceae/metabolism , Transcriptome , Vitamin B Complex/biosynthesis
17.
Bioinformatics ; 33(6): 791-798, 2017 03 15.
Article in English | MEDLINE | ID: mdl-27256312

ABSTRACT

Motivation: The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, also known as contigs, instead of an entire genome, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples. Results: The effectiveness of COCACOLA is demonstrated in both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is using L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering by sparsity regularization. In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. In our study, we have investigated two types of additional knowledge, the co-alignment to reference genomes and linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT. Availability and implementation: The software is available at https://github.com/younglululu/COCACOLA. Contact: fsun@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome, Bacterial , Metagenomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Cluster Analysis , High-Throughput Nucleotide Sequencing , Humans , Microbiota/genetics
18.
Microb Ecol ; 76(4): 866-884, 2018 Nov.
Article in English | MEDLINE | ID: mdl-29675703

ABSTRACT

Analysis of seasonal patterns of marine bacterial community structure along horizontal and vertical spatial scales can help to predict long-term responses to climate change. Several recent studies have shown predictable seasonal reoccurrence of bacterial assemblages. However, only a few have assessed temporal variability over both horizontal and vertical spatial scales. Here, we simultaneously studied the bacterial community structure at two different locations and depths in shelf waters of a coastal upwelling system during an annual cycle. The most noticeable biogeographic patterns observed were seasonality, horizontal homogeneity, and spatial synchrony in bacterial diversity and community structure related to regional upwelling-downwelling dynamics. Water column mixing eventually disrupted the vertical heterogeneity of bacterial community structure. Our results are consistent with previous temporal studies of marine bacterioplankton in other temperate regions and also suggest a marked influence of regional factors on the bacterial communities inhabiting this coastal upwelling system. Bacterial-mediated carbon fluxes in this productive region appear to be mainly controlled by community structure dynamics in surface waters, and local environmental factors at the base of the euphotic zone.


Subject(s)
Bacterial Physiological Phenomena , Climate Change , Phytoplankton/physiology , Water Movements , Atlantic Ocean , Microbiota , Seasons , Spain
19.
BMC Bioinformatics ; 18(Suppl 3): 60, 2017 Mar 14.
Article in English | MEDLINE | ID: mdl-28361670

ABSTRACT

BACKGROUND: The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations. METHODS: We investigate four different representations of word frequencies of viral sequences including the relative word frequency and three normalized word frequencies obtained by subtracting the expected from the observed word counts. We also study five machine learning methods including logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses. Area under the receiver operating characteristic curve (AUC) is used to compare the performance of different machine learning methods and feature combinations. We then evaluate the performance of the best method for the identification of the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of true infectious viruses for a given host in viral tagging experiments. RESULTS: Based on nine bacterial host genera with at least 45 infectious viruses, we show that random forest together with the relative word frequency vector performs the best in identifying viruses infecting particular hosts. For all nine host genera, the AUC is over 0.85, and for five of them the AUC is higher than 0.98 when the word size is 6, indicating the high accuracy of using machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1 kbp in metagenomic studies with high accuracy. The random forest together with the word frequency vector outperforms currently available methods based on Manhattan and [Formula: see text] dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in a viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus. CONCLUSIONS: The random forest machine learning method together with the relative word frequencies as features of viruses can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of true infectious associated viruses in viral tagging experiments.


Subject(s)
Bacteria/virology , DNA, Viral/isolation & purification , Genome, Viral , Host-Pathogen Interactions , Support Vector Machine , Viruses/genetics , Bayes Theorem , DNA, Viral/genetics , Likelihood Functions , Logistic Models , Metagenomics , Models, Theoretical , ROC Curve , Reproducibility of Results , Sequence Analysis, DNA , Viruses/metabolism
20.
Environ Microbiol ; 19(6): 2434-2452, 2017 06.
Article in English | MEDLINE | ID: mdl-28418097

ABSTRACT

Marine Thaumarchaeota are abundant ammonia-oxidizers but have few representative laboratory-cultured strains. We report the cultivation of Candidatus Nitrosomarinus catalina SPOT01, a novel strain that is less warm-temperature tolerant than other cultivated Thaumarchaeota. Metagenomic recruitment shows that strain SPOT01 comprises a major portion of Thaumarchaeota (4-54%) in temperate Pacific waters. Its complete 1.36 Mbp genome possesses several distinguishing features: putative phosphorothioation (PT) DNA modification genes; a region containing probable viral genes; and putative urea utilization genes. The PT modification genes and an adjacent putative restriction enzyme (RE) operon likely form a restriction modification (RM) system for defence from foreign DNA. PacBio sequencing showed >98% methylation at two motifs, and inferred PT guanine modification of 19% of possible TGCA sites. Metagenomic recruitment also reveals that the putative virus region, PT modification genes and RE genes are present in 18-26%, 9-14% and <1.5% of natural populations at 150 m, respectively, with ≥85% identity to strain SPOT01. The presence of multiple probable RM systems in a highly streamlined genome suggests a surprising importance for defence from foreign DNA for dilute populations that infrequently encounter viruses or other cells. This new strain provides new insights into the ecology, including viral interactions, of this important group of marine microbes.


Subject(s)
Archaea , DNA, Archaeal/genetics , Genome, Archaeal/genetics , Viruses/genetics , Aquatic Organisms/genetics , Archaea/classification , Archaea/genetics , Archaea/virology , Base Sequence , Metagenomics , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA