ABSTRACT
BACKGROUND: Although high-throughput marker gene studies provide valuable insight into the diversity and relative abundance of taxa in microbial communities, they do not directly measure the functional capacity of those communities. There is growing interest in predicting the functional profiles of microbial communities from phylogenetic identifications inferred from marker genes, and tools have recently been developed to link the two. However, no large-scale study to date has quantified the correlation between marker-gene-based taxonomic identity and protein coding gene conservation. Here we use 4872 representative prokaryotic genomes from NCBI to investigate the relationship between marker gene identity and shared protein coding gene content.

RESULTS: Even at 99-100% marker gene identity, genomes share on average less than 75% of their protein coding gene content. This holds regardless of the marker gene(s) used: the V4 region of the 16S rRNA, the complete 16S rRNA, or single-copy orthologs analyzed through multi-locus sequence analysis. A related factor is the variation among 16S copies within a single genome. Although the majority of 16S copies had high sequence similarity (> 99%), several genomes contained copies that were highly diverged (< 97% identity).

CONCLUSIONS: This is the largest comparison between marker gene similarity and shared protein coding gene content to date. The study highlights the limitations of inferring a microbial community's functions from marker gene phylogeny. The data presented expand upon the results of previous studies that examined one or a few bacterial species, and they support the hypothesis that 16S rRNA and other marker genes cannot be used directly to fully predict the functional potential of a bacterial community.
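The two quantities compared in this abstract, marker gene identity and shared protein coding gene content, can be illustrated with a minimal Python sketch. This is not the study's pipeline: the function names are hypothetical, the marker sequences are assumed to be already aligned, and shared gene content is assumed here to be the number of gene families common to both genomes divided by the average gene family count of the pair. The abstract does not state which definition the authors used.

```python
from __future__ import annotations


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned marker gene sequences.

    Assumption: columns with a gap ('-') in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def percent_shared_genes(genes_a: set[str], genes_b: set[str]) -> float:
    """Percent of protein coding gene families shared by two genomes.

    Assumption: the denominator is the average number of gene families
    in the two genomes; other normalizations are equally plausible.
    """
    if not genes_a or not genes_b:
        raise ValueError("both genomes need at least one gene family")
    shared = len(genes_a & genes_b)
    average_size = (len(genes_a) + len(genes_b)) / 2
    return 100.0 * shared / average_size


# Toy data only: two genomes with identical marker sequences
# but partly different gene family content.
marker_1 = "ACGTACGTAC"
marker_2 = "ACGTACGTAC"
genome_1 = {"famA", "famB", "famC", "famD"}
genome_2 = {"famA", "famB", "famC", "famE"}

identity = percent_identity(marker_1, marker_2)
shared = percent_shared_genes(genome_1, genome_2)
print(f"marker identity: {identity:.1f}%, shared gene content: {shared:.1f}%")
```

With these toy inputs the marker sequences are identical while one of the four gene families differs between the genomes, which mirrors in miniature the kind of discrepancy the results describe.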