ABSTRACT
The Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org) provides a phylogenetically consistent and rank-normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database. GTDB R06-RS202 spans 254 090 bacterial and 4316 archaeal genomes, a 270% increase since the introduction of the GTDB in November 2017. These genomes are organized into 45 555 bacterial and 2339 archaeal species clusters, which is a 200% increase since the integration of species clusters into the GTDB in June 2019. Here, we explore prokaryotic diversity from the perspective of the GTDB and highlight the importance of metagenome-assembled genomes in expanding available genomic representation. We also discuss improvements to the GTDB website which allow tracking of taxonomic changes, easy assessment of genome assembly quality, and identification of genomes assembled from type material or used as species representatives. Methodological updates and policy changes made since the inception of the GTDB are then described along with the procedure used to update species clusters in the GTDB. We conclude with a discussion of the use of average nucleotide identities as a pragmatic approach for delineating prokaryotic species.
Subject(s)
Archaea/classification; Bacteria/classification; Databases, Genetic; Genome, Archaeal; Genome, Bacterial; Software; Archaea/genetics; Bacteria/genetics; Base Sequence; Internet; Metagenome; Phylogeny; Prokaryotic Cells/classification; Prokaryotic Cells/cytology; Prokaryotic Cells/metabolism
ABSTRACT
SUMMARY: The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been widely adopted by the microbiology community. However, the growing size of the GTDB bacterial reference tree has resulted in GTDB-Tk requiring substantial amounts of memory (~320 GB) which limits its adoption and ease of use. Here, we present an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives. This substantially reduces the memory requirements of GTDB-Tk while having minimal impact on classification. AVAILABILITY AND IMPLEMENTATION: GTDB-Tk is implemented in Python and licenced under the GNU General Public Licence v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
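The two-stage idea described above can be illustrated with a minimal sketch. This is not GTDB-Tk code: real placement is done on reference trees, whereas the sketch swaps in a toy nearest-neighbour lookup over short strings. All names (`BACKBONE`, `FAMILY_TO_CLASS`, `SUBTREES`, `load_subtree`, `classify`) and all data are hypothetical. The point it tries to show is that only one class-level subtree has to be loaded per query.

```python
from typing import Dict, List, Tuple


def mismatches(a: str, b: str) -> int:
    """Toy distance: number of differing positions in two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query: str, refs: Dict[str, str]) -> str:
    """Return the name of the reference closest to the query (toy stand-in for tree placement)."""
    return min(refs, key=lambda name: mismatches(query, refs[name]))


# Hypothetical stage-1 backbone: one representative per family.
BACKBONE = {"family_A": "AAAA", "family_B": "CCCC"}
# Hypothetical mapping from each family to the class-level subtree that contains it.
FAMILY_TO_CLASS = {"family_A": "class_1", "family_B": "class_2"}
# Hypothetical stage-2 subtrees: species representatives grouped by class.
SUBTREES = {
    "class_1": {"species_1": "AAAT", "species_2": "AATT"},
    "class_2": {"species_3": "CCCG", "species_4": "CCGG"},
}

loaded: List[str] = []  # records which subtrees were actually loaded


def load_subtree(class_name: str) -> Dict[str, str]:
    """Load a single class-level subtree; the others are never touched."""
    loaded.append(class_name)
    return SUBTREES[class_name]


def classify(query: str) -> Tuple[str, str]:
    """Stage 1: pick a family on the small backbone. Stage 2: place within one class subtree."""
    family = nearest(query, BACKBONE)
    class_name = FAMILY_TO_CLASS[family]
    subtree = load_subtree(class_name)
    return class_name, nearest(query, subtree)


result = classify("CCGG")
print(result, loaded)
```

In this toy setup the query touches the small backbone plus one subtree, which is the source of the memory saving the abstract describes.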
Subject(s)
Documentation; Software
ABSTRACT
SUMMARY: The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. AVAILABILITY: GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
Summary: ArachnoServer is a manually curated database that consolidates information on the sequence, structure, function and pharmacology of spider-venom toxins. Although spider venoms are complex chemical arsenals, the primary constituents are small disulfide-bridged peptides that target neuronal ion channels and receptors. Due to their high potency and selectivity, these peptides have been developed as pharmacological tools, bioinsecticides and drug leads. A new version of ArachnoServer (v3.0) has been developed that includes a bioinformatics pipeline for automated detection and analysis of peptide toxin transcripts in assembled venom-gland transcriptomes. ArachnoServer v3.0 was updated with the latest sequence, structure and functional data, the search-by-mass feature has been enhanced, and toxin cards provide additional information about each mature toxin. Availability and implementation: http://arachnoserver.org. Contact: support@arachnoserver.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Spider Venoms/chemistry; Animals; Automation, Laboratory; Disulfides/chemistry; Insect Proteins/chemistry; Peptides/chemistry; Spider Venoms/analysis
ABSTRACT
The Genome Taxonomy Database (GTDB) provides a species to domain classification of publicly available genomes based on average nucleotide identity (ANI) (for species) and a concatenated gene phylogeny normalized by evolutionary rates (for genus to phylum), which has been widely adopted by the scientific community. Here, we use the Genome UNClutterer (GUNC) software to identify putatively contaminated genomes in GTDB release 07-RS207. We found that GUNC reported 35,723 genomes as putatively contaminated, comprising 11.25% of the 317,542 genomes in GTDB release 07-RS207. To assess the impact of this high level of inferred contamination on the delineation of taxa, we created 'clean' versions of the 34,846 putatively contaminated bacterial genomes by removing the most contaminated half. For each clean half, we re-calculated the ANI and concatenated gene phylogeny and found that only 77 (0.22%) of the genomes were not consistent with their original classification. We conclude that the delineation of taxa in GTDB is robust to the putative contamination detected by GUNC.
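The two percentages quoted above follow directly from the counts in the abstract. A short check, using only those counts (the variable and function names are illustrative):

```python
# Counts taken from the abstract above.
total_genomes = 317_542      # genomes in GTDB release 07-RS207
flagged = 35_723             # genomes GUNC reported as putatively contaminated
flagged_bacterial = 34_846   # flagged bacterial genomes that were "cleaned"
inconsistent = 77            # cleaned genomes no longer matching their classification


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


flagged_pct = round(percent(flagged, total_genomes), 2)
inconsistent_pct = round(percent(inconsistent, flagged_bacterial), 2)
print(flagged_pct, inconsistent_pct)
```

Note that the 0.22% figure is relative to the 34,846 cleaned bacterial genomes, not to the full release.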
Subject(s)
Bacteria; Genome, Bacterial; Phylogeny; Bacteria/genetics; Bacteria/classification; Software; Databases, Genetic; DNA Contamination
ABSTRACT
ArachnoServer (www.arachnoserver.org) is a manually curated database providing information on the sequence, structure and biological activity of protein toxins from spider venoms. These proteins are of interest to a wide range of biologists due to their diverse applications in medicine, neuroscience, pharmacology, drug discovery and agriculture. ArachnoServer currently manages 1078 protein sequences, 759 nucleic acid sequences and 56 protein structures. Key features of ArachnoServer include a molecular target ontology designed specifically for venom toxins, current and historic taxonomic information and a powerful advanced search interface. The following significant improvements have been implemented in version 2.0: (i) the average and monoisotopic molecular masses of both the reduced and oxidized form of each mature toxin are provided; (ii) the advanced search feature now enables searches on the basis of toxin mass, external database accession numbers and publication date in ArachnoServer; (iii) toxins can now be browsed on the basis of their phyletic specificity; (iv) rapid BLAST searches based on the mature toxin sequence can be performed directly from the toxin card; (v) private silos can be requested from research groups engaged in venoms-based research, enabling them to easily manage and securely store data during the process of toxin discovery; and (vi) a detailed user manual is now available.
Subject(s)
Databases, Protein; Spider Venoms/chemistry; Animals; Internet; Proteins/chemistry; Proteins/genetics; Proteins/toxicity; Sequence Analysis; Spider Venoms/genetics; Spider Venoms/toxicity; Spiders/classification
ABSTRACT
The Genome Taxonomy Database (GTDB) is a taxonomic framework that defines prokaryotic taxa as monophyletic groups in concatenated protein reference trees according to systematic criteria. This has resulted in a substantial number of changes to existing classifications (https://gtdb.ecogenomic.org). In the case of union of taxa, GTDB names were applied based on the priority of publication. The division of taxa or change in rank led to the formation of new Latin names above the rank of genus that were only made publicly available via the GTDB website without associated published taxonomic descriptions. This has sometimes led to confusion in the literature and databases. A number of the provisional GTDB names were later published in other studies, while many still lack authorships. To reduce further confusion, here we propose names and descriptions for 329 GTDB-defined prokaryotic taxa, 223 of which are suitable for validation under the International Code of Nomenclature of Prokaryotes (ICNP) and 49 under the Code of Nomenclature of Prokaryotes described from Sequence Data (SeqCode). For the latter, we designated 23 genomes as type material. An additional 57 taxa that do not currently satisfy the validation criteria of either code are proposed as Candidatus.
Subject(s)
Authorship; Prokaryotic Cells; Databases, Factual
ABSTRACT
The accrual of genomic data from both cultured and uncultured microorganisms provides new opportunities to develop systematic taxonomies based on evolutionary relationships. Previously, we established a bacterial taxonomy through the Genome Taxonomy Database. Here, we propose a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence. The resulting archaeal taxonomy, which forms part of the Genome Taxonomy Database, is stable for a range of phylogenetic variables including marker gene selection, inference methods, corrections for rate heterogeneity and compositional bias, tree rooting scenarios and expansion of the genome database. Rank normalization is shown to robustly correct for substitution rates varying up to 30-fold using simulated datasets. Taxonomic curation follows the rules of the International Code of Nomenclature of Prokaryotes while taking into account proposals to formally recognize the rank of phylum and to use genome sequences as type material. This taxonomy is based on 2,392 archaeal genomes, 93.3% of which required one or more changes to their existing taxonomy, mainly owing to incomplete classification. We identify 16 archaeal phyla and reclassify 3 major monophyletic units from the former Euryarchaeota and one phylum that unites the Thaumarchaeota-Aigarchaeota-Crenarchaeota-Korarchaeota (TACK) superphylum into a single phylum.
Subject(s)
Archaea/classification; Databases, Genetic; Genome, Archaeal; Archaea/genetics; Databases, Genetic/standards; Evolution, Molecular; Genomics; Phylogeny; Reference Standards
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
ABSTRACT
The Genome Taxonomy Database is a phylogenetically consistent, genome-based taxonomy that provides rank-normalized classifications for ~150,000 bacterial and archaeal genomes from domain to genus. However, almost 40% of the genomes in the Genome Taxonomy Database lack a species name. We address this limitation by using commonly accepted average nucleotide identity criteria to set bounds on species and propose species clusters that encompass all publicly available bacterial and archaeal genomes. Unlike previous average nucleotide identity studies, we chose a single representative genome to serve as the effective nomenclatural 'type' defining each species. Of the 24,706 proposed species clusters, 8,792 are based on published names. We assigned placeholder names to the remaining 15,914 species clusters to provide names to the growing number of genomes from uncultivated species. This resource provides a complete domain-to-species taxonomic framework for bacterial and archaeal genomes, which will facilitate research on uncultivated species and improve communication of scientific results.
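A minimal sketch of representative-based species clustering, under stated assumptions: the 95% ANI cut-off is a commonly cited species boundary rather than a value given in the abstract, the ANI table is made-up data, and the representative is simply the first genome seen (the abstract does not say how representatives are chosen). The names `ANI_THRESHOLD`, `ani` and `cluster` are hypothetical.

```python
from typing import Dict, List, Tuple

ANI_THRESHOLD = 95.0  # assumed species boundary, not taken from the abstract


def ani(a: str, b: str, table: Dict[Tuple[str, str], float]) -> float:
    """Look up a precomputed pairwise ANI value (symmetric); unknown pairs count as 0."""
    if a == b:
        return 100.0
    return table.get((a, b), table.get((b, a), 0.0))


def cluster(genomes: List[str], table: Dict[Tuple[str, str], float]) -> Dict[str, List[str]]:
    """Assign each genome to its closest representative, or start a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for genome in genomes:
        best_rep, best_ani = None, ANI_THRESHOLD
        for rep in clusters:
            value = ani(genome, rep, table)
            if value >= best_ani:
                best_rep, best_ani = rep, value
        if best_rep is None:
            clusters[genome] = [genome]  # genome becomes the representative of a new cluster
        else:
            clusters[best_rep].append(genome)
    return clusters


# Hypothetical pairwise ANI values (percent).
ani_table = {
    ("g1", "g2"): 98.7, ("g1", "g3"): 82.0, ("g1", "g4"): 82.3,
    ("g2", "g3"): 81.5, ("g2", "g4"): 81.9, ("g3", "g4"): 96.1,
}
species_clusters = cluster(["g1", "g2", "g3", "g4"], ani_table)
print(species_clusters)

# Consistency check on the counts reported in the abstract.
named, placeholder = 8_792, 15_914
total_clusters = named + placeholder
```

Anchoring each cluster to a single representative means membership depends only on ANI to that genome, so a cluster keeps a fixed reference point as new genomes are added.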
Subject(s)
Archaea/classification; Bacteria/classification; Phylogeny; Archaea/genetics; Bacteria/genetics; Databases, Genetic; Genome, Archaeal/genetics; Genome, Bacterial/genetics; Nucleic Acid Hybridization; Reproducibility of Results
ABSTRACT
Viruses of bacteria and archaea are important players in global carbon cycling as well as drivers of host evolution, yet the taxonomic classification of viruses remains a challenge due to their genetic diversity and absence of universally conserved genes. Traditional classification approaches employ a combination of phenotypic and genetic information which is no longer scalable in the era of bulk viral genome recovery through metagenomics. Here, we evaluate a phylogenetic approach for the classification of tailed double-stranded DNA viruses from the order Caudovirales by inferring a phylogeny from the concatenation of 77 single-copy protein markers using a maximum-likelihood method. Our approach is largely consistent with the International Committee on Taxonomy of Viruses, with 72 and 89% congruence at the subfamily and genus levels, respectively. Discrepancies could be attributed to misclassifications and a small number of highly mosaic genera confounding the phylogenetic signal. We also show that confidently resolved nodes in the concatenated protein tree are highly reproducible across different software and models, and conclude that the approach can serve as a framework for a rank-normalized taxonomy of most tailed double-stranded DNA viruses.
Subject(s)
Caudovirales/classification; DNA Viruses/classification; Phylogeny; Viral Proteins/classification; Archaea/virology; Bacteria/virology; Caudovirales/genetics; Classification; DNA Viruses/genetics; Genes, Viral/genetics; Genome, Viral; Viral Proteins/genetics
ABSTRACT
Taxonomy is an organizing principle of biology and is ideally based on evolutionary relationships among organisms. Development of a robust bacterial taxonomy has been hindered by an inability to obtain most bacteria in pure culture and, to a lesser extent, by the historical use of phenotypes to guide classification. Culture-independent sequencing technologies have matured sufficiently that a comprehensive genome-based taxonomy is now possible. We used a concatenated protein phylogeny as the basis for a bacterial taxonomy that conservatively removes polyphyletic groups and normalizes taxonomic ranks on the basis of relative evolutionary divergence. Under this approach, 58% of the 94,759 genomes comprising the Genome Taxonomy Database had changes to their existing taxonomy. This result includes the description of 99 phyla, including six major monophyletic units from the subdivision of the Proteobacteria, and amalgamation of the Candidate Phyla Radiation into a single phylum. Our taxonomy should enable improved classification of uncultured bacteria and provide a sound basis for ecological and evolutionary studies.
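Relative evolutionary divergence (RED) can be sketched on a toy tree. The abstract does not give the formula; the sketch assumes the commonly described interpolation in which the root has RED 0, leaves have RED 1, and an internal node n with parent RED p gets p + (d/u)(1 - p), where d is the branch length from the parent to n and u is the mean distance from the parent to the leaves below n. The `Node` class, the function names and the tree are hypothetical, and node names are assumed unique.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    length: float = 0.0  # branch length to the parent
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Distances from this node to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in leaf_distances(c)]


def red_values(root: Node) -> Dict[str, float]:
    """Compute RED for every node by a preorder traversal (assumed formula, see lead-in)."""
    red = {root.name: 0.0}

    def visit(parent: Node) -> None:
        p = red[parent.name]
        for child in parent.children:
            if not child.children:
                red[child.name] = 1.0  # extant taxa sit at RED = 1
                continue
            d = child.length
            dists = [d + x for x in leaf_distances(child)]
            u = sum(dists) / len(dists)  # mean parent-to-leaf distance through this child
            red[child.name] = p + (d / u) * (1.0 - p)
            visit(child)

    visit(root)
    return red


# Toy tree: both clades are 2.0 from the root, but split at different depths.
tree = Node("root", children=[
    Node("A", 1.0, [Node("a1", 1.0), Node("a2", 1.0)]),
    Node("B", 0.5, [Node("b1", 1.5), Node("b2", 1.5)]),
])
red = red_values(tree)
print(red)
```

Because RED is scaled between 0 and 1 within each lineage, nodes at similar RED can be compared across lineages with different substitution rates, which is what makes it usable for normalizing ranks.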
Subject(s)
Bacteria/classification; Bacteria/genetics; Genome, Bacterial; Phylogeny; Databases, Genetic; Genomics; Software
ABSTRACT
In the original version of this Article, the authors stated that the archaeal phylum Parvarchaeota was previously represented by only two single-cell genomes (ARMAN-4_'5-way FS' and ARMAN-5_'5-way FS'). However, these are in fact unpublished, low-quality metagenome-assembled genomes (MAGs) obtained from Richmond Mine, California. In addition, the authors overlooked two higher-quality published Parvarchaeota MAGs from the same habitat, ARMAN-4 (ADCE00000000) and ARMAN-5 (ADHF00000000) (B. J. Baker et al., Proc. Natl Acad. Sci. USA 107, 8806-8811; 2010). The ARMAN-4 and ARMAN-5 MAGs are estimated to be 68.0% and 76.7% complete with 3.3% and 5.6% contamination, respectively, based on the archaeal-specific marker sets of CheckM. The 11 Parvarchaeota genomes identified in our study were obtained from different Richmond Mine metagenomes, but are highly similar to the ARMAN-4 (ANI of ~99.7%) and ARMAN-5 (ANI of ~99.6%) MAGs. The highest-quality uncultivated bacteria and archaea (UBA) MAGs with similarity to ARMAN-4 and ARMAN-5 are 82.5% and 83.3% complete with 0.9% and 1.9% contamination, respectively. The Parvarchaeota represents only 0.23% of the archaeal genome tree and addition of the ARMAN-4 and ARMAN-5 MAGs does not change the conclusions of this Article, but does impact the phylogenetic gain for this phylum. This has now been corrected in all versions of the Article. An updated version of Fig. 5 has also been used to replace the previous version, with the row for Parvarchaeota removed, and Supplementary Table 15 and Supplementary Table 17 have both been replaced to reflect the availability of the two additional Parvarchaeota genomes. In addition, the Methods incorrectly stated that all metagenomes identified as being from studies where MAGs had previously been recovered were excluded from consideration.
Metagenomes from studies where MAGs had previously been recovered were retained if the UBA MAGs provided appreciable improvements in genome quality or phylogenetic diversity. All versions of the Article have been updated to indicate the retention of such metagenomes.
ABSTRACT
Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.
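The quality thresholds quoted above (≥50% complete for all genomes; ≥90% complete with ≤5% contamination for nearly half) can be expressed as a small filter. This is only a sketch: the tier labels and the example values are hypothetical, and any further criteria the study may have applied are not represented.

```python
def quality_tier(completeness: float, contamination: float) -> str:
    """Bin a genome by the two thresholds stated in the abstract (labels are illustrative)."""
    if completeness >= 90.0 and contamination <= 5.0:
        return "near-complete"
    if completeness >= 50.0:
        return "reported"
    return "below-threshold"


# Hypothetical (completeness %, contamination %) estimates.
examples = [(95.2, 1.1), (68.0, 3.3), (93.0, 7.5), (42.0, 0.5)]
tiers = [quality_tier(comp, cont) for comp, cont in examples]
print(tiers)
```

The third example shows why both conditions matter: a genome above 90% completeness still falls out of the top tier when its contamination exceeds 5%.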