ABSTRACT
BACKGROUND: Viruses with double-stranded (ds) DNA genomes in the realm Duplodnaviria share a conserved structural gene module but vary widely in their repertoires of DNA replication proteins. Some duplodnaviruses encode (nearly) complete replication systems, whereas others lack (almost) all genes required for replication and rely on the host replication machinery. DNA polymerases (DNAPs) are the centerpiece of the DNA replication apparatus. The replicative DNAPs are classified into four unrelated or distantly related families (A-D), with the protein structures and sequences within each family being generally highly conserved. More than half of the duplodnaviruses encode a DNAP of family A, B or C. We showed previously that multiple pairs of closely related viruses in the order Crassvirales encode DNAPs of different families.
METHODS: Groups of phages in which DNAP swapping likely occurred were identified as subtrees of a defined depth, within a comprehensive evolutionary tree of tailed bacteriophages, that included phages with DNAPs of different families. The DNAP swaps were validated by constrained tree analysis performed on a phylogenetic tree of large terminase subunits, and the phage genomes encoding swapped DNAPs were aligned using Mauve. The structures of the unusual DNAPs discovered were predicted using AlphaFold2.
RESULTS: We identified four additional groups of tailed phages in the class Caudoviricetes in which the DNAPs apparently were swapped on multiple occasions, with replacements occurring between families A and B, between families A and C, and between distinct subfamilies within the same family. The DNAP swapping always occurs "in situ", without changes in the organization of the surrounding genes. In several cases, the DNAP gene is the only region of substantial divergence between closely related phage genomes, whereas in others, the swap apparently involved neighboring genes encoding other proteins involved in phage genome replication. In addition, we identified two previously undetected, highly divergent groups of family A DNAPs that are encoded in some phage genomes along with the main DNAP implicated in genome replication.
CONCLUSIONS: Replacement of the DNAP gene by one encoding a DNAP of a different family occurred on many independent occasions during the evolution of different families of tailed phages, in some cases resulting in very closely related phages encoding unrelated DNAPs. DNAP swapping was likely driven by selection for avoidance of host antiphage mechanisms that target the phage DNAP and remain to be identified, and/or by selection against replicon incompatibility.
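The subtree screen described under METHODS can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the nested-tuple tree encoding, the use of edge count as "depth", the function names, and the toy phage names and family labels are all assumptions made for illustration (the study may well have measured depth in branch length).

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding: a tree is a leaf name (str) or a tuple of child subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> List[str]:
    """Return leaf names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def height(tree: Tree) -> int:
    """Number of edges on the longest path from this node down to a leaf."""
    if isinstance(tree, str):
        return 0
    return 1 + max(height(child) for child in tree)


def mixed_subtrees(tree: Tree, family: Dict[str, str], max_height: int) -> List[List[str]]:
    """Collect the largest subtrees no taller than max_height whose leaves
    carry DNAPs of more than one family (candidate swap groups)."""
    if isinstance(tree, str):
        return []
    if height(tree) <= max_height:
        names = leaves(tree)
        # A uniform subtree cannot contain a mixed sub-subtree, so stop here.
        return [names] if len({family[n] for n in names}) > 1 else []
    # Subtree is too deep: look for shallower candidates among its children.
    return [hit for child in tree for hit in mixed_subtrees(child, family, max_height)]


# Toy data (invented): seven phages with assumed DNAP family labels.
tree: Tree = ((("phage1", "phage2"), "phage3"),
              (("phage4", "phage5"), ("phage6", "phage7")))
family = {"phage1": "A", "phage2": "B", "phage3": "A",
          "phage4": "C", "phage5": "C", "phage6": "A", "phage7": "A"}

shallow_hits = mixed_subtrees(tree, family, max_height=1)
deeper_hits = mixed_subtrees(tree, family, max_height=2)
```

With the shallow cutoff only the phage1/phage2 pair is reported; relaxing the cutoff groups more distantly related phages, which is why the depth threshold has to be fixed in advance.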
Subjects
DNA-Directed DNA Polymerase, Phylogeny, Viral Proteins, DNA-Directed DNA Polymerase/genetics, Viral Proteins/genetics, Viral Proteins/metabolism, Molecular Evolution, Viral Genome, Caudovirales/genetics, Caudovirales/classification, Viral DNA/genetics, Bacteriophages/genetics, Bacteriophages/enzymology, Bacteriophages/classification, DNA Replication
ABSTRACT
Tailed bacteriophages are the most abundant and diverse viruses in the world, with genome sizes ranging from 10 kbp to over 500 kbp. Yet, for historical reasons, all this diversity is confined to a single virus order, Caudovirales, composed of just four families: Myoviridae, Siphoviridae, Podoviridae, and the newly created Ackermannviridae. In recent years, this morphology-based classification scheme has started to crumble under the constant flood of phage sequences, revealing that tailed phages are even more genetically diverse than once thought. This prompted us, the Bacterial and Archaeal Viruses Subcommittee of the International Committee on Taxonomy of Viruses (ICTV), to consider an overall reorganization of phage taxonomy. In this study, we used a wide range of complementary methods (including comparative genomics, core genome analysis, and marker gene phylogenetics) to show that the group of Bacillus phage SPO1-related viruses previously classified into the Spounavirinae subfamily is clearly distinct from other members of the family Myoviridae and that its diversity deserves the rank of an autonomous family. Thus, we removed this group from the Myoviridae family and created the family Herelleviridae, a new taxon of the same rank. In the process of the taxon evaluation, we explored the feasibility of different demarcation criteria and critically evaluated the usefulness of our methods for phage classification. The convergence of results, which draw a consistent and comprehensive picture of a new family with associated subfamilies regardless of method, demonstrates that the tools applied here are particularly useful in phage taxonomy. We are convinced that creation of this novel family is a crucial milestone toward much-needed reclassification in the Caudovirales order.
Subjects
Caudovirales/classification, Phylogeny, Caudovirales/genetics, Classification, Viral Genome/genetics
ABSTRACT
Antimicrobial resistance (AMR) is a major public health problem that requires publicly available tools for rapid analysis. To identify AMR genes in whole-genome sequences, the National Center for Biotechnology Information (NCBI) has produced AMRFinder, a tool that identifies AMR genes using a high-quality curated AMR gene reference database. The Bacterial Antimicrobial Resistance Reference Gene Database consists of up-to-date gene nomenclature, a set of hidden Markov models (HMMs), and a curated protein family hierarchy. Currently, it contains 4,579 antimicrobial resistance proteins and more than 560 HMMs. Here, we describe AMRFinder and its associated database. To assess the predictive ability of AMRFinder, we measured the consistency between predicted AMR genotypes from AMRFinder and resistance phenotypes of 6,242 isolates from the National Antimicrobial Resistance Monitoring System (NARMS). This included 5,425 Salmonella enterica, 770 Campylobacter spp., and 47 Escherichia coli isolates phenotypically tested against various antimicrobial agents. Of 87,679 susceptibility tests performed, 98.4% were consistent with predictions. To assess the accuracy of AMRFinder, we compared its gene symbol output with that of a 2017 version of ResFinder, another publicly available resistance gene detection system. Most gene calls were identical, but there were 1,229 gene symbol differences (8.8%) between them, with differences due to both algorithmic differences and database composition. AMRFinder missed 16 loci that ResFinder found, while ResFinder missed 216 loci that AMRFinder identified. Based on these results, AMRFinder appears to be a highly accurate AMR gene detection system.
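The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated above; the derived counts (consistent tests, inconsistent tests, total gene calls compared) are not reported in the abstract and are approximations, because the percentages are rounded.

```python
# Figures quoted in the abstract.
isolates = {
    "Salmonella enterica": 5425,
    "Campylobacter spp.": 770,
    "Escherichia coli": 47,
}
total_isolates = 6242
susceptibility_tests = 87_679
consistent_fraction = 0.984      # 98.4% of tests consistent with predictions
symbol_differences = 1229
difference_fraction = 0.088      # 8.8% of gene symbols differed

# The per-organism counts add up to the stated total.
isolate_sum = sum(isolates.values())

# Derived, approximate values (assumption: percentages are rounded to one decimal).
consistent_tests = round(susceptibility_tests * consistent_fraction)
inconsistent_tests = susceptibility_tests - consistent_tests
compared_gene_calls = round(symbol_differences / difference_fraction)

print(isolate_sum, consistent_tests, inconsistent_tests, compared_gene_calls)
```

Under these assumptions, roughly 1,400 of the susceptibility tests disagreed with the genotype-based prediction, and the ResFinder comparison covered on the order of 14,000 gene calls.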
ABSTRACT
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
Subjects
Genetic Databases, Genomics, Animals, Cattle, Gene Expression Profiling, Fungal Genome, Human Genome, Microbial Genome, Plant Genome, Viral Genome, Genomics/standards, Humans, Invertebrates/genetics, Mice, Molecular Sequence Annotation, Nematoda/genetics, Phylogeny, Long Noncoding RNA/genetics, Rats, Reference Standards, Protein Sequence Analysis, RNA Sequence Analysis, Vertebrates/genetics
ABSTRACT
The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life (Eukarya, Bacteria and Archaea) as well as viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10,000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30,000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and to the results of pre-computed analyses, and by improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genome data processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.
Subjects
Nucleic Acid Databases, Archaeal Genome, Bacterial Genome, Internet, Molecular Sequence Annotation
ABSTRACT
The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.
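As an illustration of the programmatic access mentioned above, the following Python sketch builds an E-utilities ESummary request for Gene records. The helper names are hypothetical, and the two integer identifiers are used only as examples of the "unique, tracked integers" the abstract describes; the fetch function performs a network call and is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_summary_url(gene_ids):
    """Build an ESummary URL for one or more integer Gene identifiers."""
    params = {
        "db": "gene",
        "id": ",".join(str(gene_id) for gene_id in gene_ids),
        "retmode": "json",
    }
    # Keep commas unescaped so the ID list stays readable in the URL.
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params, safe=',')}"


def fetch_gene_summaries(gene_ids):
    """Download and parse the summaries (requires network access)."""
    with urlopen(gene_summary_url(gene_ids)) as response:
        return json.load(response)


# Illustrative identifiers only; any valid Gene IDs could be substituted.
example_url = gene_summary_url([7157, 672])
print(example_url)
```

For large downloads, the bulk FTP files mentioned in the abstract are usually the better route; per-record requests like this suit interactive or small-scale lookups.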
Subjects
Genetic Databases, Genes, Genetic Variation, Genomics, Internet, National Library of Medicine (U.S.), Phenotype, United States
ABSTRACT
In 1994, analyses of clostridial 16S rRNA gene sequences led to the assignment of 18 species to Clostridium cluster XI, separating them from Clostridium sensu stricto (Clostridium cluster I). Subsequently, most cluster XI species have been assigned to the family Peptostreptococcaceae, with some species being reassigned to new genera. However, several misclassified Clostridium species remained, creating a taxonomic conundrum and confusion regarding their status. Here, we have re-examined the phylogeny of cluster XI species by comparing the 16S rRNA gene-based trees with protein- and genome-based trees, where available. The resulting phylogeny of the Peptostreptococcaceae was consistent with the recent proposals to create seven new genera within this family. This analysis also revealed a tight clustering of Clostridium litorale and Eubacterium acidaminophilum. Based on these data, we propose reassigning these two organisms to the new genus Peptoclostridium as Peptoclostridium litorale gen. nov. comb. nov. (the type species of the genus) and Peptoclostridium acidaminophilum comb. nov., respectively. As correctly noted in the original publications, the genera Acetoanaerobium and Proteocatella also fall within cluster XI, and can be assigned to the Peptostreptococcaceae. Clostridium sticklandii, which falls within the radiation of the genus Acetoanaerobium, is proposed to be reclassified as Acetoanaerobium sticklandii comb. nov. The remaining misnamed members of the Peptostreptococcaceae, [Clostridium] hiranonis, [Clostridium] paradoxum and [Clostridium] thermoalcaliphilum, remain to be properly classified.
Subjects
Clostridium/classification, Eubacterium/classification, Phylogeny, Bacterial Typing Techniques, Base Composition, Bacterial DNA/genetics, 16S Ribosomal RNA/genetics, DNA Sequence Analysis
ABSTRACT
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for bioinformatics tools for annotation, analysis and visualization. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
Subjects
Genetic Databases, Microbial Genome, Molecular Sequence Annotation, Bacterial Proteins/genetics, Bacterial Genome, Genomics/standards, Internet, Reference Standards
ABSTRACT
Viruses with double-stranded (ds) DNA genomes in the realm Duplodnaviria share a conserved structural gene module but vary widely in their repertoires of DNA replication proteins. Some duplodnaviruses encode (nearly) complete replication systems, whereas others lack (almost) all genes required for replication and rely on the host replication machinery. DNA polymerases (DNAPs) are the centerpiece of the DNA replication apparatus. The replicative DNAPs are classified into four unrelated or distantly related families (A-D), with the protein structures and sequences within each family being generally highly conserved. More than half of the duplodnaviruses encode a DNAP of family A, B or C. We showed previously that multiple pairs of closely related viruses in the order Crassvirales encode DNAPs of different families. Here we identify four additional groups of tailed phages in the class Caudoviricetes in which the DNAPs apparently were swapped on multiple occasions, with replacements occurring between families A and B, between families A and C, and between distinct subfamilies within the same family. The DNAP swapping always occurs "in situ", without changes in the organization of the surrounding genes. In several cases, the DNAP gene is the only region of substantial divergence between closely related phage genomes, whereas in others, the swap apparently involved neighboring genes encoding other proteins involved in phage replication. We hypothesize that DNAP swapping is driven by selection for avoidance of host antiphage mechanisms that target the phage DNAP and remain to be identified, and/or by selection against replicon incompatibility. In addition, we identified two previously undetected, highly divergent groups of family A DNAPs that are encoded in some phage genomes along with the main DNAP implicated in genome replication.
ABSTRACT
Human microbiomes are essential to health throughout the lifespan and are increasingly recognized and studied for their roles in metabolic, immunological and neurological processes. Although the complexity of these microbial communities is not fully understood, their clinical and industrial exploitation is well advanced and expanding, and it needs greater oversight guided by a consensus from the research community. One of the most controversial issues in microbiome research is the definition of a 'healthy' human microbiome. This concept is complicated by microbial variability over different spatial and temporal scales, along with the challenge of applying a unified definition to the spectrum of healthy microbiome configurations. In this Perspective, we examine the progress made and the key gaps that remain to be addressed to fully harness the benefits of the human microbiome. We propose a road map to expand our knowledge of the microbiome-health relationship, incorporating epidemiological approaches informed by the unique ecological characteristics of these communities.
ABSTRACT
Hydrocephalus, the leading indication for childhood neurosurgery worldwide, is particularly prevalent in low- and middle-income countries. Hydrocephalus preceded by an infection, or postinfectious hydrocephalus, accounts for up to 60% of hydrocephalus in these areas. Since many children with hydrocephalus suffer poor long-term outcomes despite surgical intervention, prevention of hydrocephalus remains paramount. Our previous studies implicated a novel bacterial pathogen, Paenibacillus thiaminolyticus, as a causal agent of neonatal sepsis and postinfectious hydrocephalus in Uganda. Here, we report the isolation of three P. thiaminolyticus strains, Mbale, Mbale2, and Mbale3, from patients with postinfectious hydrocephalus. We constructed complete genome assemblies of the clinical isolates as well as the nonpathogenic P. thiaminolyticus reference strain and performed comparative genomic and proteomic analyses to identify potential virulence factors. All three isolates carry a unique beta-lactamase gene, and two of the three isolates exhibit resistance in culture to the beta-lactam antibiotics penicillin and ampicillin. In addition, a cluster of genes carried on a mobile genetic element that encodes a putative type IV pilus operon is present in all three clinical isolates but absent in the reference strain. CRISPR-mediated deletion of the gene cluster substantially reduced the virulence of the Mbale strain in mice. Comparative proteogenomic analysis identified various additional potential virulence factors likely acquired on mobile genetic elements in the virulent strains. These results provide insight into the emergence of virulence in P. thiaminolyticus and suggest avenues for the diagnosis and treatment of this novel bacterial pathogen.
IMPORTANCE: Postinfectious hydrocephalus, a devastating sequela of neonatal infection, is associated with increased childhood mortality and morbidity. A novel bacterial pathogen, Paenibacillus thiaminolyticus, is highly associated with postinfectious hydrocephalus in an African cohort. Whole-genome sequencing, RNA sequencing, and proteomics of clinical isolates and a reference strain, in combination with CRISPR editing, identified type IV pili as a critical virulence factor for P. thiaminolyticus infection. Acquisition of a type IV pilus-encoding mobile genetic element critically contributed to converting a nonpathogenic strain of P. thiaminolyticus into a pathogen capable of causing devastating disease. Given the widespread presence of type IV pili in pathogens, the type IV pilus operon could serve as a diagnostic and therapeutic target in P. thiaminolyticus and related bacteria.
Subjects
Proteomics, Virulence Factors, Mice, Animals, Virulence Factors/genetics, Uganda, Bacterial Fimbriae/genetics
ABSTRACT
Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain this deluge of data and keep it up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nucleotide sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.
Subjects
Protein Databases, Proteins/classification, Cluster Analysis, Genomics, Proteins/chemistry, Proteins/genetics, Amino Acid Sequence Homology
ABSTRACT
All sequencing projects of bacteriophages (phages) should seek to report an accurate and comprehensive annotation of their genomes. This article defines 14 questions for those new to phage genomics that should be addressed before submitting a genome sequence to the International Nucleotide Sequence Database Collaboration or writing a publication.
ABSTRACT
CrAssphage is the most abundant human-associated virus and the founding member of a large group of bacteriophages, discovered in animal-associated and environmental metagenomes, that infect bacteria of the phylum Bacteroidetes. We analyze 4907 Circular Metagenome Assembled Genomes (cMAGs) of putative viruses from human gut microbiomes and identify nearly 600 genomes of crAss-like phages that account for nearly 87% of the DNA reads mapped to these cMAGs. Phylogenetic analysis of conserved genes demonstrates the monophyly of crAss-like phages, a putative virus order, and of 5 branches, potential families within that order, two of which have not been identified previously. The phage genomes in one of these families are almost twofold larger than the crAssphage genome (145-192 kilobases), with a high density of self-splicing introns and inteins. Many crAss-like phages encode suppressor tRNAs that enable read-through of UGA or UAG stop codons, mostly in late phage genes. A distinct feature of the crAss-like phages is the recurrent switch of the phage DNA polymerase type between the A and B families. Thus, comparative genomic analysis of the expanded assemblage of crAss-like phages reveals aspects of genome architecture and expression, as well as phage biology, that were not apparent from previous work on phage genomics.
Subjects
Bacteriophages/genetics, Gastrointestinal Microbiome/genetics, Viral Genome, Metagenome, Codon/genetics, Conserved Sequence, DNA-Directed DNA Polymerase/metabolism, Humans, Inteins, Introns/genetics, Open Reading Frames/genetics, Phylogeny, RNA Splicing/genetics, Genetic Transcription, Virome/genetics
ABSTRACT
Type IV CRISPR-Cas systems are a distinct variety of highly derived CRISPR-Cas systems that appear to have evolved from type III systems through the loss of the target-cleaving nuclease and partial deterioration of the large subunit of the effector complex. All known type IV CRISPR-Cas systems are encoded on plasmids, integrative and conjugative elements (ICEs), or prophages, and are thought to contribute to competition between these elements, although the mechanistic details of their function remain unknown. There is a clear parallel between the compositions and likely origin of type IV systems and those of type I systems that were recruited by Tn7-like transposons and mediate RNA-guided transposition. We investigated the diversity and evolutionary relationships of type IV systems, with a focus on those in Acidithiobacillia, where this variety of CRISPR is particularly abundant and always found on ICEs. Our analysis revealed remarkable evolutionary plasticity of type IV CRISPR-Cas systems, with adaptation and ancillary genes originating from different ancestral CRISPR-Cas varieties, and extensive gene shuffling within the type IV loci. The adaptation module and the CRISPR array apparently were lost in the type IV ancestor but were subsequently recaptured by type IV systems on several independent occasions. We demonstrate a high level of heterogeneity among the repeats within type IV CRISPR arrays, which far exceeds the heterogeneity of any other known CRISPR repeats and suggests a unique adaptation mechanism. The spacers in the type IV arrays for which protospacers could be identified match plasmid genes, in particular those encoding the conjugation apparatus components. Both the biochemical mechanism of type IV CRISPR-Cas function and the role of these systems in the competition among mobile genetic elements remain to be investigated.
Subjects
CRISPR-Cas Systems/genetics, Molecular Evolution, Proteobacteria/genetics, Bacterial Genes, Phylogeny, Genetic Polymorphism, Proteobacteria/classification
ABSTRACT
While taxonomy is an often-unappreciated branch of science, it serves very important roles. Bacteriophage taxonomy has evolved from a mainly morphology-based discipline, characterized by the work of David Bradley and Hans-Wolfgang Ackermann, to the holistic approach that is taken today. The Bacterial and Archaeal Viruses Subcommittee of the International Committee on Taxonomy of Viruses (ICTV) takes a comprehensive approach to classifying prokaryote viruses, measuring overall DNA and protein identity and phylogeny before making decisions about the taxonomic position of a new virus. The huge number of complete genomes being deposited with NCBI and other public databases has resulted in a reassessment of the taxonomy of many viruses, and the future will see the introduction of new viral families and higher orders.