ABSTRACT
The definition of bacterial species is traditionally a taxonomic issue while bacterial populations are identified by population genetics. These assignments are species specific, and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST allelic profiles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short-read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate HierCC's ability to correctly assign 100 000s of genomes to the species/subspecies and population levels for Salmonella, Escherichia, Clostridioides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI was uniformly consistent with classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and with O serogroups in Salmonella. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Subjects
Bacterial Genome, Genomics, Cluster Analysis, Multilocus Sequence Typing, Phylogeny
ABSTRACT
Salmonella enterica serovar Typhimurium strain ATCC14028s is commercially available from multiple national type culture collections, and has been widely used since 1960 for quality control of growth media and experiments on fitness ("laboratory evolution"). ATCC14028s has been implicated in multiple cross-contaminations in the laboratory, and has also caused multiple laboratory infections and one known attempt at bioterrorism. According to hierarchical clustering of 3002 core gene sequences, ATCC14028s belongs to HierCC cluster HC20_373 in which most internal branch lengths are only one to three SNPs long. Many natural Typhimurium isolates from humans, domesticated animals and the environment also belong to HC20_373, and their core genomes are almost indistinguishable from those of laboratory strains. These natural isolates have infected humans in Ireland and Taiwan for decades, and are common in the British Isles as well as the Americas. The isolation history of some of the natural isolates confirms the conclusion that they do not represent recent contamination by the laboratory strain, and 10% carry plasmids or bacteriophages which have been acquired in nature by HGT from unrelated bacteria. We propose that ATCC14028s has repeatedly escaped from the laboratory environment into nature via laboratory accidents or infections, but the escaped micro-lineages have only a limited life span. As a result, there is a genetic gap separating HC20_373 from its closest natural relatives due to a divergence between them in the late 19th century followed by repeated extinction events of escaped HC20_373.
Subjects
Bacterial Genome, Laboratories, Salmonella enterica/genetics, Bayes Theorem, Bioterrorism, Genetic Databases, Molecular Evolution, Likelihood Functions, Phylogeny, Salmonella enterica/classification
ABSTRACT
The gastric bacterium Helicobacter pylori shares a coevolutionary history with humans that predates the out-of-Africa diaspora, and the geographical specificities of H. pylori populations reflect multiple well-known human migrations. We extensively sampled H. pylori from 16 ethnically diverse human populations across Siberia to help resolve whether ancient northern Eurasian populations persisted at high latitudes through the last glacial maximum and the relationships between present-day Siberians and Native Americans. A total of 556 strains were cultivated and genotyped by multilocus sequence typing, and 54 representative draft genomes were sequenced. The genetic diversity across Eurasia and the Americas was structured into three populations: hpAsia2, hpEastAsia, and hpNorthAsia. hpNorthAsia is closely related to the subpopulation hspIndigenousAmericas from Native Americans. Siberian bacteria were structured into five other subpopulations, two of which evolved through a divergence from hpAsia2 and hpNorthAsia, while three originated through Holocene admixture. The presence of both anciently diverged and recently admixed strains across Siberia supports both Pleistocene persistence and Holocene recolonization. We also show that hspIndigenousAmericas is endemic in human populations across northern Eurasia. The evolutionary history of hspIndigenousAmericas was reconstructed using approximate Bayesian computation, which showed that it colonized the New World in a single migration event associated with a severe demographic bottleneck followed by low levels of recent admixture across the Bering Strait.
Subjects
Animal Migration/physiology, Helicobacter pylori/physiology, Americas, Biological Evolution, Bacterial Genome, Geography, Helicobacter pylori/classification, Helicobacter pylori/genetics, Humans, Biological Models, Multilocus Sequence Typing, Siberia
ABSTRACT
MOTIVATION: Routine infectious disease surveillance is increasingly based on large-scale whole-genome sequencing databases. Real-time surveillance would benefit from immediate assignments of each genome assembly to hierarchical population structures. Here we present pHierCC, a pipeline that defines a scalable clustering scheme, HierCC, based on core genome multi-locus typing that allows incremental, static, multi-level cluster assignments of genomes. We also present HCCeval, which identifies optimal thresholds for assigning genomes to cohesive HierCC clusters. HierCC was implemented in EnteroBase in 2018 and has since genotyped >530 000 genomes from Salmonella, Escherichia/Shigella, Streptococcus, Clostridioides, Vibrio and Yersinia. AVAILABILITY AND IMPLEMENTATION: EnteroBase is available at https://enterobase.warwick.ac.uk/; source code and instructions are at https://github.com/zheminzhou/pHierCC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
BlastFrost is a highly efficient method for querying 100,000s of genome assemblies, building on Bifrost, a dynamic data structure for compacted and colored de Bruijn graphs. BlastFrost queries a Bifrost data structure for sequences of interest and extracts local subgraphs, enabling the identification of the presence or absence of individual genes or single nucleotide sequence variants. We show two examples using Salmonella genomes: finding within minutes the presence of genes in the SPI-2 pathogenicity island in a collection of 926 genomes and identifying single nucleotide polymorphisms associated with fluoroquinolone resistance in three genes among 190,209 genomes. BlastFrost is available at https://github.com/nluhmann/BlastFrost/tree/master/data .
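The general idea of querying a colored k-mer index for the presence of a sequence can be illustrated with a minimal Python sketch. This is not BlastFrost or Bifrost code: it swaps the compacted, colored de Bruijn graph for a plain dictionary that maps each k-mer to the set of genomes ("colors") containing it, it ignores reverse complements, and every name in it (`build_index`, `query`, `min_fraction`, the toy genomes and k = 4) is hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def query(index, seq, k, min_fraction=1.0):
    """Return the genomes that contain at least min_fraction of the query k-mers."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return set()  # query shorter than k: nothing to look up
    hits = defaultdict(int)
    for km in query_kmers:
        for name in index.get(km, ()):
            hits[name] += 1
    return {n for n, c in hits.items() if c / len(query_kmers) >= min_fraction}


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    genomes = {"g1": "ACGTACGTGG", "g2": "ACGTTTTTGG", "g3": "GGGGGGGGGG"}
    index = build_index(genomes, k=4)
    print(query(index, "GTACG", k=4))
    print(query(index, "ACGTT", k=4, min_fraction=0.5))
```

Lowering `min_fraction` below 1.0 tolerates query k-mers that are absent from a genome, which is one simple way to allow for sequence variants. A real implementation would work on the graph structure rather than on isolated k-mers.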
Subjects
Bacteria/genetics, Bacterial Proteins/genetics, Bacterial Genome, Genomics/methods, Algorithms, Genomic Islands, Humans, Membrane Proteins/genetics, Single Nucleotide Polymorphism, Salmonella/genetics
ABSTRACT
[This corrects the article DOI: 10.1371/journal.ppat.1002776.].
ABSTRACT
We have recently developed bioinformatic tools to accurately assign metagenomic sequence reads to microbial taxa: SPARSE for probabilistic, taxonomic classification of sequence reads; EToKi for assembling and polishing genomes from short-read sequences; and GrapeTree, a graphic visualizer of genetic distances between large numbers of genomes. Together, these methods support comparative analyses of genomes from ancient skeletons and modern humans. Here, we illustrate these capabilities with 784 samples from historical dental calculus, modern saliva and modern dental plaque. The analyses revealed 1591 microbial species within the oral microbiome. We anticipated that the oral complexes of Socransky et al., which were defined in 1998, would predominate among taxa whose frequencies differed by source. However, although some species discriminated between sources, we could not confirm the existence of the complexes. The results also illustrate further functionality of our pipelines with two species that are associated with dental caries, Streptococcus mutans and Streptococcus sobrinus. They were rare in historical dental calculus but common in modern plaque, and even more common in saliva. Reconstructed draft genomes of these two species from metagenomic samples in which they were abundant were combined with modern public genomes to provide a detailed overview of their core genomic diversity. This article is part of the theme issue 'Insights into health and disease from ancient biomolecules'.
Subjects
Dental Caries/history, Dental Caries/microbiology, Metagenome, Microbiota, Mouth/microbiology, Streptococcus mutans/genetics, Streptococcus sobrinus/genetics, 15th Century History, 16th Century History, 17th Century History, 18th Century History, 19th Century History, 20th Century History, Ancient History, Medieval History, Humans, Phylogeny, Saliva/microbiology, Streptococcus mutans/classification, Streptococcus sobrinus/classification
ABSTRACT
Bacterial genomes can contain traces of a complex evolutionary history, including extensive homologous recombination, gene loss, gene duplications, and horizontal gene transfer. To reconstruct the phylogenetic and population history of a set of multiple bacteria, it is necessary to examine their pangenome, the composite of all the genes in the set. Here we introduce PEPPAN, a novel pipeline that can reliably construct pangenomes from thousands of genetically diverse bacterial genomes that represent the diversity of an entire genus. PEPPAN outperforms existing pangenome methods by providing consistent gene and pseudogene annotations extended by similarity-based gene predictions, and identifying and excluding paralogs by combining tree- and synteny-based approaches. The PEPPAN package additionally includes PEPPAN_parser, which implements additional downstream analyses, including the calculation of trees based on accessory gene content or allelic differences between core genes. To test the accuracy of PEPPAN, we implemented SimPan, a novel pipeline for simulating the evolution of bacterial pangenomes. We compared the accuracy and speed of PEPPAN with four state-of-the-art pangenome pipelines using both empirical and simulated data sets. PEPPAN was more accurate and more specific than any of the other pipelines and was almost as fast as any of them. As a case study, we used PEPPAN to construct a pangenome of approximately 40,000 genes from 3052 representative genomes spanning at least 80 species of Streptococcus. The resulting gene and allelic trees provide an unprecedented overview of the genomic diversity of the entire Streptococcus genus.
Subjects
Bacteria/classification, Bacterial Genome, Genomics/methods, Phylogeny, Algorithms, Bacterial Genes, Pseudogenes, Software, Streptococcus/classification, Streptococcus/genetics
ABSTRACT
Clostridioides difficile is the primary infectious cause of antibiotic-associated diarrhea. Local transmissions and international outbreaks of this pathogen have been previously elucidated by bacterial whole-genome sequencing, but comparative genomic analyses at the global scale were hampered by the lack of specific bioinformatic tools. Here we introduce a publicly accessible database within EnteroBase (http://enterobase.warwick.ac.uk) that automatically retrieves and assembles C. difficile short-reads from the public domain, and calls alleles for core-genome multilocus sequence typing (cgMLST). We demonstrate that comparable levels of resolution and precision are attained by EnteroBase cgMLST and single-nucleotide polymorphism analysis. EnteroBase currently contains 18 254 quality-controlled C. difficile genomes, which have been assigned to hierarchical sets of single-linkage clusters by cgMLST distances. This hierarchical clustering is used to identify and name populations of C. difficile at all epidemiological levels, from recent transmission chains through to epidemic and endemic strains. Moreover, it puts newly collected isolates into phylogenetic and epidemiological context by identifying related strains among all previously published genome data. For example, HC2 clusters (i.e. chains of genomes with pairwise distances of up to two cgMLST alleles) were statistically associated with specific hospitals (P<10⁻⁴) or single wards (P=0.01) within hospitals, indicating they represented local transmission clusters. We also detected several HC2 clusters spanning more than one hospital that by retrospective epidemiological analysis were confirmed to be associated with inter-hospital patient transfers. In contrast, clustering at level HC150 correlated with k-mer-based classification and was largely compatible with PCR ribotyping, thus enabling comparisons to earlier surveillance data.
EnteroBase enables contextual interpretation of a growing collection of assembled, quality-controlled C. difficile genome sequences and their associated metadata. Hierarchical clustering rapidly identifies database entries that are related at multiple levels of genetic distance, facilitating communication among researchers, clinicians and public-health officials who are combatting disease caused by C. difficile.
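The notion of an HC2 cluster as a chain of genomes linked by pairwise distances of up to two cgMLST alleles can be sketched as single-linkage clustering over allelic profiles. The Python below is an illustration under stated assumptions, not the EnteroBase or pHierCC implementation: the distance simply counts differing loci and skips loci that are missing (`None`) in either profile, the union-find helper and the function names are hypothetical, and the six-locus toy profiles are invented.

```python
from itertools import combinations


def allelic_distance(a, b):
    """Count loci with different allele numbers; skip loci missing (None) in either profile."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def single_linkage(profiles, threshold):
    """Group genomes connected by chains of pairwise distances <= threshold."""
    parent = {name: name for name in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # merge the two clusters

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def hierarchical_clusters(profiles, thresholds):
    """Cluster the same profiles at several thresholds, e.g. HC0, HC2, HC5."""
    return {f"HC{t}": single_linkage(profiles, t) for t in thresholds}


if __name__ == "__main__":
    # Toy allelic profiles (one allele number per locus), invented for illustration.
    profiles = {
        "A": [1, 1, 1, 1, 1, 1],
        "B": [1, 1, 1, 1, 1, 2],
        "C": [1, 1, 1, 1, 2, 3],
        "D": [1, 1, 2, 2, 2, 3],
        "E": [5, 5, 5, 5, 5, 5],
    }
    for level, clusters in hierarchical_clusters(profiles, [0, 1, 2]).items():
        print(level, clusters)
```

In the toy data, A and D differ at four loci yet share an HC2 cluster because each step along the chain A, C, D is within two alleles. This chaining is the defining property of single linkage. The all-pairs loop is quadratic and only suitable for small examples.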
Subjects
Clostridioides difficile/genetics, Clostridium Infections, Genetic Databases, Chromosome Mapping, Clostridium Infections/epidemiology, Clostridium Infections/microbiology, Clostridium Infections/transmission, Disease Outbreaks, Bacterial Genome, Humans, Phylogeny, Retrospective Studies
ABSTRACT
It has been hypothesized that the Neolithic transition towards an agricultural and pastoralist economy facilitated the emergence of human-adapted pathogens. Here, we recovered eight Salmonella enterica subsp. enterica genomes from human skeletons of transitional foragers, pastoralists and agropastoralists in western Eurasia that were up to 6,500 yr old. Despite the high genetic diversity of S. enterica, all ancient bacterial genomes clustered in a single previously uncharacterized branch that contains S. enterica adapted to multiple mammalian species. All ancient bacterial genomes from prehistoric (agro-)pastoralists fall within a part of this branch that also includes the human-specific S. enterica Paratyphi C, illustrating the evolution of a human pathogen over a period of 5,000 yr. Bacterial genomic comparisons suggest that the earlier ancient strains were not host specific, differed in pathogenic potential and experienced convergent pseudogenization that accompanied their downstream host adaptation. These observations support the concept that the emergence of human-adapted S. enterica is linked to human cultural transformations.
Subjects
Salmonella enterica, Animals, Bacterial Genome, Humans
ABSTRACT
EnteroBase is an integrated software environment that supports the identification of global population structures within several bacterial genera that include pathogens. Here, we provide an overview of how EnteroBase works, what it can do, and its future prospects. EnteroBase has currently assembled more than 300,000 genomes from Illumina short reads from Salmonella, Escherichia, Yersinia, Clostridioides, Helicobacter, Vibrio, and Moraxella and genotyped those assemblies by core genome multilocus sequence typing (cgMLST). Hierarchical clustering of cgMLST sequence types allows mapping a new bacterial strain to predefined population structures at multiple levels of resolution within a few hours after uploading its short reads. Case Study 1 illustrates this process for local transmissions of Salmonella enterica serovar Agama between neighboring social groups of badgers and humans. EnteroBase also supports single nucleotide polymorphism (SNP) calls from both genomic assemblies and after extraction from metagenomic sequences, as illustrated by Case Study 2 which summarizes the microevolution of Yersinia pestis over the last 5000 years of pandemic plague. EnteroBase can also provide a global overview of the genomic diversity within an entire genus, as illustrated by Case Study 3, which presents a novel, global overview of the population structure of all of the species, subspecies, and clades within Escherichia.
Subjects
Genetic Databases, Escherichia/genetics, Bacterial Genome, Genomics, Salmonella/genetics, Yersinia pestis/genetics, Escherichia/classification, Genomics/methods, Metagenome, Metagenomics/methods, Multilocus Sequence Typing, Phylogeny, Salmonella/classification, Software, User-Computer Interface, Web Browser, Yersinia pestis/classification
ABSTRACT
Background: Most publicly available genomes of Salmonella enterica are from human disease in the US and the UK, or from domesticated animals in the US. Methods: Here we describe a historical collection of 10,000 strains isolated between 1891-2010 in 73 different countries. They encompass a broad range of sources, ranging from rivers through reptiles to the diversity of all S. enterica isolated on the island of Ireland between 2000 and 2005. Genomic DNA was isolated, and sequenced by Illumina short read sequencing. Results: The short reads are publicly available in the Short Reads Archive. They were also uploaded to EnteroBase, which assembled and annotated draft genomes. 9769 draft genomes which passed quality control were genotyped with multiple levels of multilocus sequence typing, and used to predict serovars. Genomes were assigned to hierarchical clusters on the basis of numbers of pair-wise allelic differences in core genes, which were mapped to genetic Lineages within phylogenetic trees. Conclusions: The University of Warwick/University College Cork (UoWUCC) project greatly extends the geographic sources, dates and core genomic diversity of publicly available S. enterica genomes. We illustrate these features by an overview of core genomic Lineages within 33,000 publicly available Salmonella genomes whose strains were isolated before 2011. We also present detailed examinations of HC400, HC900 and HC2000 hierarchical clusters within exemplar Lineages, including serovars Typhimurium, Enteritidis and Mbandaka. These analyses confirm the polyphyletic nature of multiple serovars while showing that discrete clusters with geographical specificity can be reliably recognized by hierarchical clustering approaches. The results also demonstrate that the genomes sequenced here provide an important counterbalance to the sampling bias which is so dominant in current genomic sequencing.
ABSTRACT
This month: selected work from the 2018 RECOMB meeting, organized by Ecole Polytechnique and held last April in Paris.
ABSTRACT
Salmonella enterica serovar Paratyphi C causes enteric (paratyphoid) fever in humans. Its presentation can range from asymptomatic infections of the blood stream to gastrointestinal or urinary tract infection or even a fatal septicemia [1]. Paratyphi C is very rare in Europe and North America except for occasional travelers from South and East Asia or Africa, where the disease is more common [2, 3]. However, early 20th-century observations in Eastern Europe [3, 4] suggest that Paratyphi C enteric fever may once have had a wide-ranging impact on human societies. Here, we describe a draft Paratyphi C genome (Ragna) recovered from the 800-year-old skeleton (SK152) of a young woman in Trondheim, Norway. Paratyphi C sequences were recovered from her teeth and bones, suggesting that she died of enteric fever and demonstrating that these bacteria have long caused invasive salmonellosis in Europeans. Comparative analyses against modern Salmonella genome sequences revealed that Paratyphi C is a clade within the Para C lineage, which also includes serovars Choleraesuis, Typhisuis, and Lomita. Although Paratyphi C only infects humans, Choleraesuis causes septicemia in pigs and boar [5] (and occasionally humans), and Typhisuis causes epidemic swine salmonellosis (chronic paratyphoid) in domestic pigs [2, 3]. These different host specificities likely evolved in Europe over the last ~4,000 years since the time of their most recent common ancestor (tMRCA) and are possibly associated with the differential acquisitions of two genomic islands, SPI-6 and SPI-7. The tMRCAs of these bacterial clades coincide with the timing of pig domestication in Europe [6].
Subjects
Ancient DNA/analysis, Bacterial DNA/analysis, Genomic Instability, Salmonella enterica/genetics, Typhoid Fever/microbiology, Female, Genomic Islands, Humans, Norway
ABSTRACT
Current methods struggle to reconstruct and visualize the genomic relationships of large numbers of bacterial genomes. GrapeTree facilitates the analyses of large numbers of allelic profiles by a static "GrapeTree Layout" algorithm that supports interactive visualizations of large trees within a web browser window. GrapeTree also implements a novel minimum spanning tree algorithm (MSTree V2) to reconstruct genetic relationships despite high levels of missing data. GrapeTree is a stand-alone package for investigating phylogenetic trees plus associated metadata and is also integrated into EnteroBase to facilitate cutting edge navigation of genomic relationships among bacterial pathogens.
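How a minimum spanning tree is built from allelic profiles that contain missing data can be sketched with the classical Prim algorithm. This is deliberately not MSTree V2, whose details are not given in the abstract. The sketch assumes a simple distance that counts differing loci and ignores loci missing (`None`) in either profile, and the function names and toy profiles are hypothetical.

```python
def allelic_distance(a, b):
    """Count loci with different allele numbers; ignore loci missing (None) in either profile."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns (parent, child, distance) edges in the order they were added."""
    names = sorted(profiles)
    if not names:
        return []
    root = names[0]
    # best[n] = (distance, tree node): cheapest known link from n into the growing tree
    best = {n: (allelic_distance(profiles[root], profiles[n]), root)
            for n in names[1:]}
    edges = []
    while best:
        # pick the closest node outside the tree; ties are broken by name
        child = min(best, key=lambda n: (best[n][0], n))
        dist, parent = best.pop(child)
        edges.append((parent, child, dist))
        # the newly added node may offer a shorter link to the remaining nodes
        for n in best:
            d = allelic_distance(profiles[child], profiles[n])
            if d < best[n][0]:
                best[n] = (d, child)
    return edges


if __name__ == "__main__":
    # Toy profiles over four loci; ST4 has no allele call at the second locus.
    profiles = {
        "ST1": [1, 1, 1, 1],
        "ST2": [1, 1, 1, 2],
        "ST3": [1, 1, 2, 2],
        "ST4": [1, None, 2, 3],
    }
    for parent, child, dist in minimum_spanning_tree(profiles):
        print(f"{parent} -> {child} ({dist} allelic differences)")
```

Ignoring missing loci makes incomplete profiles look closer than they may be, which is the kind of distortion that a missing-data-aware method has to correct for. The sketch makes no attempt to do so.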
Subjects
Bacteria/genetics, Taxonomic DNA Barcoding/methods, Bacterial Genome, Phylogeny, Software, Alleles, Bacteria/classification, Bacteria/pathogenicity
ABSTRACT
For many decades, Salmonella enterica has been subdivided by serological properties into serovars or further subdivided for epidemiological tracing by a variety of diagnostic tests with higher resolution. Recently, it has been proposed that so-called eBurst groups (eBGs) based on the alleles of seven housekeeping genes (legacy multilocus sequence typing [MLST]) corresponded to natural populations and could replace serotyping. However, this approach lacks the resolution needed for epidemiological tracing and the existence of natural populations had not been validated by independent criteria. Here, we describe EnteroBase, a web-based platform that assembles draft genomes from Illumina short reads in the public domain or that are uploaded by users. EnteroBase implements legacy MLST as well as ribosomal gene MLST (rMLST), core genome MLST (cgMLST), and whole genome MLST (wgMLST) and currently contains over 100,000 assembled genomes from Salmonella. It also provides graphical tools for visual interrogation of these genotypes and those based on core single nucleotide polymorphisms (SNPs). eBGs based on legacy MLST are largely consistent with eBGs based on rMLST, thus demonstrating that these correspond to natural populations. rMLST also facilitated the selection of representative genotypes for SNP analyses of the entire breadth of diversity within Salmonella. In contrast, cgMLST provides the resolution needed for epidemiological investigations. These observations show that genomic genotyping, with the assistance of EnteroBase, can be applied at all levels of diversity within the Salmonella genus.
Subjects
Genetic Databases, Bacterial Genome, Salmonella/classification, Salmonella/genetics, Multilocus Sequence Typing, Phylogeny, Single Nucleotide Polymorphism
ABSTRACT
For 100 years, it has been obvious that Salmonella enterica strains sharing the serotype with the formula 1,4,[5],12:b:1,2 (now known as Paratyphi B) can cause diseases ranging from serious systemic infections to self-limiting gastroenteritis. Despite considerable predicted diversity between strains carrying the common Paratyphi B serotype, there remain few methods that subdivide the group into subgroups that are congruent with their disease phenotypes. Paratyphi B therefore represents one of the canonical examples in Salmonella where serotyping combined with classical microbiological tests fails to provide clinically useful information. Here, we use genomics to provide the first high-resolution view of this serotype, placing it into a wider genomic context of the Salmonella enterica species. These analyses reveal why it has been impossible to subdivide this serotype based upon phenotypic and limited molecular approaches. By examining the genomic data in detail, we are able to identify common features that correlate with strains of clinical importance. The results presented here provide new diagnostic targets, as well as posing important new questions about the basis for the invasive disease phenotype observed in a subset of strains. IMPORTANCE: Salmonella enterica strains carrying the serotype Paratyphi B have long been known to possess Jekyll and Hyde characteristics; some cause gastroenteritis, while others cause serious invasive disease. Understanding what makes up the population of strains carrying this serotype, as well as the source of their invasive disease, is a 100-year-old puzzle that we address here using genomics. Our analysis provides the first high-resolution view of this serotype, placing strains carrying serotype Paratyphi B into the wider genomic context of the Salmonella enterica species.
This work reveals a history of disease dating back to the Middle Ages, caused by a group of distinct lineages with various abilities to cause invasive disease. By quantifying the key genomic differences between the invasive and noninvasive populations, we are able to identify key virulence-related targets that can form the basis of simple, rapid, point-of-care tests.
Subjects
Bacterial Genome, Genotype, Salmonella paratyphi B/classification, Salmonella paratyphi B/genetics, DNA Sequence Analysis, Animals, Cluster Analysis, Humans, Paratyphoid Fever/microbiology, Paratyphoid Fever/veterinary, Salmonella paratyphi B/isolation & purification
ABSTRACT
Only a few molecular studies have addressed the age of bacterial pathogens that infected humans before the beginnings of medical bacteriology, but these have provided dramatic insights. The global genetic diversity of Helicobacter pylori, which infects human stomachs, parallels that of its human host. The time to the most recent common ancestor (tMRCA) of these bacteria approximates that of anatomically modern humans, i.e. at least 100 000 years, after calibrating the evolutionary divergence within H. pylori against major ancient human migrations. Similarly, genomic reconstructions of Mycobacterium tuberculosis, the cause of tuberculosis, from ancient skeletons in South America and mummies in Hungary support estimates of less than 6000 years for the tMRCA of M. tuberculosis. Finally, modern global patterns of genetic diversity and ancient DNA studies indicate that during the last 5000 years plague caused by Yersinia pestis has spread globally on multiple occasions from China and Central Asia. Such tMRCA estimates provide only lower bounds on the ages of bacterial pathogens, and additional studies are needed for realistic upper bounds on how long humans and animals have suffered from bacterial diseases.
Subjects
Biological Evolution, Bacterial Genome, Helicobacter pylori/genetics, Mycobacterium tuberculosis/genetics, Yersinia pestis/genetics, Animals, Bacterial DNA/analysis, Humans
ABSTRACT
In 2013 Zhou et al. concluded that Salmonella enterica serovar Agona represents a genetically monomorphic lineage of recent ancestry, whose most recent common ancestor existed in 1932, or earlier. The Abstract stated 'Agona consists of three lineages with minimal mutational diversity: only 846 single nucleotide polymorphisms (SNPs) have accumulated in the non-repetitive, core genome since Agona evolved in 1932 and subsequently underwent a major population expansion in the 1960s.' These conclusions have now been criticized by Pettengill, who claims that the evolutionary models used to date Agona may not have been appropriate, the dating estimates were inaccurate, and the age of emergence of Agona should have been qualified by an upper limit reflecting the date of its divergence from an outgroup, serovar Soerenga. We dispute these claims. Firstly, Pettengill's analysis of Agona is not justifiable on technical grounds. Secondly, an upper limit for divergence from an outgroup would only be meaningful if the outgroup were closely related to Agona, but close relatives of Agona are yet to be identified. Thirdly, it is not possible to reliably date the time of divergence between Agona and Soerenga. We conclude that Pettengill's criticism is comparable to a tempest in a teapot.