ABSTRACT
By decomposing genome sequences into k-mers, it is possible to estimate genome differences without alignment. Techniques such as k-mer minimisers, for example MinHash, have been developed and are often accurate approximations of distances based on full k-mer sets. These and other alignment-free methods avoid the large temporal and computational expense of alignment. However, these k-mer set comparisons are not entirely accurate within-species and can be completely inaccurate within-lineage. This is due, in part, to their inability to distinguish core polymorphism from accessory differences. Here we present a new approach, KmerAperture, which uses information on the k-mer relative genomic positions to determine the type of polymorphism causing differences in k-mer presence and absence between pairs of genomes. Single SNPs are expected to result in k unique contiguous k-mers per genome. On the other hand, a contiguous series of S > k unique k-mers may be caused by an accessory difference of length S-k+1, when the start and end of the sequence are contiguous with homologous sequence. Alternatively, such a series may be caused by multiple SNPs within k bp of each other, and KmerAperture can determine whether that is the case. To demonstrate use cases, KmerAperture was benchmarked using datasets including a very low diversity simulated population with accessory content independent from the number of SNPs, a simulated population where SNPs are spatially dense, a moderately diverse real cluster of genomes (Escherichia coli ST1193) with a large accessory genome and a low diversity real genome cluster (Salmonella Typhimurium ST34). We show that KmerAperture can accurately distinguish both core and accessory sequence diversity without alignment, outperforming other k-mer based tools.
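The k-mer counts quoted in this abstract can be checked on a toy example. The sketch below is not KmerAperture's implementation: it only counts k-mers unique to each genome with plain set differences and ignores their genomic positions, which the tool itself relies on. The sequences, the helper names `kmer_set` and `unique_kmers`, and the choice k = 5 are illustrative assumptions; here S stands for the number of unique k-mers found in the genome carrying the insertion.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(a: str, b: str, k: int) -> tuple[set[str], set[str]]:
    """Return (k-mers found only in a, k-mers found only in b)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return ka - kb, kb - ka


k = 5
ref = "ACGTTGCAAGCTAGGCTTACC"        # toy sequence with no repeated 5-mers
snp = ref[:10] + "A" + ref[11:]       # single C -> A substitution at index 10
ins = ref[:11] + "GACA" + ref[11:]    # 4 bp insertion flanked by shared sequence

# A single SNP: k unique k-mers in each genome.
ref_vs_snp, snp_only = unique_kmers(ref, snp, k)
print(len(ref_vs_snp), len(snp_only))   # 5 5

# An insertion: S unique k-mers in the longer genome, k-1 in the other.
ref_only, ins_only = unique_kmers(ref, ins, k)
s = len(ins_only)
print(len(ref_only), s, s - k + 1)      # 4 8 4
```

In the second comparison the inferred length S-k+1 recovers the 4 bp insertion. A count-only approach like this one cannot tell an insertion from several closely spaced SNPs, which is why positional information is needed.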
Subject(s)
Genome, Bacterial , Polymorphism, Single Nucleotide , Polymorphism, Single Nucleotide/genetics , Synteny , Genomics/methods , Algorithms , Escherichia coli/genetics , Software , Sequence Alignment/methods , Phylogeny
ABSTRACT
In recent times, pathogen genome sequencing has become increasingly used to investigate infectious disease outbreaks. When genomic data is sampled densely enough amongst infected individuals, it can help resolve who infected whom. However, transmission analysis cannot rely solely on a phylogeny of the genomes but must account for the within-host evolution of the pathogen, which blurs the relationship between phylogenetic and transmission trees. When only a single genome is sampled for each host, the uncertainty about who infected whom can be quite high. Consequently, transmission analysis based on multiple genomes of the same pathogen per host has a clear potential for delivering more precise results, even though it is more laborious to achieve. Here, we present a new methodology that can use any number of genomes sampled from a set of individuals to reconstruct their transmission network. Furthermore, we remove the need for the assumption of a complete transmission bottleneck. We use simulated data to show that our method becomes more accurate as more genomes per host are provided, and that it can infer key infectious disease parameters such as the size of the transmission bottleneck, within-host growth rate, basic reproduction number, and sampling fraction. We demonstrate the usefulness of our method in applications to real datasets from an outbreak of Pseudomonas aeruginosa amongst cystic fibrosis patients and a nosocomial outbreak of Klebsiella pneumoniae.
Subject(s)
Communicable Diseases , Humans , Phylogeny , Communicable Diseases/genetics , Communicable Diseases/epidemiology , Disease Outbreaks , Genomics , Chromosome Mapping , Disease Transmission, Infectious
ABSTRACT
MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates recombinations, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most one recombination; we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most one recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
Subject(s)
Algorithms , Genome, Bacterial , Recombination, Genetic , Sequence Alignment , Sequence Alignment/methods , Humans , Software , Sequence Analysis, DNA/methods , Genomics/methods
ABSTRACT
BACKGROUND: Carbapenemase-producing Enterobacterales (CPE) are challenging in healthcare, with resistance to multiple classes of antibiotics. This study describes the emergence of imipenemase (IMP)-encoding CPE among diverse Enterobacterales species between 2016 and 2019 across a London regional network. METHODS: We performed a network analysis of patient pathways, using electronic health records, to identify contacts between IMP-encoding CPE-positive patients. Genomes of IMP-encoding CPE isolates were overlaid with patient contacts to imply potential transmission events. RESULTS: Genomic analysis of 84 Enterobacterales isolates revealed diverse species (predominantly Klebsiella spp, Enterobacter spp, and Escherichia coli); 86% (72 of 84) harbored an IncHI2 plasmid carrying blaIMP and colistin resistance gene mcr-9 (68 of 72). Phylogenetic analysis of IncHI2 plasmids identified 3 lineages showing significant association with patient contacts and movements between 4 hospital sites and across medical specialties, which was missed in initial investigations. CONCLUSIONS: Combined, our patient network and plasmid analyses demonstrate an interspecies, plasmid-mediated outbreak of blaIMP CPE, which remained unidentified during standard investigations. With DNA sequencing and multimodal data incorporation, the outbreak investigation approach proposed here provides a framework for real-time identification of key factors causing pathogen spread. Plasmid-level outbreak analysis reveals that resistance spread may be wider than suspected, allowing more interventions to stop transmission within hospital networks. SUMMARY: This was an investigation, using integrated pathway networks and genomics methods, of the emergence of imipenemase-encoding carbapenemase-producing Enterobacterales among diverse Enterobacterales species between 2016 and 2019 in patients across a London regional hospital network, which was missed on routine investigations.
Subject(s)
Bacterial Proteins , Disease Outbreaks , Enterobacteriaceae Infections , Plasmids , beta-Lactamases , Humans , Plasmids/genetics , beta-Lactamases/genetics , Enterobacteriaceae Infections/epidemiology , Enterobacteriaceae Infections/microbiology , Enterobacteriaceae Infections/transmission , Bacterial Proteins/genetics , London/epidemiology , Anti-Bacterial Agents/pharmacology , Phylogeny , Genome, Bacterial , Male , Female , Middle Aged , Microbial Sensitivity Tests , Adult , Enterobacteriaceae/genetics , Enterobacteriaceae/drug effects , Aged , Carbapenem-Resistant Enterobacteriaceae/genetics , Carbapenem-Resistant Enterobacteriaceae/isolation & purification , Colistin/pharmacology
ABSTRACT
Since the start of the SARS-CoV-2 pandemic in late 2019, several variants of concern (VOC) have been reported to have increased transmissibility. In addition, despite the progress of vaccination against SARS-CoV-2 worldwide, all vaccines currently in use are known to protect only partially from infection and onward transmission. We combined phylogenetic analysis with Bayesian inference under an epidemiological model to infer the reproduction number (Rt) and also trace person-to-person transmission. We examined the impact of phylogenetic uncertainty and sampling bias on the estimation. Our results indicated that lineage B had a significantly higher transmissibility than lineage A and contributed to the global pandemic to a large extent. In addition, although the transmissibility of VOCs is higher than that of other exponentially growing lineages, this difference is not very high. The probability of detecting onward transmission from patients infected with SARS-CoV-2 VOCs who had received at least one dose of vaccine was approximately 1.06% (3/284), which was slightly lower than, but not statistically significantly different from, the probability of 1.21% (10/828) for unvaccinated individuals. In addition to VOCs, exponentially growing lineages in each country should also be accounted for when tailoring prevention and control strategies. One dose of vaccination could not efficiently prevent the onward transmission of SARS-CoV-2 VOCs. Consequently, nonpharmaceutical interventions (such as wearing masks and social distancing) should still be implemented in each country during the vaccination period.
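The two proportions quoted above, and the claim that they do not differ significantly, can be reproduced from the counts alone. The abstract does not say which test the authors used; the sketch below assumes a two-sided Fisher's exact test on the 2x2 table of onward transmission detected versus not detected, and the function name `fisher_exact_two_sided` is ours.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


vacc_pos, vacc_total = 3, 284        # at least one vaccine dose
unvacc_pos, unvacc_total = 10, 828   # unvaccinated

pct_vacc = round(100 * vacc_pos / vacc_total, 2)
pct_unvacc = round(100 * unvacc_pos / unvacc_total, 2)
p_value = fisher_exact_two_sided(
    vacc_pos, vacc_total - vacc_pos, unvacc_pos, unvacc_total - unvacc_pos
)
print(pct_vacc, pct_unvacc, p_value > 0.05)   # 1.06 1.21 True
```

With so few detected transmission events in either group, the test has little power, so the lack of a significant difference should not be read as evidence of equal risk.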
Subject(s)
COVID-19/transmission , COVID-19/virology , SARS-CoV-2/classification , SARS-CoV-2/genetics , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19 Vaccines , Evolution, Molecular , Genome, Viral , Global Health , Humans , Phylogeny , Public Health Surveillance , SARS-CoV-2/immunology , Vaccination
ABSTRACT
MOTIVATION: The ability to distinguish imported cases from locally acquired cases has important consequences for the selection of public health control strategies. Genomic data can be useful for this, for example, using a phylogeographic analysis in which genomic data from multiple locations are compared to determine likely migration events between locations. However, these methods typically require good samples of genomes from all locations, which are rarely available. RESULTS: Here, we propose an alternative approach that only uses genomic data from a location of interest. By comparing each new case with previous cases from the same location, we are able to detect imported cases, as they have a different genealogical distribution than that of locally acquired cases. We show that, when variations in the size of the local population are accounted for, our method has good sensitivity and excellent specificity for the detection of imports. We applied our method to data simulated under the structured coalescent model and demonstrate relatively good performance even when the local population has the same size as the external population. Finally, we applied our method to several recent genomic datasets from both bacterial and viral pathogens, and show that it can, in a matter of seconds or minutes, deliver important insights on the number of imports to a geographically limited sample of a pathogen population. AVAILABILITY AND IMPLEMENTATION: The R package DetectImports is freely available from https://github.com/xavierdidelot/DetectImports. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Communicable Diseases , Software , Humans , Genomics/methods , Genome , Phylogeography , Communicable Diseases/diagnosis , Communicable Diseases/genetics
ABSTRACT
Microbial population genetics models often assume that all lineages are constrained by the same population size dynamics over time. However, many neutral and selective events can invalidate this assumption and can contribute to the clonal expansion of a specific lineage relative to the rest of the population. Such differential phylodynamic properties between lineages result in asymmetries and imbalances in phylogenetic trees that are sometimes described informally but which are difficult to analyze formally. To this end, we developed a model of how clonal expansions occur and affect the branching patterns of a phylogeny. We show how the parameters of this model can be inferred from a given dated phylogeny using Bayesian statistics, which allows us to assess the probability that one or more clonal expansion events occurred. For each putative clonal expansion event, we estimate its date of emergence and subsequent phylodynamic trajectory, including its long-term evolutionary potential which is important to determine how much effort should be placed on specific control measures. We demonstrate the applicability of our methodology on simulated and real data sets. Inference under our clonal expansion model can reveal important features in the evolution and epidemiology of infectious disease pathogens. [Clonal expansion; genomic epidemiology; microbial population genomics; phylodynamics.].
Subject(s)
Genetics, Population , Genomics , Bayes Theorem , Phylogeny , Probability
ABSTRACT
Phylogenetic dating is one of the most powerful and commonly used methods of drawing epidemiological interpretations from pathogen genomic data. Building such trees requires considering a molecular clock model which represents the rate at which substitutions accumulate on genomes. When the molecular clock rate is constant throughout the tree then the clock is said to be strict, but this is often not an acceptable assumption. Alternatively, relaxed clock models consider variations in the clock rate, often based on a distribution of rates for each branch. However, we show here that the distributions of rates across branches in commonly used relaxed clock models are incompatible with the biological expectation that the sum of the numbers of substitutions on two neighboring branches should be distributed as the substitution number on a single branch of equivalent length. We call this expectation the additivity property. We further show how assumptions of commonly used relaxed clock models can lead to estimates of evolutionary rates and dates with low precision and biased confidence intervals. We therefore propose a new additive relaxed clock model where the additivity property is satisfied. We illustrate the use of our new additive relaxed clock model on a range of simulated and real data sets, and we show that using this new model leads to more accurate estimates of mean evolutionary rates and ancestral dates.
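The additivity property described above can be written compactly. The notation below is ours rather than the paper's: X(ℓ) denotes the number of substitutions on a branch of duration ℓ, and the two neighboring branches are taken to be independent. The second statement is the standard check that a strict clock with rate μ satisfies the property, because a sum of independent Poisson variables is again Poisson.

```latex
% Additivity: two neighboring branches behave like one branch of the combined length
X(\ell_1) + X(\ell_2) \;\overset{d}{=}\; X(\ell_1 + \ell_2)

% Strict clock with rate \mu satisfies it:
X(\ell) \sim \mathrm{Poisson}(\mu \ell)
\;\Longrightarrow\;
X(\ell_1) + X(\ell_2) \sim \mathrm{Poisson}\bigl(\mu(\ell_1 + \ell_2)\bigr)
```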
Subject(s)
Evolution, Molecular , Genome, Bacterial , Models, Genetic , Phylogeny , Mutation
ABSTRACT
The coalescent model represents how individuals sampled from a population may have originated from a last common ancestor. The bounded coalescent model is obtained by conditioning the coalescent model such that the last common ancestor must have existed after a certain date. This conditioned model arises in a variety of applications, such as speciation, horizontal gene transfer or transmission analysis, and yet the bounded coalescent model has not been previously analysed in detail. Here we describe a new algorithm to simulate from this model directly, without resorting to rejection sampling. We show that this direct simulation algorithm is more computationally efficient than the rejection sampling approach. We also show how to calculate the probability of the last common ancestor occurring after a given date, which is required to compute the probability density of realisations under the bounded coalescent model. Our results are applicable in both the isochronous (when all samples have the same date) and heterochronous (where samples can have different dates) settings. We explore the effect of setting a bound on the date of the last common ancestor, and show that it affects a number of properties of the resulting phylogenies. All our methods are implemented in a new R package called BoundedCoalescent which is freely available online.
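For orientation, the rejection-sampling baseline that the abstract compares against can be sketched in a few lines. This is not the direct algorithm of the BoundedCoalescent package. It covers only the isochronous case, assumes a constant effective population size `ne` with time scaled so that each pair of lineages coalesces at rate 1/ne, and all function names are hypothetical.

```python
import random


def coalescent_intervals(n: int, ne: float, rng: random.Random) -> list[float]:
    """Waiting times between successive coalescences for n same-date samples.

    With k lineages remaining, the next coalescence occurs at rate
    k * (k - 1) / (2 * ne).
    """
    return [rng.expovariate(k * (k - 1) / (2.0 * ne)) for k in range(n, 1, -1)]


def bounded_by_rejection(n, ne, bound, rng, max_tries=100_000):
    """Redraw genealogies until the last common ancestor is younger than `bound`.

    `bound` is measured backwards in time from the sampling date.
    Returns the accepted intervals and the number of attempts used.
    """
    for tries in range(1, max_tries + 1):
        intervals = coalescent_intervals(n, ne, rng)
        if sum(intervals) < bound:
            return intervals, tries
    raise RuntimeError("bound too tight for rejection sampling")


rng = random.Random(1)
intervals, tries = bounded_by_rejection(n=5, ne=1.0, bound=1.0, rng=rng)
print(f"TMRCA = {sum(intervals):.3f} after {tries} attempt(s)")
```

The number of attempts grows quickly as the bound tightens relative to `ne`, which is the inefficiency a direct simulation algorithm avoids.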
Subject(s)
Algorithms , Models, Genetic , Computer Simulation , Genetics, Population , Humans , Phylogeny , Probability
ABSTRACT
Population structure influences genealogical patterns; however, data pertaining to how populations are structured are often unavailable or not directly observable. Inference of population structure is highly important in molecular epidemiology where pathogen phylogenetics is increasingly used to infer transmission patterns and detect outbreaks. Discrepancies between observed and idealized genealogies, such as those generated by the coalescent process, can be quantified, and where significant differences occur, may reveal the action of natural selection, host population structure, or other demographic and epidemiological heterogeneities. We have developed a fast non-parametric statistical test for detection of cryptic population structure in time-scaled phylogenetic trees. The test is based on contrasting estimated phylogenies with the theoretically expected phylodynamic ordering of common ancestors in two clades within a coalescent framework. These statistical tests have also motivated the development of algorithms which can be used to quickly screen a phylogenetic tree for clades which are likely to share a distinct demographic or epidemiological history. Epidemiological applications include identification of outbreaks in vulnerable host populations or rapid expansion of genotypes with a fitness advantage. To demonstrate the utility of these methods for outbreak detection, we applied the new methods to large phylogenies reconstructed from thousands of HIV-1 partial pol sequences. This revealed the presence of clades which had grown rapidly in the recent past and were significantly concentrated in young men, suggesting recent and rapid transmission in that group. Furthermore, to demonstrate the utility of these methods for the study of antimicrobial resistance, we applied the new methods to a large phylogeny reconstructed from whole genome Neisseria gonorrhoeae sequences. We find that population structure detected using these methods closely overlaps with the appearance and expansion of mutations conferring antimicrobial resistance. [Antimicrobial resistance; coalescent; HIV; population structure.].
Subject(s)
Molecular Epidemiology/methods , Phylogeny , Drug Resistance, Bacterial/genetics , Genome, Bacterial/genetics , HIV-1/classification , HIV-1/genetics , Humans , Male , Neisseria gonorrhoeae/classification , Neisseria gonorrhoeae/drug effects , Neisseria gonorrhoeae/genetics , Time , pol Gene Products, Human Immunodeficiency Virus/genetics
ABSTRACT
BACKGROUND: Gonorrhea incidence is increasing rapidly in many countries, while antibiotic resistance is making treatment more difficult. Combined with evidence that two meningococcal vaccines are likely partially protective against gonorrhea, this has renewed interest in a gonococcal vaccine, and several candidates are in development. Key questions are how protective and long-lasting a vaccine needs to be, and how to target it. We assessed vaccination's potential impact and the feasibility of achieving the World Health Organization's (WHO) target of reducing gonorrhea incidence by 90% during 2018-2030, by comparing realistic vaccination strategies under a range of scenarios of vaccine efficacy and duration of protection, and emergence of extensively-resistant gonorrhea. METHODS: We developed a stochastic transmission-dynamic model, incorporating asymptomatic and symptomatic infection and heterogeneous sexual behavior in men who have sex with men (MSM). We used data from England, which has a comprehensive, consistent nationwide surveillance system. Using particle Markov chain Monte Carlo methods, we fitted to gonorrhea incidence in 2008-2017, then used Bayesian forecasting to examine an extensive range of scenarios. RESULTS: Even in the worst-case scenario of untreatable infection emerging, the WHO target is achievable if all MSM attending sexual health clinics receive a vaccine offering ≥ 52% protection for ≥ 6 years. A vaccine conferring 31% protection (as estimated for MeNZB) for 2-4 years could reduce incidence in 2030 by 45% in the worst-case scenario, and by 75% if > 70% of resistant gonorrhea remains treatable. CONCLUSIONS: Even a partially-protective vaccine, delivered through a realistic targeting strategy, could substantially reduce gonorrhea incidence, despite antibiotic resistance.
Subject(s)
Gonorrhea , Sexual and Gender Minorities , Bayes Theorem , Drug Resistance, Microbial , England , Gonorrhea/epidemiology , Gonorrhea/prevention & control , Homosexuality, Male , Humans , Male , Neisseria gonorrhoeae , Vaccination
ABSTRACT
Human networks of sexual contacts are dynamic by nature, with partnerships forming and breaking continuously over time. Sexual behaviours are also highly heterogeneous, so that the number of partners reported by individuals over a given period of time is typically distributed as a power-law. Both the dynamism and heterogeneity of sexual partnerships are likely to have an effect on the patterns of spread of sexually transmitted diseases. To represent these two fundamental properties of sexual networks, we developed a stochastic process of dynamic partnership formation and dissolution, which results in power-law numbers of partners over time. Model parameters can be set to produce realistic conditions in terms of the exponent of the power-law distribution, of the number of individuals without relationships and of the average duration of relationships. Using an outbreak of antibiotic resistant gonorrhoea amongst men who have sex with men as a case study, we show that our realistic dynamic network exhibits different properties compared to the frequently used static networks or homogeneous mixing models. We also consider an approximation to our dynamic network model in terms of a much simpler branching process. We estimate the parameters of the generation time distribution and offspring distribution which can be used for example in the context of outbreak reconstruction based on genomic data. Finally, we investigate the impact of a range of interventions against gonorrhoea, including increased condom use, more frequent screening and immunisation, concluding that the latter shows great promise to reduce the burden of gonorrhoea, even if the vaccine was only partially effective or applied to only a random subset of the population.
Subject(s)
Disease Outbreaks , Gonorrhea/epidemiology , Models, Theoretical , Gonorrhea/transmission , Humans , Sexual Behavior
ABSTRACT
The sequencing and comparative analysis of a collection of bacterial genomes from a single species or lineage of interest can lead to key insights into its evolution, ecology or epidemiology. The tool of choice for such a study is often to build a phylogenetic tree, and more specifically when possible a dated phylogeny, in which the dates of all common ancestors are estimated. Here, we propose a new Bayesian methodology to construct dated phylogenies which is specifically designed for bacterial genomics. Unlike previous Bayesian methods aimed at building dated phylogenies, we consider that the phylogenetic relationships between the genomes have been previously evaluated using a standard phylogenetic method, which makes our methodology much faster and more scalable. This two-step approach also allows us to directly exploit existing phylogenetic methods that detect bacterial recombination, and therefore to account for the effect of recombination in the construction of a dated phylogeny. We analysed many simulated datasets in order to benchmark the performance of our approach in a wide range of situations. Furthermore, we present applications to three different real datasets from recent bacterial genomic studies. Our methodology is implemented in an R package called BactDating which is freely available for download at https://github.com/xavierdidelot/BactDating.
Subject(s)
Bayes Theorem , Evolution, Molecular , Genome, Bacterial , Models, Genetic , Phylogeny , Benchmarking , Computer Simulation , DNA, Bacterial/genetics , Datasets as Topic , Markov Chains , Monte Carlo Method , Mycobacterium leprae/genetics , Recombination, Genetic , Shigella sonnei/genetics , Software , Streptococcus pneumoniae/genetics , Time Factors
ABSTRACT
BACKGROUND: The first cases of extensively drug resistant gonorrhoea were recorded in the United Kingdom in 2018. There is a public health need for strategies on how to deploy existing and novel antibiotics to minimise the risk of resistance development. As rapid point-of-care tests (POCTs) to predict susceptibility are coming to clinical use, coupling the introduction of an antibiotic with diagnostics that can slow resistance emergence may offer a novel paradigm for maximising antibiotic benefits. Gepotidacin is a novel antibiotic with known resistance and resistance-predisposing mutations. In particular, a mutation that confers resistance to ciprofloxacin acts as the 'stepping-stone' mutation to gepotidacin resistance. AIM: To investigate how POCTs detecting Neisseria gonorrhoeae resistance mutations for ciprofloxacin and gepotidacin can be used to minimise the risk of resistance development to gepotidacin. METHODS: We use individual-based stochastic simulations to formally investigate the aim. RESULTS: The level of testing needed to reduce the risk of resistance development depends on the mutation rate under treatment and the prevalence of stepping-stone mutations. A POCT is most effective if the mutation rate under antibiotic treatment is no more than two orders of magnitude above the mutation rate without treatment and the prevalence of stepping-stone mutations is 1-13%. CONCLUSION: Mutation frequencies and rates should be considered when estimating the POCT usage required to reduce the risk of resistance development in a given population. Molecular POCTs for resistance mutations and stepping-stone mutations to resistance are likely to become important tools in antibiotic stewardship.
Subject(s)
Anti-Bacterial Agents , Clinical Decision-Making , Drug Resistance, Bacterial , Gonorrhea , Point-of-Care Testing , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/therapeutic use , Clinical Decision-Making/methods , Drug Resistance, Bacterial/drug effects , Drug Resistance, Bacterial/genetics , Gonorrhea/drug therapy , Gonorrhea/microbiology , Humans , Neisseria gonorrhoeae/drug effects , Neisseria gonorrhoeae/genetics , United Kingdom
ABSTRACT
Real-time PCR is a highly sensitive and powerful technology for the quantification of DNA and has become the method of choice in microbiology, bioengineering, and molecular biology. Currently, the analysis of real-time PCR data is hampered by only considering a single feature of the amplification profile to generate a standard curve. The current "gold standard" is the cycle-threshold (Ct) method, which is known to provide poor quantification under inconsistent reaction efficiencies. Multiple single-feature methods have been developed to overcome the limitations of the Ct method; however, there is an unexplored area of combining multiple features in order to benefit from their joint information. Here, we propose a novel framework that combines existing standard curve methods into a multidimensional standard curve. This is achieved by considering multiple features together such that each amplification curve is viewed as a point in a multidimensional space. In contrast to the single-feature case, data points in the multidimensional space do not fall exactly on the standard curve, which enables a similarity measure between amplification curves based on distances between data points. We show that this framework expands the capabilities of standard curves in order to optimize quantification performance, provide a measure of how suitable an amplification curve is for a standard, and thus automatically detect outliers and increase the reliability of quantification. Our aim is to provide an affordable solution to enhance existing diagnostic settings through maximizing the amount of information extracted from conventional instruments.
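As background for the single-feature baseline discussed above, a conventional Ct standard curve is a straight-line fit of Ct against log10 of the known starting quantity, which is then inverted to quantify unknown samples. The sketch below illustrates only that conventional approach, not the multidimensional framework proposed in the abstract. The dilution series is synthetic, generated under an assumed 100% amplification efficiency, and the function names are hypothetical.

```python
import math


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = intercept + slope * log10(quantity)."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantify(ct, slope, intercept):
    """Invert the standard curve to estimate the starting quantity."""
    return 10 ** ((ct - intercept) / slope)


# Synthetic ten-fold dilution series with perfect doubling every cycle.
true_slope, true_intercept = -1 / math.log10(2), 38.0
quantities = [1e2, 1e3, 1e4, 1e5, 1e6]
cts = [true_intercept + true_slope * math.log10(q) for q in quantities]

slope, intercept = fit_standard_curve(quantities, cts)
efficiency = 10 ** (-1 / slope) - 1          # 1.0 corresponds to 100 %
estimate = quantify(cts[2], slope, intercept)
print(f"slope={slope:.3f} efficiency={efficiency:.2f} estimate={estimate:.0f}")
```

Because the fit assumes a single shared slope, a sample that amplifies with a different efficiency is mapped to the wrong quantity, which is the limitation the abstract attributes to the Ct method.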
Subject(s)
DNA/genetics , Real-Time Polymerase Chain Reaction/standards
ABSTRACT
Nonparametric population genetic modeling provides a simple and flexible approach for studying demographic history and epidemic dynamics using pathogen sequence data. Existing Bayesian approaches are premised on stochastic processes with stationary increments, which may provide an unrealistic prior for epidemic histories that feature extended periods of exponential growth or decline. We show that nonparametric models defined in terms of the growth rate of the effective population size can provide a more realistic prior for epidemic history. We propose a nonparametric autoregressive model on the growth rate as a prior for effective population size, which corresponds to the dynamics expected under many epidemic situations. We demonstrate the use of this model within a Bayesian phylodynamic inference framework. Our method correctly reconstructs trends of epidemic growth and decline from pathogen genealogies even when genealogical data are sparse and conventional skyline estimators erroneously predict stable population size. We also propose a regression approach for relating growth rates of pathogen effective population size and time-varying variables that may impact the replicative fitness of a pathogen. The model is applied to real data from rabies virus and Staphylococcus aureus epidemics. We find a close correspondence between the estimated growth rates of a lineage of methicillin-resistant S. aureus and population-level prescription rates of $\beta$-lactam antibiotics. The new models are implemented in an open source R package called skygrowth which is available at https://github.com/mrc-ide/skygrowth.
Subject(s)
Models, Genetic , Rabies virus/physiology , Rabies/virology , Staphylococcal Infections/microbiology , Staphylococcus aureus/physiology , Anti-Bacterial Agents/administration & dosage , Anti-Bacterial Agents/pharmacology , Bayes Theorem , Methicillin-Resistant Staphylococcus aureus/drug effects , Methicillin-Resistant Staphylococcus aureus/physiology , Population Density , Population Growth , Statistics, Nonparametric , beta-Lactams/administration & dosage , beta-Lactams/pharmacology
ABSTRACT
Genome-Wide Association Studies (GWAS) in microbial organisms have the potential to vastly improve the way we understand, manage, and treat infectious diseases. Yet, microbial GWAS methods established thus far remain insufficiently able to capitalise on the growing wealth of bacterial and viral genetic sequence data. Facing clonal population structure and homologous recombination, existing GWAS methods struggle to achieve both the precision necessary to reject spurious findings and the power required to detect associations in microbes. In this paper, we introduce a novel phylogenetic approach that has been tailor-made for microbial GWAS, which is applicable to organisms ranging from purely clonal to frequently recombining, and to both binary and continuous phenotypes. Our approach is robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Thorough testing via application to simulated data provides strong support for the power and specificity of our approach and demonstrates the advantages offered over alternative cluster-based and dimension-reduction methods. Two applications to Neisseria meningitidis illustrate the versatility and potential of our method, confirming previously-identified penicillin resistance loci and resulting in the identification of both well-characterised and novel drivers of invasive disease. Our method is implemented as an open-source R package called treeWAS which is freely available at https://github.com/caitiecollins/treeWAS.
Subject(s)
Gene Expression Regulation, Bacterial , Genetic Association Studies , Phylogeny , Recombination, Genetic , Algorithms , Cluster Analysis , Computational Biology , Computer Simulation , Drug Resistance, Bacterial , Genome, Bacterial , Genomics , Humans , Models, Statistical , Neisseria meningitidis/genetics , Penicillins , Phenotype , Polymorphism, Single Nucleotide , Programming Languages , Software
ABSTRACT
BACKGROUND: Reconstructing individual transmission events in an infectious disease outbreak can provide valuable information and help inform infection control policy. Recent years have seen considerable progress in the development of methodologies for reconstructing transmission chains using both epidemiological and genetic data. However, only a few of these methods have been implemented in software packages, and with little consideration for customisability and interoperability. Users are therefore limited to a small number of alternatives, incompatible tools with fixed functionality, or forced to develop their own algorithms at considerable personal effort. RESULTS: Here we present outbreaker2, a flexible framework for outbreak reconstruction. This R package re-implements and extends the original model introduced with outbreaker, but most importantly also provides a modular platform allowing users to specify custom models within an optimised inferential framework. As a proof of concept, we implement the within-host evolutionary model introduced with TransPhylo, which is very distinct from the original genetic model in outbreaker, and demonstrate how even complex model results can be successfully included with minimal effort. CONCLUSIONS: outbreaker2 provides a valuable starting point for future outbreak reconstruction tools, and represents a unifying platform that promotes customisability and interoperability. Implemented in the R software, outbreaker2 joins a growing body of tools for outbreak analysis.
Subject(s)
Disease Outbreaks , Software , Algorithms , Biological Evolution , Ebolavirus/physiology , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/virology , Humans , Markov Chains , Models, Theoretical , Monte Carlo Method
ABSTRACT
Genomic data are increasingly being used to understand infectious disease epidemiology. Isolates from a given outbreak are sequenced, and the patterns of shared variation are used to infer which isolates within the outbreak are most closely related to each other. Unfortunately, the phylogenetic trees typically used to represent this variation are not directly informative about who infected whom: a phylogenetic tree is not a transmission tree. However, a transmission tree can be inferred from a phylogeny while accounting for within-host genetic diversity by coloring the branches of a phylogeny according to which host those branches were in. Here we extend this approach and show that it can be applied to partially sampled and ongoing outbreaks. This requires computing the correct probability of an observed transmission tree and we herein demonstrate how to do this for a large class of epidemiological models. We also demonstrate how the branch coloring approach can incorporate a variable number of unique colors to represent unsampled intermediates in transmission chains. The resulting algorithm is a reversible jump Monte-Carlo Markov Chain, which we apply to both simulated data and real data from an outbreak of tuberculosis. By accounting for unsampled cases and an outbreak which may not have reached its end, our method is uniquely suited to use in a public health environment during real-time outbreak investigations. We implemented this transmission tree inference methodology in an R package called TransPhylo, which is freely available from https://github.com/xavierdidelot/TransPhylo.
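The branch-coloring idea described in this abstract can be illustrated with a toy sketch in Python. This is not the TransPhylo implementation (which is an R package and performs full Bayesian inference); it only shows how a transmission tree could be read off a phylogeny once every node has already been assigned a host color. The function name `transmission_edges`, the `parent`/`host` dictionaries, and the example tree are all hypothetical, and the sketch assumes the coloring is valid, meaning each host's color forms one connected region of the tree.

```python
from typing import Dict, List, Optional, Tuple


def transmission_edges(
    parent: Dict[str, Optional[str]],
    host: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (infector, infectee) pairs implied by a host-colored phylogeny.

    parent maps each node to its parent node (None for the root).
    host maps each node to the host ("color") assigned to it.
    A change of color along a branch is read as one transmission event.
    Assumption: the coloring is valid; this toy function does not check it.
    """
    edges = set()
    for node, par in parent.items():
        if par is not None and host[par] != host[node]:
            edges.add((host[par], host[node]))
    return sorted(edges)


if __name__ == "__main__":
    # Hypothetical five-node phylogeny: root n0 with children tipA and n1;
    # n1 has children tipB and tipC.
    parent = {"n0": None, "tipA": "n0", "n1": "n0", "tipB": "n1", "tipC": "n1"}
    # Hypothetical coloring: host A infected B, and B infected C.
    host = {"n0": "A", "tipA": "A", "n1": "B", "tipB": "B", "tipC": "C"}
    print(transmission_edges(parent, host))
```

In this sketch an unsampled intermediate would simply be an extra color that appears on internal nodes but on no tip, which is in the spirit of the variable number of colors the abstract mentions.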
Subject(s)
Communicable Diseases/classification , Computational Biology/methods , Disease Transmission, Infectious/statistics & numerical data , Algorithms , Communicable Diseases/epidemiology , Communicable Diseases/genetics , Computer Simulation , Disease Outbreaks , Genomics/methods , Humans , Markov Chains , Models, Genetic , Monte Carlo Method , Phylogeny , Probability , Software
ABSTRACT
Diversity of the polysaccharide capsule in Streptococcus pneumoniae, the main surface antigen and the target of the currently used pneumococcal vaccines, constitutes a major obstacle in eliminating pneumococcal disease. Such diversity is genetically encoded by almost 100 variants of the capsule biosynthesis locus, cps. However, the evolutionary dynamics of the capsule remain incompletely understood. Here, using genetic data from 4,519 bacterial isolates, we found cps to be an evolutionary hotspot with elevated substitution and recombination rates. These rates were a consequence of relaxed purifying selection and positive, diversifying selection acting at this locus, supporting the hypothesis that the capsule has an increased potential to generate novel diversity compared with the rest of the genome. Diversifying selection was particularly evident in the region of wzd/wze genes, which are known to regulate capsule expression and hence the bacterium's ability to cause disease. Using a novel, capsule-centered approach, we analyzed the evolutionary history of 12 major serogroups. Such analysis revealed their complex diversification scenarios, which were principally driven by recombination with other serogroups and other streptococci. Patterns of recombinational exchanges between serogroups could not be explained by serotype frequency alone, thus pointing to nonrandom associations between co-colonizing serotypes. Finally, we discovered a previously unobserved mosaic serotype 39X, which was confirmed to carry a viable and structurally novel capsule. Adding to previous discoveries of other mosaic capsules in densely sampled collections, these results emphasize the strong adaptive potential of the bacterium by its ability to generate novel antigenic diversity by recombination.