ABSTRACT
Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models and tests. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Previous work established that failing to accommodate multinucleotide (or multihit, MH) substitutions strongly biases dN/dS-based inference towards false-positive inferences of diversifying episodic selection, as does failing to model variation in the rate of synonymous substitution (SRV) among sites. Here, we develop an integrated analytical framework and software tools to simultaneously incorporate these sources of evolutionary complexity into selection analyses. We found that both MH and SRV are ubiquitous in empirical alignments, and incorporating them has a strong effect on whether or not positive selection is detected (1.4-fold reduction) and on the distributions of inferred evolutionary rates. With simulation studies, we show that this effect is not attributable to reduced statistical power caused by using a more complex model. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we show that MH substitutions occurring along shorter branches in the tree explain a significant fraction of discrepant results in selection detection. Our results add to the growing body of literature which examines decades-old modeling assumptions (including MH) and finds them to be problematic for comparative genomic data analysis.
Because multinucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that selection analyses of this type consider their inclusion as a matter of routine. To facilitate this procedure, we developed, implemented, and benchmarked a simple and well-performing model-testing framework for selection detection, able to screen an alignment for positive selection while accounting for two biologically important confounding processes: site-to-site synonymous rate variation and multinucleotide instantaneous substitutions.
Subject(s)
Evolution, Molecular , Models, Genetic , Genomics , Biological Evolution , Selection, Genetic , Bias , Humans , Animals , Heuristics , Computer Simulation , Polymorphism, Single Nucleotide , Amino Acid Substitution , Polymorphism, Genetic , Viruses/genetics
ABSTRACT
Infections caused by antimicrobial-resistant Escherichia coli are the leading cause of death attributed to antimicrobial resistance (AMR) worldwide, and the known AMR mechanisms involve a range of functional proteins. Here, we employed a pan-genome wide association study (GWAS) approach on over 1,000 E. coli isolates from sick dogs collected across the US and Canada and identified a strong statistical association (empirical P < 0.01) of AMR, involving a range of antibiotics, with a group 1 capsular (CPS) gene cluster. This cluster included genes under relaxed selection pressure, had several loci missing, and had pseudogenes for other key loci. Furthermore, this cluster is widespread in E. coli and Klebsiella clinical isolates across multiple host species. Earlier studies demonstrated that the octameric CPS polysaccharide export protein Wza can transmit macrolide antibiotics into the E. coli periplasm. We suggest that the CPS in question, and its highly divergent Wza, functions as an antibiotic trap, preventing antimicrobial penetration. We also highlight the high diversity of lineages circulating in dogs across all regions studied, the overlap with human lineages, and regional prevalence of resistance to multiple antimicrobial classes. IMPORTANCE: Much of the human genomic epidemiology data available for E. coli mechanism discovery studies has been heavily biased toward Shiga toxin-producing strains from humans and livestock. E. coli occupies many niches and produces a wide variety of other significant pathotypes, including some implicated in chronic disease. We hypothesized that since dogs tend to share similar strains with their owners and are treated with similar antibiotics, their pathogenic isolates will harbor unexplored AMR mechanisms of importance to humans as well as animals.
By comparing over 1,000 genomes with in vitro antimicrobial susceptibility data from sick dogs across the US and Canada, we identified a strong multidrug resistance association with an operon that appears to have once conferred a type 1 capsule production system.
Subject(s)
Anti-Bacterial Agents , Dog Diseases , Drug Resistance, Multiple, Bacterial , Escherichia coli Infections , Escherichia coli , Dogs , Animals , Escherichia coli/genetics , Escherichia coli/drug effects , Dog Diseases/microbiology , Escherichia coli Infections/veterinary , Escherichia coli Infections/microbiology , Anti-Bacterial Agents/pharmacology , Drug Resistance, Multiple, Bacterial/genetics , Canada , Genome-Wide Association Study , Genome, Bacterial , United States , Bacterial Capsules/genetics , Multigene Family , Evolution, Molecular , Genomics , Escherichia coli Proteins/genetics
ABSTRACT
BACKGROUND: Protein-protein interactions play a crucial role in almost all cellular processes. Identifying interacting proteins reveals insight into living organisms and yields novel drug targets for disease treatment. Here, we present a publicly available, automated pipeline to predict genome-wide protein-protein interactions and produce high-quality multimeric structural models. RESULTS: Application of our method to the human and yeast genomes yields protein-protein interaction networks similar in quality to common experimental methods. We identified and modeled human proteins likely to interact with the papain-like protease of SARS-CoV-2's non-structural protein 3. We also produced models of SARS-CoV-2's spike protein (S) interacting with myelin-oligodendrocyte glycoprotein receptor and dipeptidyl peptidase-4. CONCLUSIONS: The presented method is capable of confidently identifying interactions while providing high-quality multimeric structural models for experimental validation. The interactome modeling pipeline is available at usegalaxy.org and usegalaxy.eu.
Subject(s)
COVID-19 , Protein Interaction Mapping , Humans , RNA, Viral/metabolism , SARS-CoV-2 , Saccharomyces cerevisiae/metabolism
ABSTRACT
There are several examples of coronaviruses in the Betacoronavirus subgenus Embecovirus that have jumped from an animal to the human host. Studying how evolutionary factors shape coronaviruses in non-human hosts may provide insight into the coronavirus host-switching potential. Equids, such as horses and donkeys, are susceptible to equine coronaviruses (ECoVs). With increased testing prevalence, several ECoV genome sequences have become available for molecular evolutionary analyses, especially those from the United States of America (USA). To date, no analyses have been performed to characterize evolution within coding regions of the ECoV genome. Here, we obtain and describe four new ECoV genome sequences from infected equines from across the USA presenting clinical symptoms of ECoV, and infer ECoV-specific and Embecovirus-wide patterns of molecular evolution. Within two of the four data sets analyzed, we find evidence of intra-host evolution within the nucleocapsid (N) gene, suggestive of quasispecies development. We also identify 12 putative genetic recombination events within the ECoV genome, 11 of which fall in ORF1ab. Finally, we infer and compare sites subject to positive selection on the ancestral branch of each major Embecovirus member clade. Specifically, for the two currently identified human coronavirus (HCoV) embecoviruses that have spilled from animals to humans (HCoV-OC43 and HCoV-HKU1), we find that there are 42 and 2 such sites, respectively, perhaps reflective of the more complex ancestral evolutionary history of HCoV-OC43, which involves several different animal hosts. IMPORTANCE: The Betacoronavirus subgenus Embecovirus contains coronaviruses that not only pose a health threat to animals and humans, but also have jumped from animal to human host. Equids, such as horses and donkeys, are susceptible to equine coronavirus (ECoV) infections. No studies have systematically examined evolutionary patterns within ECoV genomes.
Our study addresses this gap and provides insight into intra-host ECoV evolution from infected horses. Further, we identify and report natural selection pattern differences between two embecoviruses that have jumped from animals to humans [human coronavirus OC43 and HKU1 (HCoV-OC43 and HCoV-HKU1, respectively)], and hypothesize that the differences observed may be due to the different animal host(s) that each virus circulated in prior to its jump into humans. Finally, we contribute four novel, high-quality ECoV genomes to the scientific community.
ABSTRACT
Despite increasing threats of extinction to Elasmobranchii (sharks and rays), whole genome-based conservation insights are lacking. Here, we present chromosome-level genome assemblies for the Critically Endangered great hammerhead (Sphyrna mokarran) and the Endangered shortfin mako (Isurus oxyrinchus) sharks, with genetic diversity and historical demographic comparisons to other shark species. The great hammerhead exhibited low genetic variation, with 8.7% of the 2.77 Gbp genome in runs of homozygosity (ROH) > 1 Mbp and 74.4% in ROH >100 kbp. The 4.98 Gbp shortfin mako genome had considerably greater diversity and <1% in ROH > 1 Mbp. Both these sharks experienced precipitous declines in effective population size (Ne) over the last 250 thousand years. While shortfin mako exhibited a large historical Ne that may have enabled the retention of higher genetic variation, the genomic data suggest a possibly more concerning picture for the great hammerhead, and a need for evaluation with additional individuals.
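As a rough worked example of the figures quoted above, the ROH percentages can be converted into approximate absolute genome spans. This is an illustrative sketch only: the helper name `roh_span_mbp` is hypothetical (it is not from the study), and the inputs are the rounded values given in the abstract, so the results are approximations.

```python
def roh_span_mbp(genome_size_gbp: float, roh_percent: float) -> float:
    """Approximate genome span (in Mbp) covered by runs of homozygosity.

    Hypothetical helper for illustration; inputs are rounded abstract values.
    """
    if genome_size_gbp <= 0:
        raise ValueError("genome size must be positive")
    if not 0 <= roh_percent <= 100:
        raise ValueError("percentage must lie in [0, 100]")
    # 1 Gbp = 1000 Mbp; the percentage is converted to a fraction.
    return genome_size_gbp * 1000.0 * roh_percent / 100.0


# Great hammerhead: 2.77 Gbp genome, 8.7% in ROH > 1 Mbp, 74.4% in ROH > 100 kbp.
hammerhead_long = roh_span_mbp(2.77, 8.7)
hammerhead_all = roh_span_mbp(2.77, 74.4)

# Shortfin mako: 4.98 Gbp genome, <1% in ROH > 1 Mbp, so this is an upper bound.
mako_long_upper = roh_span_mbp(4.98, 1.0)

print(f"Great hammerhead, ROH > 1 Mbp:   ~{hammerhead_long:.0f} Mbp")
print(f"Great hammerhead, ROH > 100 kbp: ~{hammerhead_all:.0f} Mbp")
print(f"Shortfin mako, ROH > 1 Mbp:      <{mako_long_upper:.0f} Mbp")
```

Under these rounded inputs, roughly 241 Mbp of the great hammerhead genome falls in ROH longer than 1 Mbp, whereas the corresponding span in the larger shortfin mako genome is below about 50 Mbp.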
ABSTRACT
Feline Coronaviruses (FCoVs) commonly cause mild enteric infections in felines worldwide (termed Feline Enteric Coronavirus [FECV]), with around 12% developing into deadly Feline Infectious Peritonitis (FIP; Feline Infectious Peritonitis Virus [FIPV]). Genomic differences between FECV and FIPV have been reported, yet the putative genotypic basis of the highly pathogenic phenotype remains unclear. Here, we used state-of-the-art molecular evolutionary genetic statistical techniques to identify and compare differences in natural selection pressure between FECV and FIPV sequences, as well as to identify FIPV and FECV specific signals of positive selection. We analyzed full length FCoV protein coding genes thought to contain mutations associated with FIPV (Spike, ORF3abc, and ORF7ab). We identified two sites exhibiting differences in natural selection pressure between FECV and FIPV: one within the S1/S2 furin cleavage site, and the other within the fusion domain of Spike. We also found 15 sites subject to positive selection associated with FIPV within Spike, 11 of which have not previously been suggested as possibly relevant to FIP development. These sites fall within Spike protein subdomains that participate in host cell receptor interaction, immune evasion, tropism shifts, host cellular entry, and viral escape. There were 14 sites (12 novel) within Spike under positive selection associated with the FECV phenotype, almost exclusively within the S1/S2 furin cleavage site and adjacent C domain, along with a signal of relaxed selection in FIPV relative to FECV, suggesting that furin cleavage functionality may not be needed for FIPV. Positive selection inferred in ORF7b was associated with the FECV phenotype, and included 24 positively selected sites, while ORF7b had signals of relaxed selection in FIPV. We found evidence of positive selection in ORF3c in FCoV wide analyses, but no specific association with the FIPV or FECV phenotype. 
We hypothesize that some combination of mutations in FECV may contribute to FIP development, and that it is unlikely to be one singular "switch" mutational event. This work expands our understanding of the complexities of FIP development and provides insights into how evolutionary forces may alter pathogenesis in coronavirus genomes.
ABSTRACT
Feline coronaviruses (FCoVs) commonly cause mild enteric infections in felines worldwide (termed feline enteric coronavirus [FECV]), with around 12 per cent developing into deadly feline infectious peritonitis (FIP; feline infectious peritonitis virus [FIPV]). Genomic differences between FECV and FIPV have been reported, yet the putative genotypic basis of the highly pathogenic phenotype remains unclear. Here, we used state-of-the-art molecular evolutionary genetic statistical techniques to identify and compare differences in natural selection pressure between FECV and FIPV sequences, as well as to identify FIPV- and FECV-specific signals of positive selection. We analyzed full-length FCoV protein coding genes thought to contain mutations associated with FIPV (Spike, ORF3abc, and ORF7ab). We identified two sites exhibiting differences in natural selection pressure between FECV and FIPV: one within the S1/S2 furin cleavage site (FCS) and the other within the fusion domain of Spike. We also found fifteen sites subject to positive selection associated with FIPV within Spike, eleven of which have not previously been suggested as possibly relevant to FIP development. These sites fall within Spike protein subdomains that participate in host cell receptor interaction, immune evasion, tropism shifts, host cellular entry, and viral escape. There were fourteen sites (twelve novel sites) within Spike under positive selection associated with the FECV phenotype, almost exclusively within the S1/S2 FCS and adjacent to C domain, along with a signal of relaxed selection in FIPV relative to FECV, suggesting that furin cleavage functionality may not be needed for FIPV. Positive selection inferred in ORF7b was associated with the FECV phenotype and included twenty-four positively selected sites, while ORF7b had signals of relaxed selection in FIPV. We found evidence of positive selection in ORF3c in FCoV-wide analyses, but no specific association with the FIPV or FECV phenotype. 
We hypothesize that some combination of mutations in FECV may contribute to FIP development, and that it is unlikely to be one singular 'switch' mutational event. This work expands our understanding of the complexities of FIP development and provides insights into how evolutionary forces may alter pathogenesis in coronavirus genomes.
ABSTRACT
An important component of efforts to manage the ongoing COVID-19 pandemic is the Rapid Assessment of how natural selection contributes to the emergence and proliferation of potentially dangerous SARS-CoV-2 lineages and CLades (RASCL). The RASCL pipeline enables continuous comparative phylogenetics-based selection analyses of rapidly growing clade-focused genome surveillance datasets, such as those produced following the initial detection of potentially dangerous variants. From such datasets RASCL automatically generates down-sampled codon alignments of individual genes/ORFs containing contextualizing background reference sequences, analyzes these with a battery of selection tests, and outputs results as both machine-readable JSON files and interactive notebook-based visualizations. AVAILABILITY: RASCL is available from a dedicated repository at https://github.com/veg/RASCL and as a Galaxy workflow https://usegalaxy.eu/u/hyphy/w/rascl . Existing clade/variant analysis results are available here: https://observablehq.com/@aglucaci/rascl . CONTACT: Dr. Sergei L Kosakovsky Pond ( spond@temple.edu ). SUPPLEMENTARY INFORMATION: N/A.
ABSTRACT
Small heat shock proteins (sHSPs) emerged early in evolution and occur in all domains of life and in nearly all species, including humans. Mutations in four sHSPs (HspB1, HspB3, HspB5, HspB8) are associated with neuromuscular disorders. The aim of this study is to investigate the evolutionary forces shaping these sHSPs during vertebrate evolution. We performed comparative evolutionary analyses on a set of orthologous sHSP sequences, based on the ratio of non-synonymous to synonymous substitution rates for each codon. We found that these sHSPs had been historically exposed to different degrees of purifying selection, decreasing in this order: HspB8 > HspB1, HspB5 > HspB3. Within each sHSP, regions with different degrees of purifying selection can be discerned, resulting in characteristic selective pressure profiles. The conserved α-crystallin domains were exposed to the most stringent purifying selection compared to the flanking regions, supporting a 'dimorphic pattern' of evolution. Thus, during vertebrate evolution the different sequence partitions were exposed to different and measurable degrees of selective pressures. Among the disease-associated mutations, most are missense mutations primarily in HspB1 and to a lesser extent in the other sHSPs. Our data provide an explanation for this disparate incidence. Contrary to the expectation, most missense mutations cause dominant disease phenotypes. Theoretical considerations support a connection between the historic exposure of these sHSP genes to a high degree of purifying selection and the unusual prevalence of genetic dominance of the associated disease phenotypes. Our study puts the genetics of inheritable sHSP-borne diseases into the context of vertebrate evolution.
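The per-codon rate ratio mentioned above is conventionally interpreted by comparing it with 1. The following minimal sketch shows that convention only; it is not the study's method. The function name `classify_omega`, the tolerance parameter, and the example rates are hypothetical, and real analyses rely on likelihood-based statistical tests rather than thresholding point estimates.

```python
def classify_omega(dn: float, ds: float, tol: float = 1e-9) -> str:
    """Label a codon by its non-synonymous/synonymous rate ratio (omega).

    Hypothetical illustration: omega < 1 suggests purifying selection,
    omega close to 1 neutral evolution, omega > 1 diversifying selection.
    """
    if dn < 0 or ds <= 0:
        raise ValueError("dn must be non-negative and ds must be positive")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "purifying" if omega < 1.0 else "diversifying"


# Made-up per-codon rate estimates (dN, dS), purely for illustration.
example_codons = {
    "codon_A": (0.05, 1.00),  # strongly constrained
    "codon_B": (1.00, 1.00),  # evolving as if neutral
    "codon_C": (2.50, 1.00),  # excess of non-synonymous change
}

labels = {name: classify_omega(dn, ds) for name, (dn, ds) in example_codons.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

A lower ratio corresponds to more stringent purifying selection, which is the sense in which the abstract ranks the four sHSPs.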
Subject(s)
Heat-Shock Proteins , Molecular Chaperones , alpha-Crystallins , Animals , Heat-Shock Proteins/genetics , Heat-Shock Proteins, Small/genetics , Humans , Molecular Chaperones/genetics , Mutation , Vertebrates/genetics , alpha-Crystallin B Chain , alpha-Crystallins/genetics
ABSTRACT
An important unmet need revealed by the COVID-19 pandemic is the near-real-time identification of potentially fitness-altering mutations within rapidly growing SARS-CoV-2 lineages. Although powerful molecular sequence analysis methods are available to detect and characterize patterns of natural selection within modestly sized gene-sequence datasets, the computational complexity of these methods and their sensitivity to sequencing errors render them effectively inapplicable in large-scale genomic surveillance contexts. Motivated by the need to analyze new lineage evolution in near-real time using large numbers of genomes, we developed the Rapid Assessment of Selection within CLades (RASCL) pipeline. RASCL applies state-of-the-art phylogenetic comparative methods to evaluate selective processes acting at individual codon sites and across whole genes. RASCL is scalable and produces automatically updated, regular lineage-specific selection analysis reports, even for lineages that include tens or hundreds of thousands of sampled genome sequences. Key to this performance is (i) generation of automatically subsampled high-quality datasets of gene/ORF sequences drawn from a selected "query" viral lineage; (ii) contextualization of these query sequences in codon alignments that include high-quality "background" sequences representative of global SARS-CoV-2 diversity; and (iii) the extensive parallelization of a suite of computationally intensive selection analysis tests. Within hours of being deployed to analyze a novel rapidly growing lineage of interest, RASCL will begin yielding JavaScript Object Notation (JSON)-formatted reports that can be either imported into third-party analysis software or explored in standard web browsers using the premade RASCL interactive data visualization dashboard.
By enabling the rapid detection of genome sites evolving under different selective regimes, RASCL is well-suited for near-real-time monitoring of the population-level selective processes that will likely underlie the emergence of future variants of concern in measurably evolving pathogens with extensive genomic surveillance.
Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Pandemics , COVID-19/epidemiology , COVID-19/genetics , Phylogeny , Codon/genetics , Sequence Analysis , Genome, Viral
ABSTRACT
A canine coronavirus (CCoV) has now been reported from two independent human samples from Malaysia (respiratory, collected in 2017-2018; CCoV-HuPn-2018) and Haiti (urine, collected in 2017); these two viruses were nearly genetically identical. In an effort to identify any novel adaptations associated with this apparent shift in tropism, we carried out detailed evolutionary analyses of the spike gene of this virus in the context of related Alphacoronavirus 1 species. The spike 0-domain retains homology to CCoV2b (enteric infections) and Transmissible Gastroenteritis Virus (TGEV; enteric and respiratory). This domain is subject to relaxed selection pressure and an increased rate of molecular evolution. It contains unique amino acid substitutions, including within a region important for sialic acid binding and pathogenesis in TGEV. Overall, the spike gene is extensively recombinant, with a feline coronavirus type II strain serving a prominent role in the recombinant history of the virus. Molecular divergence time, for a segment of the gene where temporal signal could be determined, was estimated at around 60 years ago. We hypothesize that the virus had an enteric origin, but that it may be losing that particular tropism, possibly because of mutations in the sialic acid binding region of the spike 0-domain.
Subject(s)
Coronavirus, Canine , Animals , Cats , Dogs , N-Acetylneuraminic Acid , Spike Glycoprotein, Coronavirus/genetics , Tropism , Zoonoses
ABSTRACT
Recombination contributes to the genetic diversity found in coronaviruses and is known to be a prominent mechanism whereby they evolve. It is apparent, both from controlled experiments and in genome sequences sampled from nature, that patterns of recombination in coronaviruses are non-random and that this is likely attributable to a combination of sequence features that favour the occurrence of recombination break points at specific genomic sites, and selection disfavouring the survival of recombinants within which favourable intra-genome interactions have been disrupted. Here we leverage available whole-genome sequence data for six coronavirus subgenera to identify specific patterns of recombination that are conserved between multiple subgenera and then identify the likely factors that underlie these conserved patterns. Specifically, we confirm the non-randomness of recombination break points across all six tested coronavirus subgenera, locate conserved recombination hot- and cold-spots, and determine that the locations of transcriptional regulatory sequences are likely major determinants of conserved recombination break-point hotspot locations. We find that while the locations of recombination break points are not uniformly associated with degrees of nucleotide sequence conservation, they display significant tendencies in multiple coronavirus subgenera to occur in low guanine-cytosine content genome regions, in non-coding regions, at the edges of genes, and at sites within the Spike gene that are predicted to be minimally disruptive of Spike protein folding. While it is apparent that sequence features such as transcriptional regulatory sequences are likely major determinants of where the template-switching events that yield recombination break points most commonly occur, it is evident that selection against misfolded recombinant proteins also strongly impacts observable recombination break-point distributions in coronavirus genomes sampled from nature.