Results 1 - 20 of 29
1.
Cell ; 184(20): 5189-5200.e7, 2021 09 30.
Article in English | MEDLINE | ID: mdl-34537136

ABSTRACT

The independent emergence late in 2020 of the B.1.1.7, B.1.351, and P.1 lineages of SARS-CoV-2 prompted renewed concerns about the evolutionary capacity of this virus to overcome public health interventions and rising population immunity. Here, by examining patterns of synonymous and non-synonymous mutations that have accumulated in SARS-CoV-2 genomes since the pandemic began, we find that the emergence of these three "501Y lineages" coincided with a major global shift in the selective forces acting on various SARS-CoV-2 genes. Following their emergence, the adaptive evolution of 501Y lineage viruses has involved repeated selectively favored convergent mutations at 35 genome sites, mutations we refer to as the 501Y meta-signature. The ongoing convergence of viruses in many other lineages on this meta-signature suggests that it includes multiple mutation combinations capable of promoting the persistence of diverse SARS-CoV-2 lineages in the face of mounting host immune recognition.


Subject(s)
COVID-19/epidemiology , Evolution, Molecular , Mutation , Pandemics , SARS-CoV-2/genetics , Amino Acid Sequence/genetics , COVID-19/immunology , COVID-19/transmission , COVID-19/virology , Codon/genetics , Genes, Viral , Genetic Drift , Host Adaptation/genetics , Humans , Immune Evasion , Phylogeny , Public Health
2.
Nature ; 592(7854): 438-443, 2021 04.
Article in English | MEDLINE | ID: mdl-33690265

ABSTRACT

Continued uncontrolled transmission of SARS-CoV-2 in many parts of the world is creating conditions for substantial evolutionary changes to the virus [1,2]. Here we describe a newly arisen lineage of SARS-CoV-2 (designated 501Y.V2; also known as B.1.351 or 20H) that is defined by eight mutations in the spike protein, including three substitutions (K417N, E484K and N501Y) at residues in its receptor-binding domain that may have functional importance [3-5]. This lineage was identified in South Africa after the first wave of the epidemic in a severely affected metropolitan area (Nelson Mandela Bay) that is located on the coast of the Eastern Cape province. This lineage spread rapidly, and became dominant in Eastern Cape, Western Cape and KwaZulu-Natal provinces within weeks. Although the full import of the mutations is yet to be determined, the genomic data, which show rapid expansion and displacement of other lineages in several regions, suggest that this lineage is associated with a selection advantage that most plausibly results from increased transmissibility or immune escape [6-8].


Subject(s)
COVID-19/virology , Mutation , Phylogeny , Phylogeography , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification , COVID-19/epidemiology , COVID-19/immunology , COVID-19/transmission , DNA Mutational Analysis , Evolution, Molecular , Genetic Fitness , Humans , Immune Evasion , Models, Molecular , SARS-CoV-2/immunology , SARS-CoV-2/pathogenicity , Selection, Genetic , South Africa/epidemiology , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Spike Glycoprotein, Coronavirus/metabolism , Time Factors
3.
PLoS Biol ; 19(3): e3001115, 2021 03.
Article in English | MEDLINE | ID: mdl-33711012

ABSTRACT

Virus host shifts are generally associated with novel adaptations to exploit the cells of the new host species optimally. Surprisingly, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has apparently required little to no significant adaptation to humans since the start of the Coronavirus Disease 2019 (COVID-19) pandemic and to October 2020. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus the early SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects evidence for significant positive episodic diversifying selection acting at the base of the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in these ancestral bat hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor about 1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. While an undiscovered "facilitating" intermediate species cannot be discounted, collectively, our results support the progenitor of SARS-CoV-2 being capable of efficient human-human transmission as a consequence of its adaptive evolutionary history in bats, not humans, which created a relatively generalist virus.
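
The CpG depletion described above is commonly summarized as an observed/expected CpG ratio. The following is a minimal sketch under that assumption; the abstract does not state which statistic the authors used, and `cpg_obs_exp` is a hypothetical helper name introduced only for illustration.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G).

    Assumed definition for illustration; not taken from the paper.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the full count.
    cg = s.count("CG")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0
    return cg * n / (c * g)
```

A ratio well below 1 for a gene or genome region would be read as CpG depletion relative to base composition; comparing the ratio across Spike segments is one way a differential CpG content could show up.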


Subject(s)
COVID-19/virology , Chiroptera/virology , SARS-CoV-2/genetics , Viral Zoonoses/virology , Animals , COVID-19/epidemiology , COVID-19/transmission , Evolution, Molecular , Genome, Viral , Host Specificity , Humans , Pandemics , Phylogeny , Receptors, Virus/genetics , SARS-CoV-2/pathogenicity , Selection, Genetic , Viral Zoonoses/genetics , Viral Zoonoses/transmission
4.
Mol Biol Evol ; 39(4)2022 04 11.
Article in English | MEDLINE | ID: mdl-35325204

ABSTRACT

Among the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and, in combination with other mutations, adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.


Subject(s)
COVID-19 , Spike Glycoprotein, Coronavirus , COVID-19/genetics , Humans , Mutation , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/genetics
5.
Bioinformatics ; 38(10): 2719-2726, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561179

ABSTRACT

MOTIVATION: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features. RESULTS: We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern. 
AVAILABILITY AND IMPLEMENTATION: TopHap is available at https://github.com/SayakaMiura/TopHap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
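
The idea of reducing genomes to haplotypes over positions that harbor common variants can be sketched as below. This is an illustration only, not the TopHap implementation: it assumes pre-aligned, equal-length genomes, ignores the spatiotemporal filtering the abstract describes, and the function name and the `min_freq` parameter are hypothetical.

```python
from collections import Counter

def common_variant_haplotypes(genomes, reference, min_freq=0.05):
    """Return (positions, haplotype_counts) for aligned genomes.

    positions: 0-based sites where some non-reference base reaches min_freq.
    haplotype_counts: Counter of the bases each genome carries at those sites.
    """
    n = len(genomes)
    positions = []
    for i, ref_base in enumerate(reference):
        # Tally non-reference bases observed at this site.
        counts = Counter(g[i] for g in genomes if g[i] != ref_base)
        if counts and max(counts.values()) / n >= min_freq:
            positions.append(i)
    # Collapse each genome to its bases at the retained sites.
    haplotypes = Counter("".join(g[i] for i in positions) for g in genomes)
    return positions, haplotypes
```

Because millions of genomes collapse to a much smaller set of distinct haplotypes, a tree would then be built over the haplotypes rather than over every genome.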


Subject(s)
COVID-19 , SARS-CoV-2 , Genome, Viral , Haplotypes , Humans , Mutation , Phylogeny , SARS-CoV-2/genetics
6.
Mol Biol Evol ; 38(3): 1184-1198, 2021 03 09.
Article in English | MEDLINE | ID: mdl-33064823

ABSTRACT

A number of evolutionary hypotheses can be tested by comparing selective pressures among sets of branches in a phylogenetic tree. When the question of interest is to identify specific sites within genes that may be evolving differently, a common approach is to perform separate analyses on subsets of sequences and compare parameter estimates in a post hoc fashion. This approach is statistically suboptimal and not always applicable. Here, we develop a simple extension of a popular fixed effects likelihood method in the context of codon-based evolutionary phylogenetic maximum likelihood testing, Contrast-FEL. It is suitable for identifying individual alignment sites where any among the K≥2 sets of branches in a phylogenetic tree have detectably different ω ratios, indicative of different selective regimes. Using extensive simulations, we show that Contrast-FEL delivers good power, exceeding 90% for sufficiently large differences, while maintaining tight control over false positive rates, when the model is correctly specified. We conclude by applying Contrast-FEL to data from five previously published studies spanning a diverse range of organisms and focusing on different evolutionary questions.


Subject(s)
Genetic Techniques , Phylogeny , Selection, Genetic , Brassicaceae/genetics , Cytochromes b/genetics , HIV Reverse Transcriptase/genetics , Haemosporida/genetics , Rhodopsin/genetics , Ribulose-Bisphosphate Carboxylase/genetics , Trichomes/genetics
7.
Mol Biol Evol ; 38(8): 3046-3059, 2021 07 29.
Article in English | MEDLINE | ID: mdl-33942847

ABSTRACT

Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/).


Subject(s)
COVID-19/genetics , SARS-CoV-2/genetics , Biological Evolution , COVID-19/metabolism , Computational Biology/methods , Contact Tracing/methods , Evolution, Molecular , Genome, Viral , Humans , Mutation , Pandemics , Phylogeny , SARS-CoV-2/metabolism , SARS-CoV-2/pathogenicity , Sequence Analysis, DNA/methods
8.
PLoS Pathog ; 16(8): e1008643, 2020 08.
Article in English | MEDLINE | ID: mdl-32790776

ABSTRACT

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.


Subject(s)
Betacoronavirus/pathogenicity , Coronavirus Infections/virology , Pneumonia, Viral/virology , Public Health , Severe Acute Respiratory Syndrome/virology , COVID-19 , Data Analysis , Humans , Pandemics , SARS-CoV-2
9.
Mol Biol Evol ; 37(1): 295-299, 2020 Jan 01.
Article in English | MEDLINE | ID: mdl-31504749

ABSTRACT

HYpothesis testing using PHYlogenies (HyPhy) is a scriptable, open-source package for fitting a broad range of evolutionary models to multiple sequence alignments, and for conducting subsequent parameter estimation and hypothesis testing, primarily in the maximum likelihood statistical framework. It has become a popular choice for characterizing various aspects of the evolutionary process: natural selection, evolutionary rates, recombination, and coevolution. The 2.5 release (available from www.hyphy.org) includes a completely re-engineered computational core and analysis library that introduces new classes of evolutionary models and statistical tests, delivers substantial performance and stability enhancements, improves usability, streamlines end-to-end analysis workflows, makes it easier to develop custom analyses, and is mostly backward compatible with previous HyPhy releases.


Subject(s)
Genetic Techniques , Phylogeny , Software
10.
Mol Biol Evol ; 35(7): 1812-1819, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29401317

ABSTRACT

In modern applications of molecular epidemiology, genetic sequence data are routinely used to identify clusters of transmission in rapidly evolving pathogens, most notably HIV-1. Traditional 'shoe-leather' epidemiology infers transmission clusters by tracing chains of partners sharing epidemiological connections (e.g., sexual contact). Here, we present a computational tool for identifying a molecular transmission analog of such clusters: HIV-TRACE (TRAnsmission Cluster Engine). HIV-TRACE implements an approach inspired by traditional epidemiology, by identifying chains of partners whose viral genetic relatedness implies direct or indirect epidemiological connections. Molecular transmission clusters are constructed using codon-aware pairwise alignment to a reference sequence followed by pairwise genetic distance estimation among all sequences. This approach is computationally tractable and is capable of identifying HIV-1 transmission clusters in large surveillance databases comprising tens or hundreds of thousands of sequences in near real time, that is, on the order of minutes to hours. HIV-TRACE is available at www.hivtrace.org and from www.github.com/veg/hivtrace, along with the accompanying result visualization module from www.github.com/veg/hivtrace-viz. Importantly, the approach underlying HIV-TRACE is not limited to the study of HIV-1 and can be applied to study outbreaks and epidemics of other rapidly evolving pathogens.
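
The clustering step described above (pairwise distances, then chains of linked sequences) can be sketched as threshold-based connected components. This is not the HIV-TRACE code: a plain p-distance stands in for the tool's distance model, the sequences are assumed to be already aligned to equal length, and all names are hypothetical. The default of 0.015 is the value reported for HIV pol in a later record of this listing.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def transmission_clusters(seqs: dict, threshold: float = 0.015) -> list:
    """Group sequence names into clusters linked by distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    # Singletons are not clusters in the epidemiological sense.
    return [c for c in clusters.values() if len(c) > 1]
```

Note that two sequences can land in the same cluster without being directly linked, which mirrors the "direct or indirect" connections mentioned in the abstract. The all-pairs loop is quadratic, so this sketch only suits small inputs.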


Subject(s)
HIV Infections/transmission , HIV-1/genetics , Molecular Epidemiology/methods , Computational Biology , HIV Infections/epidemiology , Humans , Software
11.
Mol Biol Evol ; 35(3): 773-777, 2018 Mar 01.
Article in English | MEDLINE | ID: mdl-29301006

ABSTRACT

Inference of how evolutionary forces have shaped extant genetic diversity is a cornerstone of modern comparative sequence analysis. Advances in sequence generation and increased statistical sophistication of relevant methods now allow researchers to extract ever more evolutionary signal from the data, albeit at an increased computational cost. Here, we announce the release of Datamonkey 2.0, a completely re-engineered version of the Datamonkey web-server for analyzing evolutionary signatures in sequence data. For this endeavor, we leveraged recent developments in open-source libraries that facilitate interactive, robust, and scalable web application development. Datamonkey 2.0 provides a carefully curated collection of methods for interrogating coding-sequence alignments for imprints of natural selection, packaged as a responsive (i.e. can be viewed on tablet and mobile devices), fully interactive, and API-enabled web application. To complement Datamonkey 2.0, we additionally release HyPhy Vision, an accompanying JavaScript application for visualizing analysis results. HyPhy Vision can also be used separately from Datamonkey 2.0 to visualize locally executed HyPhy analyses. Together, Datamonkey 2.0 and HyPhy Vision showcase how scientific software development can benefit from general-purpose open-source frameworks. Datamonkey 2.0 is freely and publicly available at http://www.datamonkey.org, and the underlying codebase is available from https://github.com/veg/datamonkey-js.

12.
PLoS Comput Biol ; 14(12): e1006498, 2018 12.
Article in English | MEDLINE | ID: mdl-30543621

ABSTRACT

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster), and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. 
The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.


Subject(s)
Sequence Alignment/methods , Sequence Analysis, DNA/methods , Viruses/genetics , High-Throughput Nucleotide Sequencing/methods , Phylogeny , Software
13.
BMC Bioinformatics ; 19(1): 276, 2018 07 25.
Article in English | MEDLINE | ID: mdl-30045713

ABSTRACT

BACKGROUND: While several JavaScript packages for visualizing phylogenetic trees exist, most are best characterized as frameworks that are designed with a specific set of tasks in mind. Extending such packages to use cases that are not available as features often ends up being difficult. Moreover, existing packages tend to produce standalone widgets that are not designed to serve as middleware, as opposed to flexible tools that can integrate with other components of an application. RESULTS: phylotree.js is a library that extends the popular data visualization framework d3.js, and is suitable for building JavaScript applications where users can view and interact with phylogenetic trees. The effects of such interactions can be captured and communicated to other package components, making it possible to engineer complex and responsive applications that include phylogenetic trees. phylotree.js implements several abstractions in addition to features, and comes with a documented application programming interface, thus promoting interoperability and extensibility. Example applications include a tool to visualize and annotate phylogenetic trees, a web application for comparative sequence analysis, a structural viewer that interacts with a large phylogenetic tree, and an interactive tanglegram. CONCLUSIONS: phylotree.js is a useful tool and application module for a variety of computational biology software applications. The code is available on GitHub and is released under the MIT license.


Subject(s)
Computational Biology/methods , Phylogeny , Software , Sequence Analysis, DNA , User-Computer Interface
14.
Mol Biol Evol ; 32(5): 1342-53, 2015 May.
Article in English | MEDLINE | ID: mdl-25697341

ABSTRACT

Over the past two decades, comparative sequence analysis using codon-substitution models has been honed into a powerful and popular approach for detecting signatures of natural selection from molecular data. A substantial body of work has focused on developing a class of "branch-site" models which permit selective pressures on sequences, quantified by the ω ratio, to vary among both codon sites and individual branches in the phylogeny. We develop and present a method in this class, adaptive branch-site random effects likelihood (aBSREL), whose key innovation is variable parametric complexity chosen with an information theoretic criterion. By applying models of different complexity to different branches in the phylogeny, aBSREL delivers statistical performance matching or exceeding best-in-class existing approaches, while running an order of magnitude faster. Based on simulated data analysis, we offer guidelines for what extent and strength of diversifying positive selection can be detected reliably and suggest that there is a natural limit on the optimal parametric complexity for "branch-site" models. An aBSREL analysis of 8,893 Euteleostomes gene alignments demonstrates that over 80% of branches in typical gene phylogenies can be adequately modeled with a single ω ratio model, that is, current models are unnecessarily complicated. However, there are a relatively small number of key branches, whose identities are derived from the data using a model selection procedure, for which it is essential to accurately model evolutionary complexity.
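
Choosing per-branch model complexity with an information-theoretic criterion can be illustrated as follows. This sketch assumes the small-sample corrected AIC (AICc) as the criterion; the abstract says only "information theoretic", and the function names and the shape of the `fits` input are hypothetical.

```python
def aicc(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Small-sample corrected Akaike information criterion."""
    aic = 2 * n_params - 2 * log_likelihood
    correction = (2 * n_params * (n_params + 1)) / (n_samples - n_params - 1)
    return aic + correction

def pick_rate_classes(fits: dict, n_samples: int) -> int:
    """Pick the number of omega rate classes with the lowest AICc.

    fits maps a rate-class count to (log_likelihood, n_params) for one branch.
    """
    return min(fits, key=lambda k: aicc(fits[k][0], fits[k][1], n_samples))
```

Under this kind of rule, a branch gains extra ω classes only when the likelihood improvement outweighs the parameter penalty, which is consistent with the abstract's finding that most branches are adequately described by a single ω.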


Subject(s)
Codon/genetics , Evolution, Molecular , Selection, Genetic/genetics , Computer Simulation , Genetic Variation , Phylogeny
15.
Mol Biol Evol ; 32(5): 1365-71, 2015 May.
Article in English | MEDLINE | ID: mdl-25701167

ABSTRACT

We present BUSTED, a new approach to identifying gene-wide evidence of episodic positive selection, where the non-synonymous substitution rate is transiently greater than the synonymous rate. BUSTED can be used either on an entire phylogeny (without requiring an a priori hypothesis regarding which branches are under positive selection) or on a pre-specified subset of foreground lineages (if a suitable a priori hypothesis is available). Selection is modeled as varying stochastically over branches and sites, and we propose a computationally inexpensive evidence metric for identifying sites subject to episodic positive selection on any foreground branches. We compare BUSTED with existing models on simulated and empirical data. An implementation is available on www.datamonkey.org/busted, with a widget allowing the interactive specification of foreground branches.


Subject(s)
Computer Simulation , Evolution, Molecular , Selection, Genetic/genetics , Models, Genetic , Phylogeny
16.
PLoS Comput Biol ; 10(9): e1003842, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25254639

ABSTRACT

Since its identification in 1983, HIV-1 has been the focus of a research effort unprecedented in scope and difficulty, whose ultimate goals, a cure and a vaccine, remain elusive. One of the fundamental challenges in accomplishing these goals is the tremendous genetic variability of the virus, with some genes differing at as many as 40% of nucleotide positions among circulating strains. Because of this, the genetic bases of many viral phenotypes, most notably the susceptibility to neutralization by a particular antibody, are difficult to identify computationally. Drawing upon open-source general-purpose machine learning algorithms and libraries, we have developed a software package IDEPI (IDentify EPItopes) for learning genotype-to-phenotype predictive models from sequences with known phenotypes. IDEPI can apply learned models to classify sequences of unknown phenotypes, and also identify specific sequence features which contribute to a particular phenotype. We demonstrate that IDEPI achieves performance similar to or better than that of previously published approaches on four well-studied problems: finding the epitopes of broadly neutralizing antibodies (bNab), determining coreceptor tropism of the virus, identifying compartment-specific genetic signatures of the virus, and deducing drug-resistance associated mutations. The cross-platform Python source code (released under the GPL 3.0 license), documentation, issue tracking, and a pre-configured virtual machine for IDEPI can be found at https://github.com/veg/idepi.


Subject(s)
Antibodies, Neutralizing , Epitopes , HIV Antibodies/immunology , HIV-1 , Human Immunodeficiency Virus Proteins , AIDS Dementia Complex , Algorithms , Antibodies, Neutralizing/immunology , Computational Biology/methods , Drug Resistance, Viral , Epitopes/chemistry , Epitopes/immunology , HIV Infections/immunology , HIV Infections/virology , HIV-1/chemistry , HIV-1/immunology , Human Immunodeficiency Virus Proteins/chemistry , Human Immunodeficiency Virus Proteins/immunology , Humans , Machine Learning , Phenotype , Sequence Analysis, Protein/methods , Software
17.
Front Bioinform ; 4: 1400003, 2024.
Article in English | MEDLINE | ID: mdl-39086842

ABSTRACT

Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. 
Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.
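
The trade-off described above (many clusters, but no single giant cluster) can be sketched with a toy score. The abstract does not give the authors' scoring formula, so the expression below is an assumption chosen only to show the shape of the heuristic; `threshold_score` and `best_threshold` are hypothetical names.

```python
def threshold_score(cluster_sizes) -> float:
    """Toy score: number of clusters divided by the largest/second-largest ratio.

    A giant cluster inflates the ratio and lowers the score; more clusters
    raise it. Assumed formula, not the published one.
    """
    sizes = sorted(cluster_sizes, reverse=True)
    if len(sizes) < 2:
        return 0.0  # the ratio needs at least two clusters
    ratio = sizes[0] / sizes[1]
    return len(sizes) / ratio

def best_threshold(sizes_by_threshold: dict) -> float:
    """Return the candidate threshold whose clustering scores highest."""
    return max(
        sizes_by_threshold,
        key=lambda t: threshold_score(sizes_by_threshold[t]),
    )
```

In use, one would rebuild the network at each candidate distance threshold, record the resulting cluster sizes, and pass them in as `sizes_by_threshold`.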

18.
bioRxiv ; 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38559140

ABSTRACT

Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.
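
The two quantities named in the abstract above (the number of clusters, and the ratio of the largest to the second largest cluster) can be illustrated with a minimal Python sketch. It is not the authors' scoring function and does not use the HIV-TRACE API: the abstract does not give the formula, so the selection rule in pick_threshold, the max_ratio cutoff, the function names, and the toy distances are all hypothetical and serve only to show how the quantities change with the threshold.

```python
from collections import Counter

# Hypothetical toy data: pairwise genetic distances (substitutions/site)
# between six sequences, keyed by (index_a, index_b).
TOY_DISTANCES = {
    (0, 1): 0.004,
    (2, 3): 0.006,
    (4, 5): 0.012,
    (1, 2): 0.014,
    (3, 4): 0.020,
}


def cluster_sizes(n_seqs, distances, threshold):
    """Link pairs with distance <= threshold and return the sizes of the
    resulting connected components with at least two members, largest first."""
    parent = list(range(n_seqs))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), dist in distances.items():
        if dist <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j

    counts = Counter(find(i) for i in range(n_seqs))
    return sorted((s for s in counts.values() if s >= 2), reverse=True)


def summarize(n_seqs, distances, threshold):
    """Report the cluster count and the largest/second-largest size ratio."""
    sizes = cluster_sizes(n_seqs, distances, threshold)
    largest = sizes[0] if sizes else 0
    second = sizes[1] if len(sizes) > 1 else 0
    # With fewer than two clusters the ratio is undefined; treat it as infinite.
    ratio = largest / second if second else float("inf")
    return {"threshold": threshold, "n_clusters": len(sizes), "ratio": ratio}


def pick_threshold(n_seqs, distances, candidates, max_ratio=3.0):
    """ASSUMED rule, for illustration only: among candidate thresholds whose
    ratio stays at or below max_ratio (no giant cluster), return the one
    with the most clusters, preferring the smaller threshold on ties."""
    rows = [summarize(n_seqs, distances, t) for t in candidates]
    acceptable = [r for r in rows if r["ratio"] <= max_ratio]
    if not acceptable:
        return None
    best = max(acceptable, key=lambda r: (r["n_clusters"], -r["threshold"]))
    return best["threshold"]


if __name__ == "__main__":
    candidates = [0.005, 0.010, 0.0125, 0.015, 0.020]
    for t in candidates:
        print(summarize(6, TOY_DISTANCES, t))
    print("suggested:", pick_threshold(6, TOY_DISTANCES, candidates))
```

On the toy data, the lowest threshold yields a single small cluster and the highest merges every sequence into one cluster, which mirrors the too-low and too-high failure modes the abstract describes; intermediate thresholds give several clusters of comparable size.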

19.
Sci Transl Med ; 14(633): eabk3445, 2022 Feb 23.
Article in English | MEDLINE | ID: mdl-35014856

ABSTRACT

SARS-CoV-2 evolution threatens vaccine- and natural infection-derived immunity as well as the efficacy of therapeutic antibodies. To improve public health preparedness, we sought to predict which existing amino acid mutations in SARS-CoV-2 might contribute to future variants of concern. We tested the predictive value of features comprising epidemiology, evolution, immunology, and neural network-based protein sequence modeling, and identified primary biological drivers of SARS-CoV-2 intra-pandemic evolution. We found evidence that ACE2-mediated transmissibility and resistance to population-level host immunity have waxed and waned as primary drivers of SARS-CoV-2 evolution over time. We retroactively identified with high accuracy (area under the receiver operating characteristic curve, AUROC=0.92-0.97) mutations that will spread, at up to four months in advance, across different phases of the pandemic. The behavior of the model was consistent with a plausible causal structure wherein epidemiological covariates combine the effects of diverse and shifting drivers of viral fitness. We applied our model to forecast mutations that will spread in the future and characterize how these mutations affect the binding of therapeutic antibodies. These findings demonstrate that it is possible to forecast the driver mutations that could appear in emerging SARS-CoV-2 variants of concern. We validate this result against Omicron, showing elevated predictive scores for its component mutations prior to emergence, and rapid score increase across daily forecasts during emergence. This modeling approach may be applied to any rapidly evolving pathogens with sufficiently dense genomic surveillance data, such as influenza, and unknown future pandemic viruses.


Subject(s)
COVID-19 , SARS-CoV-2 , COVID-19/virology , Humans , Mutation , Pandemics , SARS-CoV-2/genetics
20.
bioRxiv ; 2022 Jan 18.
Article in English | MEDLINE | ID: mdl-35075456

ABSTRACT

Among the 30 non-synonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (i) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (ii) interactions of Spike with ACE2 receptors, and (iii) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any genomes within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.
