1.
Preprint in English | bioRxiv | ID: ppbiorxiv-519890

ABSTRACT

Motivation: Tracking SARS-CoV-2 variants through genomic sequencing has been an important part of the global response to the pandemic. As well as whole-genome sequencing of clinical samples, this surveillance effort has been aided by amplicon sequencing of wastewater samples, which proved effective in real case studies. Because of its relevance to public healthcare decisions, testing and benchmarking wastewater sequencing analysis methods is also crucial, which necessitates a simulator. Although metagenomic simulators exist, none are fit for the purpose of simulating the metagenomes produced through amplicon sequencing of wastewater. Results: Our new simulation tool, SWAMPy (Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python), is intended to provide realistic simulated SARS-CoV-2 wastewater sequencing datasets with which other programs that rely on this type of data can be evaluated and improved. Availability: The code for this project is available at https://github.com/goldman-gp-ebi/SWAMPy. It can be installed on any Unix-based operating system and is available under the GPL-v3 license.

2.
Preprint in English | bioRxiv | ID: ppbiorxiv-498932

ABSTRACT

Bayesian phylogeographic inference is a powerful tool in molecular epidemiological studies that enables reconstructing the origin and subsequent geographic spread of pathogens. Such inference is, however, potentially affected by geographic sampling bias. Here, we investigated the impact of sampling bias on the spatiotemporal reconstruction of viral epidemics using Bayesian discrete phylogeographic models and explored different operational strategies to mitigate this impact. We considered the continuous-time Markov chain (CTMC) model and two structured coalescent approximations (BASTA and MASCOT). For each approach, we compared the estimated and simulated spatiotemporal histories in biased and unbiased conditions based on simulated epidemics of rabies virus (RABV) in dogs in Morocco. While the reconstructed spatiotemporal histories were impacted by sampling bias for the three approaches, BASTA and MASCOT reconstructions were also biased when employing unbiased samples. Increasing the number of analyzed genomes led to more robust estimates at low sampling bias for CTMC. Alternative sampling strategies that maximize the spatiotemporal coverage greatly improved the inference at intermediate sampling bias for CTMC, and to a lesser extent, for BASTA and MASCOT. In contrast, allowing for time-varying population sizes in MASCOT resulted in robust inference. We further applied these approaches to two empirical datasets: a RABV dataset from the Philippines and a SARS-CoV-2 dataset describing its early spread across the world. In conclusion, sampling biases are ubiquitous in phylogeographic analyses but may be accommodated by increasing sample size, balancing spatial and temporal composition in the samples, and informing structured coalescent models with reliable case count data.

3.
Preprint in English | bioRxiv | ID: ppbiorxiv-485312

ABSTRACT

Phylogenetics plays a crucial role in the interpretation of genomic data [1]. Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the virus's origins [2], of its international [3,4] and local [4-9] spread, and of the emergence [10] and reproductive success [11] of new variants, among many applications. These analyses have been enabled by the unparalleled volumes of genome sequence data generated and employed to study and help contain the pandemic [12]. However, preferred model-based phylogenetic approaches including maximum likelihood and Bayesian methods, mostly based on Felsenstein's pruning algorithm [13,14], cannot scale to the size of the datasets from the current pandemic [4,15], hampering our understanding of the virus's evolution and transmission [16]. We present new approaches, based on reworking Felsenstein's algorithm, for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. We exploit near-certainty regarding ancestral genomes, and the similarities between closely related and densely sampled genomes, to greatly reduce computational demands for memory and time. Combined with new methods for searching amongst candidate evolutionary trees, this results in our MAPLE (MAximum Parsimonious Likelihood Estimation) software giving better results than popular approaches such as FastTree 2 [17], IQ-TREE 2 [18], RAxML-NG [19] and UShER [15]. Our approach therefore allows complex and accurate probabilistic phylogenetic analyses of millions of microbial genomes, extending the reach of genomic epidemiology. Future epidemiological datasets are likely to be even larger than those currently associated with COVID-19, and other disciplines such as metagenomics and biodiversity science are also generating huge numbers of genome sequences [20-22]. Our methods will permit continued use of preferred likelihood-based phylogenetic analyses.
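
For readers unfamiliar with the pruning algorithm that this abstract builds on, the following is a minimal sketch of the classic, unmodified recursion for a single alignment column. It is not MAPLE's reworked version. The JC69 substitution model, the nested-list tree encoding, and all function names are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def jc69_prob(same: bool, t: float) -> float:
    """JC69 probability of ending in the same/different base after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return [L(A), L(C), L(G), L(T)] for the subtree below `node`.

    Hypothetical encoding: a leaf is a one-letter string; an internal node
    is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # An observed base has likelihood 1 for itself and 0 otherwise.
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child)
        for i, parent_base in enumerate(BASES):
            # Sum over every possible base at the child end of the branch.
            result[i] *= sum(
                jc69_prob(parent_base == child_base, t) * child_l[j]
                for j, child_base in enumerate(BASES)
            )
    return result


def site_likelihood(root) -> float:
    """Likelihood of one column, with uniform base frequencies at the root."""
    return sum(0.25 * value for value in partial_likelihoods(root))


# Three-leaf example: ((A:0.1, A:0.1):0.05, C:0.2)
example_tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("C", 0.2)]
print(site_likelihood(example_tree))
```

The recursion visits every node and sums over all four bases at each one. That per-node cost is what becomes prohibitive across millions of genomes.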

4.
Preprint in English | bioRxiv | ID: ppbiorxiv-471004

ABSTRACT

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 10 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, and ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. 
Since MP is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo, we therefore propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.
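
To make the maximum parsimony criterion discussed above concrete, here is a small sketch of Fitch's algorithm, which counts the minimum number of changes one alignment column requires on a fixed binary tree. The nested-tuple tree encoding and the function name are assumptions for illustration. The tools evaluated in the preprint use far more elaborate data structures.

```python
def fitch(node):
    """Return (candidate_states, minimum_changes) for the subtree at `node`.

    Hypothetical encoding: a leaf is a one-letter string; an internal node
    is a tuple holding exactly two children.
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state, so no extra change is needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_cost + right_cost + 1


# One site observed as A, A, C, C, T at the five tips.
example_tree = ((("A", "A"), "C"), ("C", "T"))
states, changes = fitch(example_tree)
print(states, changes)
```

Each node costs only a set intersection or union, with no branch lengths or substitution model involved. This low per-node cost is the kind of simplicity the abstract credits for letting parsimony reach much larger datasets.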

5.
Preprint in English | bioRxiv | ID: ppbiorxiv-455157

ABSTRACT

Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral recombination. Low SARS-CoV-2 mutation rates make detecting recombination difficult. Here, we develop and apply a novel phylogenomic method to exhaustively search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. We investigate a 1.6M sample tree, and identify 606 recombination events. Approximately 2.7% of sequenced SARS-CoV-2 genomes have recombinant ancestry. Recombination breakpoints occur disproportionately in the Spike protein region. Our method empowers comprehensive real time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.

6.
Preprint in English | medRxiv | ID: ppmedrxiv-21257633

ABSTRACT

The evolution of the SARS-CoV-2 pandemic continuously produces new variants, which warrant timely epidemiological characterisation. Here we use the dense genomic surveillance generated by the COVID-19 Genomics UK Consortium to reconstruct the dynamics of 71 different lineages in each of 315 English local authorities between September 2020 and June 2021. This analysis reveals a series of sub-epidemics that peaked in the early autumn of 2020, followed by a jump in transmissibility of the B.1.1.7/Alpha lineage. Alpha grew when other lineages declined during the second national lockdown and regionally tiered restrictions between November and December 2020. A third more stringent national lockdown suppressed Alpha and eliminated nearly all other lineages in early 2021. However, a series of variants (mostly containing the spike E484K mutation) defied these trends and persisted at moderately increasing proportions. Accounting for sustained introductions, however, indicates that their transmissibility is unlikely to have exceeded that of Alpha. Finally, B.1.617.2/Delta was repeatedly introduced to England and grew rapidly in the early summer of 2021, constituting approximately 98% of sampled SARS-CoV-2 genomes on June 26.

7.
Preprint in English | medRxiv | ID: ppmedrxiv-21255891

ABSTRACT

Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally, such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator, VGsim, which addresses the problem of simulating genealogies under epidemiological models. The simulation process is split into two phases. During the forward run, the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm. During the backward run, a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape. The code is freely available at https://github.com/Genomics-HSE/VGsim.
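
As a rough illustration of the forward run described above, the sketch below applies the plain Gillespie algorithm to a single-population SIR model and records the resulting chain of events. This is not VGsim's hierarchical variant and does not use VGsim's API. The function name, parameters, and the SIR model itself are assumptions made for the example.

```python
import random


def gillespie_sir(n, infected, beta, gamma, t_max, seed=0):
    """Simulate SIR events up to time t_max; return (events, final_state).

    Hypothetical parameters: n is the population size, beta the
    transmission rate and gamma the recovery rate (both per unit time).
    """
    rng = random.Random(seed)
    s, i, r = n - infected, infected, 0
    t = 0.0
    events = []
    while i > 0:
        infection_rate = beta * s * i / n
        recovery_rate = gamma * i
        total = infection_rate + recovery_rate
        # The waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Choose the event type with probability proportional to its rate.
        if rng.random() * total < infection_rate:
            s, i = s - 1, i + 1
            events.append((t, "infection"))
        else:
            i, r = i - 1, r + 1
            events.append((t, "recovery"))
    return events, (s, i, r)


events, final_state = gillespie_sir(n=1000, infected=5, beta=0.5, gamma=0.2, t_max=30.0)
print(len(events), final_state)
```

A backward pass of the kind the abstract describes would then walk this event chain in reverse to build a genealogy. That step is omitted here.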

8.
Preprint in English | bioRxiv | ID: ppbiorxiv-436637

ABSTRACT

In the absence of effective treatment, COVID-19 is likely to remain a global disease burden. Compounding this threat is the near certainty that novel coronaviruses with pandemic potential will emerge in years to come. Pan-coronavirus drugs - agents active against both SARS-CoV-2 and other coronaviruses - would address both threats. A strategy to develop such broad-spectrum inhibitors is to pharmacologically target binding sites on SARS-CoV-2 proteins that are highly conserved in other known coronaviruses, the assumption being that any selective pressure to keep a site conserved across past viruses will apply to future ones. Here, we systematically mapped druggable binding pockets on the experimental structure of fifteen SARS-CoV-2 proteins and analyzed their variation across twenty-seven α- and β-coronaviruses and across thousands of SARS-CoV-2 samples from COVID-19 patients. We find that the two most conserved druggable sites are a pocket overlapping the RNA binding site of the helicase nsp13, and the catalytic site of the RNA-dependent RNA polymerase nsp12, both components of the viral replication-transcription complex. We present the data on a public web portal (https://www.thesgc.org/SARSCoV2_pocketome/) where users can interactively navigate individual protein structures and view the genetic variability of drug binding pockets in 3D.

9.
Preprint in English | bioRxiv | ID: ppbiorxiv-435416

ABSTRACT

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.

Author summary: One of the most influential responses to the SARS-CoV-2 pandemic has been the widespread adoption of genome sequencing to keep track of viral spread and evolution. This has resulted in vast availability of genomic sequence data that, while extremely useful and promising, is also increasingly hard to store and process efficiently. An important task in the processing of this genetic data is simulation, that is, recreating potential histories of past and future virus evolution, to benchmark data analysis methods and make statistical inference. 
Here, we address the problem of efficiently simulating large numbers of closely related genomes, similar to those sequenced during SARS-CoV-2 pandemic, or indeed to most scenarios in genomic epidemiology. We develop a new algorithm to perform this task, that provides not only computational efficiency, but also extreme flexibility in terms of possible evolutionary models, allowing variation in mutation rates, non-stationary evolution, and indels; all phenomena that play an important role in SARS-CoV-2 evolution, as well as many other real-life epidemiological scenarios.
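
The Gillespie approach mentioned in this abstract can be sketched for a single short branch as follows. This simplified version picks the mutating site with a flat weighted draw instead of the multi-layered search tree the abstract describes, and it does not use phastSim's API. The function name, the fixed per-site rates, and the uniform choice of the new base are all assumptions made for illustration.

```python
import random


def simulate_branch(parent, site_rates, branch_length, rng):
    """Evolve `parent` along one branch; return (child, mutations).

    Hypothetical simplification: each site has a fixed mutation rate that
    does not depend on its current base, and the new base is uniform among
    the three alternatives.
    """
    genome = list(parent)
    total_rate = sum(site_rates)
    mutations = []
    # Time of the first mutation anywhere in the genome.
    t = rng.expovariate(total_rate)
    while t < branch_length:
        # Pick the mutating site with probability proportional to its rate.
        pos = rng.choices(range(len(genome)), weights=site_rates)[0]
        old = genome[pos]
        new = rng.choice([b for b in "ACGT" if b != old])
        genome[pos] = new
        mutations.append((pos, old, new))
        t += rng.expovariate(total_rate)
    return "".join(genome), mutations


rng = random.Random(1)
parent = "ACGTACGTAC"
child, mutations = simulate_branch(parent, [1.0] * len(parent), 0.05, rng)
print(child, mutations)
```

On a short branch the loop usually runs zero or a few times, so the work is proportional to the number of mutations, not to genome length. Only the flat weighted draw scales with the number of sites, and that lookup is the step a search tree over rates would speed up.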

10.
Preprint in English | bioRxiv | ID: ppbiorxiv-426705

ABSTRACT

The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolution, for vaccine design, and for the tracking of viral spread. We highlight and address some common genome sequence analysis pitfalls that can lead to inaccurate inference of mutation rates and selection, such as ignoring skews in the genetic code, not accounting for recurrent mutations, and assuming evolutionary equilibrium. We find that two particular mutation rates, G→U and C→U, are similarly elevated and considerably higher than all other mutation rates, causing the majority of mutations in the SARS-CoV-2 genome, and are possibly the result of APOBEC and ROS activity. These mutations also tend to occur many times at the same genome positions along the global SARS-CoV-2 phylogeny (i.e., they are very homoplasic). We observe an effect of genomic context on mutation rates, but the effect of the context is overall limited. While previous studies have suggested selection acting to decrease U content at synonymous sites, we bring forward evidence suggesting the opposite.

11.
Preprint in English | bioRxiv | ID: ppbiorxiv-314971

ABSTRACT

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of "genomic contact tracing" - that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large - and will undoubtedly grow many fold - placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient, tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach improves the speed of phylogenetic placement of new samples and data visualization by orders of magnitude, making it possible to complete the placements under real-time constraints. Our method also provides the key ingredient for maintaining a fully-updated reference phylogeny. We make these tools available to the research community through the UCSC SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for laboratories worldwide. Software Availability: UShER is available to users through the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgPhyloPlace. The source code and detailed instructions on how to compile and run UShER are available from https://github.com/yatisht/usher.

12.
Preprint in English | bioRxiv | ID: ppbiorxiv-078758

ABSTRACT

Since the start of the COVID-19 pandemic, an unprecedented number of genomic sequences of the causative virus (SARS-CoV-2) have been generated and shared with the scientific community. The unparalleled volume of available genetic data presents a unique opportunity to gain real-time insights into the virus transmission during the pandemic, but also a daunting computational hurdle if analysed with gold-standard phylogeographic approaches. We here describe and apply an analytical pipeline that is a compromise between fast and rigorous analytical steps. As a proof of concept, we focus on the epidemic in Belgium, which has one of the highest spatial densities of available SARS-CoV-2 genomes. At the global scale, our analyses confirm the importance of external introduction events in establishing multiple transmission chains in the country. At the country scale, our spatially-explicit phylogeographic analyses highlight that the national lockdown had a relatively low impact on both the lineage dispersal velocity and the long-distance dispersal events within Belgium. Our pipeline has the potential to be quickly applied to other countries or regions, with key benefits in complementing epidemiological analyses in assessing the impact of intervention measures or their progressive easement.
