SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications.
Becker, Devan; Champredon, David; Chato, Connor; Gugan, Gopi; Poon, Art.
Affiliation
  • Becker D; Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  • Champredon D; Public Health Agency of Canada, National Microbiology Laboratory, Public Health Risk Sciences Division, Guelph, Ontario, Canada.
  • Chato C; Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  • Gugan G; Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  • Poon A; Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
NAR Genom Bioinform ; 5(2): lqad038, 2023 Jun.
Article in En | MEDLINE | ID: mdl-37101658
ABSTRACT
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they were known without error. Next-generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty, which naturally leads to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these resampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply, and that the clock rate estimates for SARS-CoV-2 are much more variable than reported.
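The resampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that quality scores are Phred-scaled (error probability 10^(-Q/10)), that the error mass at each position is split equally among the three alternative bases, and that only A, C, G and T occur. The function names (`phred_to_error`, `sequence_to_matrix`, `resample_many`) are hypothetical and chosen for illustration only.

```python
import random

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability (standard Phred definition)."""
    return 10 ** (-q / 10)


def sequence_to_matrix(seq, quals):
    """Build one row per position holding probabilities for A, C, G, T.

    Assumption (not from the paper): the error probability is shared
    equally by the three bases that were not called.
    """
    matrix = []
    for base, q in zip(seq, quals):
        p_err = phred_to_error(q)
        row = [p_err / 3] * 4
        row[BASES.index(base)] = 1 - p_err  # raises ValueError for non-ACGT calls
        matrix.append(row)
    return matrix


def resample(matrix, rng):
    """Draw one plausible sequence by sampling each position independently."""
    return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in matrix)


def resample_many(seq, quals, n, seed=0):
    """Return n resampled sequences; cost grows linearly with n."""
    rng = random.Random(seed)
    matrix = sequence_to_matrix(seq, quals)
    return [resample(matrix, rng) for _ in range(n)]


if __name__ == "__main__":
    # Low-quality positions (Q=3) will vary between replicates far more
    # often than high-quality positions (Q=40).
    for replicate in resample_many("ACGTAC", [40, 40, 3, 40, 3, 40], n=5):
        print(replicate)
```

Each replicate would then be passed through the downstream analysis (for example lineage assignment or clock rate estimation), and the spread of the results across replicates reflects the sequence uncertainty.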

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: NAR Genom Bioinform Year: 2023 Document type: Article Affiliation country:
