Improved detection of low-frequency within-host variants from deep sequencing: A case study with human papillomavirus.
Mishra, Sambit K; Nelson, Chase W; Zhu, Bin; Pinheiro, Maisa; Lee, Hyo Jung; Dean, Michael; Burdett, Laurie; Yeager, Meredith; Mirabello, Lisa.
Affiliation
  • Mishra SK; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Nelson CW; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.
  • Zhu B; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Pinheiro M; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Lee HJ; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Dean M; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Burdett L; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.
  • Yeager M; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
  • Mirabello L; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.
Virus Evol ; 10(1): veae013, 2024.
Article in En | MEDLINE | ID: mdl-38455683
ABSTRACT
High-coverage sequencing allows the study of variants occurring at low frequencies within samples, but it is susceptible to false positives caused by sequencing error. Ion Torrent has a very low single-nucleotide variant (SNV) error rate and has been employed for the majority of human papillomavirus (HPV) whole-genome sequences. However, benchmarking of intrahost SNVs (iSNVs) has been challenging, partly due to limitations imposed by the HPV life cycle. We address this problem by deep sequencing three replicates for each of 31 samples of HPV type 18 (HPV18). Errors, defined as iSNVs observed in only one of three replicates, are dominated by C→T (G→A) changes, independent of trinucleotide context. True iSNVs, defined as those observed in all three replicates, instead show a more diverse SNV type distribution, with particularly elevated C→T rates in CCG context (CCG→CTG; CGG→CAG) and C→A rates in ACG context (ACG→AAG; CGT→CTT). Characterization of true iSNVs allowed us to develop two methods for detecting true variants: (1) VCFgenie, a dynamic binomial filtering tool that uses each variant's allele count and coverage instead of fixed frequency cut-offs; and (2) a machine learning binary classifier that trains eXtreme Gradient Boosting models on variant features such as quality and trinucleotide context. Each approach outperforms fixed-cut-off filtering of iSNVs, and performance is enhanced when both are used together. Our results provide improved methods for identifying true iSNVs in within-host applications across sequencing platforms, specifically using HPV18 as a case study.
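The idea behind binomial filtering by allele count and coverage can be illustrated with a minimal sketch. This is not VCFgenie's actual implementation or interface: the function names, the per-base error rate of 0.001, and the significance threshold of 0.01 are all assumptions chosen for illustration. A variant is kept only if its allele count is unlikely to have arisen from sequencing error alone at the observed coverage.

```python
from math import exp, lgamma, log, log1p


def log_binom_pmf(i: int, n: int, p: float) -> float:
    """Log of P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return log_choose + i * log(p) + (n - i) * log1p(-p)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k): chance of seeing k or more error reads among n reads."""
    if k <= 0:
        return 1.0
    # Sum the upper tail directly to avoid cancellation for small p-values.
    upper = sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, upper)


def passes_filter(allele_count: int, coverage: int,
                  error_rate: float = 0.001,  # assumed value, illustrative only
                  alpha: float = 0.01) -> bool:  # assumed threshold
    """Keep the variant if error alone is an unlikely explanation."""
    return binomial_tail(allele_count, coverage, error_rate) < alpha


# Hypothetical variants given as (allele count, coverage).
for count, cov in [(5, 1000), (5, 10000), (50, 10000)]:
    p_value = binomial_tail(count, cov, 0.001)
    print(count, cov, f"freq={count / cov:.4f}", f"p={p_value:.3g}",
          "keep" if passes_filter(count, cov) else "drop")
```

Under these assumed parameters, five supporting reads pass at a coverage of 1,000 but fail at a coverage of 10,000, because about ten error reads are expected at the higher depth. A fixed frequency cut-off cannot make this distinction, since it ignores how many reads support each call.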

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: Virus Evol Year: 2024 Document type: Article Affiliation country: United States
