Unravelling reference bias in ancient DNA datasets.

Dolenz, Stephanie; van der Valk, Tom; Jin, Chenyu; Oppenheimer, Jonas; Sharif, Muhammad Bilal; Orlando, Ludovic; Shapiro, Beth; Dalén, Love; Heintzman, Peter D

Affiliation

Dolenz S; Centre for Palaeogenetics, Svante Arrhenius väg 20C, Stockholm, SE-106 91, Sweden.
van der Valk T; Department of Geological Sciences, Stockholm University, Stockholm, SE-106 91, Sweden.
Jin C; Centre for Palaeogenetics, Svante Arrhenius väg 20C, Stockholm, SE-106 91, Sweden.
Oppenheimer J; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, SE-114 18, Sweden.
Sharif MB; Science for Life Laboratory, Stockholm, SE-171 65, Sweden.
Orlando L; Centre for Palaeogenetics, Svante Arrhenius väg 20C, Stockholm, SE-106 91, Sweden.
Shapiro B; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, SE-114 18, Sweden.
Dalén L; Department of Zoology, Stockholm University, Stockholm, SE-106 91, Sweden.
Heintzman PD; Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, 95064, United States.

Bioinformatics; 40(7), 2024 07 01.

Article in En | MEDLINE | ID: mdl-38960861

ABSTRACT

MOTIVATION:

The alignment of sequencing reads is a critical step in the characterization of ancient genomes. However, reference bias and spurious mappings pose a significant challenge, particularly as cutting-edge wet lab methods generate datasets that push the boundaries of alignment tools. Reference bias occurs when reference alleles are favoured over alternative alleles during mapping, whereas spurious mappings stem from either contamination or when endogenous reads fail to align to their correct position. Previous work has shown that these phenomena are correlated with read length but a more thorough investigation of reference bias and spurious mappings for ancient DNA has been lacking. Here, we use a range of empirical and simulated palaeogenomic datasets to investigate the impacts of mapping tools, quality thresholds, and reference genome on mismatch rates across read lengths.

RESULTS:

For these analyses, we introduce AMBER, a new bioinformatics tool for assessing the quality of ancient DNA mapping directly from BAM-files and informing on reference bias, read length cut-offs and reference selection. AMBER rapidly and simultaneously computes the sequence read mapping bias in the form of the mismatch rates per read length, cytosine deamination profiles at both CpG and non-CpG sites, fragment length distributions, and genomic breadth and depth of coverage. Using AMBER, we find that mapping algorithms and quality threshold choices dictate reference bias and rates of spurious alignment at different read lengths in a predictable manner, suggesting that optimized mapping parameters for each read length will be a key step in alleviating reference bias and spurious mappings.

AVAILABILITY AND IMPLEMENTATION:

AMBER is available for noncommercial use on GitHub (https://github.com/tvandervalk/AMBER.git). Scripts used to generate and analyse simulated datasets are available on GitHub (https://github.com/sdolenz/refbias_scripts).
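As a rough illustration of the "mismatch rates per read length" statistic mentioned in the results, the sketch below pools mismatches and aligned bases by read length and divides one by the other. It is not AMBER's implementation: the function name `mismatch_rate_by_length` and the toy input are hypothetical, and the input is assumed to be (read length, mismatch count) pairs that would in practice be extracted from BAM records (for example from an edit-distance tag such as NM, which also counts indels).

```python
from collections import defaultdict


def mismatch_rate_by_length(records):
    """Return {read_length: mismatches per aligned base}.

    `records` is an iterable of (read_length, n_mismatches) pairs.
    Hypothetical helper for illustration only, not part of AMBER.
    """
    bases = defaultdict(int)       # total aligned bases per read length
    mismatches = defaultdict(int)  # total mismatches per read length
    for read_length, n_mismatches in records:
        bases[read_length] += read_length
        mismatches[read_length] += n_mismatches
    # Pool counts first, then divide, so each base contributes equally.
    return {length: mismatches[length] / bases[length]
            for length in sorted(bases)}


# Toy data (invented): shorter reads carry proportionally more mismatches.
toy_records = [(30, 3), (30, 1), (60, 1), (60, 2), (100, 1)]
rates = mismatch_rate_by_length(toy_records)
for length, rate in rates.items():
    print(f"{length}\t{rate:.4f}")
```

An elevated rate in the shortest length bins of such a table is the kind of signal that could motivate a read length cut-off, although the appropriate threshold depends on the dataset and mapper.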

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Sequence Analysis, DNA / DNA, Ancient | Limits: Animals / Humans | Language: En | Journal: Bioinformatics | Journal subject: Medical Informatics | Year: 2024 | Document type: Article
