Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments.
Nucleic Acids Res; 43(20): e133, 2015 Nov 16.
Article in En | MEDLINE | ID: mdl-26163063
ABSTRACT
The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these data into standard mapping and peak-calling pipelines, we demonstrate a 10-fold reduction in signals in regions that overlap the blacklisted regions and identify a comprehensive set of regions that exhibit mapping sensitivity to the presence of the repeat-rich targets.
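The idea of adding extra mapping targets can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `count_assembly_reads`, the `decoy_` contig prefix, the contig names, and the alignment scores are all hypothetical, and alignments are modeled as simple `(read_id, contig, score)` tuples rather than real aligner output. The sketch assigns each read to its best-scoring hit and counts only reads whose best hit lies on an assembled contig, so reads that fit a repeat-rich target better no longer pile up on a homologous assembled region.

```python
from collections import Counter

def count_assembly_reads(alignments, decoy_prefix="decoy_"):
    """Count reads per assembled contig, using each read's best-scoring hit.

    alignments: iterable of (read_id, contig, score) tuples (toy model).
    decoy_prefix: hypothetical naming convention for the extra mapping targets.
    """
    best = {}
    for read_id, contig, score in alignments:
        # Keep the highest-scoring hit per read; ties keep the first hit seen.
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (contig, score)
    # Reads whose best hit is an extra target are excluded from the counts.
    return Counter(
        contig for contig, _ in best.values()
        if not contig.startswith(decoy_prefix)
    )

# Toy data (invented): r1 and r2 derive from a repeat that is underrepresented
# in the assembly, so their hits on chr1 are poor; r3 truly belongs to chr1.
toy_alignments = [
    ("r1", "chr1", 30), ("r1", "decoy_sat", 60),
    ("r2", "chr1", 28), ("r2", "decoy_sat", 55),
    ("r3", "chr1", 60),
]

# Mapping against the assembly alone: every read is forced onto chr1.
assembly_only = [a for a in toy_alignments if not a[1].startswith("decoy_")]
without_targets = count_assembly_reads(assembly_only)

# Mapping against the assembly plus the extra targets.
with_targets = count_assembly_reads(toy_alignments)

print(without_targets["chr1"], with_targets["chr1"])
```

In this toy case the read depth on `chr1` drops once the extra target is available, which mirrors the kind of signal reduction the abstract reports, although the numbers here are invented for illustration.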
Database: MEDLINE
Main subject: Genome, Human / Sequence Alignment / Artifacts / Genomics
Type of study: Evaluation_studies
Limits: Humans
Language: En
Journal: Nucleic Acids Res
Year: 2015
Type: Article
Affiliation country: United States