Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments.

Miga, Karen H; Eisenhart, Christopher; Kent, W James

Affiliation

Miga KH; Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA. khmiga@soe.ucsc.edu
Eisenhart C; Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
Kent WJ; Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Nucleic Acids Res; 43(20): e133, 2015 Nov 16.

Article in En | MEDLINE | ID: mdl-26163063

ABSTRACT

The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these data into standard mapping and peak-calling pipelines we demonstrate a 10-fold reduction in signals in regions common to the blacklisted region and identify a comprehensive set of regions that exhibit mapping sensitivity with the presence of the repeat-rich targets.
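The effect described in the abstract can be illustrated with a toy sketch. It is not the authors' pipeline: the read names, target names, scores, and the `assign_reads` helper are all hypothetical, and a real aligner reports alignments very differently. The sketch only assumes that each read is assigned to its single best-scoring target, so that adding a repeat-rich mapping target draws reads away from a homologous assembled region.

```python
from collections import Counter

def assign_reads(read_hits, targets):
    """Assign each read to its best-scoring target among `targets`.

    read_hits: dict mapping read id -> {target name: alignment score}
    targets:   set of target names available to the mapper
    Returns a Counter of reads per target. Ties are broken by target name.
    """
    counts = Counter()
    for read_id, hits in read_hits.items():
        allowed = [(score, name) for name, score in hits.items() if name in targets]
        if not allowed:
            continue  # read has no hit on any available target
        _, best_name = max(allowed)
        counts[best_name] += 1
    return counts

# Hypothetical toy data: three reads come from a satellite repeat that is
# missing from the reference, but they align imperfectly to an assembled region.
read_hits = {
    "r1": {"chr1_region": 90, "satellite_target": 100},
    "r2": {"chr1_region": 88, "satellite_target": 99},
    "r3": {"chr1_region": 100},
    "r4": {"chr1_region": 85, "satellite_target": 97},
}

reference_only = assign_reads(read_hits, {"chr1_region"})
with_target = assign_reads(read_hits, {"chr1_region", "satellite_target"})

print("reference only:", dict(reference_only))
print("with repeat-rich target:", dict(with_target))
```

In this toy case the read depth on the assembled region falls once the extra target is available, which is the kind of signal reduction the abstract reports for blacklisted regions.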

Subject(s)

Artifacts; Genome, Human; Genomics/methods; Sequence Alignment/methods; DNA/chemistry; Databases, Nucleic Acid; Humans; Repetitive Sequences, Nucleic Acid

Full text: 1; Collection: 01-internacional; Database: MEDLINE; Main subject: Genome, Human / Sequence Alignment / Artifacts / Genomics; Type of study: Evaluation_studies; Limits: Humans; Language: En; Journal: Nucleic Acids Res; Year: 2015; Type: Article; Affiliation country: United States