Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model.

Han, Seong Kyu; Muto, Yoshiharu; Wilson, Parker C; Humphreys, Benjamin D; Sampson, Matthew G; Chakravarti, Aravinda; Lee, Dongwon

Han SK; Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA 02115.
Muto Y; Kidney Disease Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142.
Wilson PC; Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63130.
Humphreys BD; Department of Pathology and Immunology, Washington University in St. Louis, St. Louis, MO 63130.
Sampson MG; Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO 63130.
Chakravarti A; Department of Developmental Biology, Washington University in St. Louis, St. Louis, MO 63130.
Lee D; Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA 02115.

Proc Natl Acad Sci U S A; 119(51): e2212810119, 2022 Dec 20.

Article in English | MEDLINE | ID: mdl-36508674

ABSTRACT

Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
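The core idea in the abstract, scoring a sample by how well a sequence-based model trained on its peaks predicts held-out peaks, can be illustrated with a small sketch. This is not the gkmQC implementation: the gapped k-mer SVM is replaced here by a plain k-mer frequency-difference scorer, the quality score is taken to be a held-out AUC, and every function name (`kmer_counts`, `train_weights`, `score`, `auc`, `holdout_auc`) and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k=3):
    """Count all overlapping (ungapped) k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_weights(pos, neg, k=3):
    """Assumed stand-in for an SVM: weight of a k-mer is its mean count
    in peak sequences minus its mean count in background sequences."""
    weights = defaultdict(float)
    for s in pos:
        for kmer, c in kmer_counts(s, k).items():
            weights[kmer] += c / len(pos)
    for s in neg:
        for kmer, c in kmer_counts(s, k).items():
            weights[kmer] -= c / len(neg)
    return weights


def score(seq, weights, k=3):
    """Sum of k-mer weights over the sequence; unseen k-mers score 0."""
    return sum(weights.get(kmer, 0.0) * c
               for kmer, c in kmer_counts(seq, k).items())


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def holdout_auc(pos_train, neg_train, pos_test, neg_test, k=3):
    """Train on one set of peaks/background, report AUC on held-out sequences."""
    w = train_weights(pos_train, neg_train, k)
    return auc([score(s, w, k) for s in pos_test],
               [score(s, w, k) for s in neg_test])


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    q = holdout_auc(["GATAAGATAA", "TTGATAAGCC"],
                    ["CCCCGGGGCC", "GGCCGGCCGG"],
                    ["AGATAAGG"], ["CCGGCCGG"])
    print(f"held-out AUC: {q:.2f}")
```

Under this simplification, a sample whose peaks carry consistent regulatory sequence signal yields an AUC near 1, whereas peaks dominated by noise yield an AUC near 0.5, independent of read depth.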

Subject(s)

Chromatin; Regulatory Sequences, Nucleic Acid; Chromatin/genetics; Regulatory Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Gene Expression Regulation; Genome

Keywords

chromatin accessibility; gkmQC; quality control; sequence-based model

Full text: 1 Database: MEDLINE Main subject: Chromatin / Regulatory Sequences, Nucleic Acid Study type: Prognostic_studies / Risk_factors_studies Language: En Year: 2022 Document type: Article