Effective filtering strategies to improve data quality from population-based whole exome sequencing studies.

Carson, Andrew R; Smith, Erin N; Matsui, Hiroko; Brækkan, Sigrid K; Jepsen, Kristen; Hansen, John-Bjarne; Frazer, Kelly A

Affiliation

Frazer KA; Department of Pediatrics and Rady Children's Hospital, University of California San Diego, San Diego, USA. kafrazer@ucsd.edu.

BMC Bioinformatics; 15: 125, 2014 May 02.

Article in English | MEDLINE | ID: mdl-24884706

ABSTRACT

BACKGROUND: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.

RESULTS: The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.

CONCLUSIONS: The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
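The three quality metrics quoted in the results (genotype concordance with array data, percent of discordant genotypes removed, and Ti/Tv ratio) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary layout keyed by a sample/site identifier, and the exact metric definitions are assumptions made for illustration, since the abstract does not spell them out. Genotypes are assumed to be already normalized strings such as "0/1", with `None` standing for a missing or filtered call.

```python
from typing import Dict, Iterable, Optional, Tuple

Genotype = Optional[str]  # e.g. "0/1"; None means missing or filtered out

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def concordance(seq: Dict[str, Genotype], array: Dict[str, Genotype]) -> float:
    """Fraction of genotypes called in both data sets that agree.

    Assumed definition: sites missing in either set are not compared.
    """
    pairs = [
        (gt, array[key])
        for key, gt in seq.items()
        if gt is not None and array.get(key) is not None
    ]
    if not pairs:
        raise ValueError("no genotypes called in both data sets")
    return sum(s == a for s, a in pairs) / len(pairs)


def discordant_removed(
    before: Dict[str, Genotype],
    after: Dict[str, Genotype],
    array: Dict[str, Genotype],
) -> float:
    """Fraction of initially discordant genotypes that a filter removed.

    Assumed definition: a genotype counts as removed when it is None
    (or absent) in the filtered data set.
    """
    discordant = [
        key
        for key, gt in before.items()
        if gt is not None and array.get(key) is not None and gt != array[key]
    ]
    if not discordant:
        raise ValueError("no discordant genotypes before filtering")
    return sum(after.get(key) is None for key in discordant) / len(discordant)


def ti_tv(variants: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over (ref, alt) single-nucleotide pairs."""
    variants = list(variants)
    ti = sum(pair in TRANSITIONS for pair in variants)
    tv = len(variants) - ti
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


if __name__ == "__main__":
    # Toy data only; these are not values from the study.
    array = {"s1": "0/1", "s2": "0/0", "s3": "1/1", "s4": "0/1"}
    before = {"s1": "0/1", "s2": "0/1", "s3": "1/1", "s4": "0/0"}
    after = {"s1": "0/1", "s2": None, "s3": "1/1", "s4": "0/0"}
    print(concordance(before, array), concordance(after, array))
    print(discordant_removed(before, after, array))
    print(ti_tv([("A", "G"), ("C", "T"), ("A", "C")]))
```

In this toy example a filter that drops one of two discordant calls raises concordance among the remaining genotypes, which mirrors the kind of comparison reported above, although the real thresholds and data are described only in the full article.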

Subject(s)

Exome; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Genotype; High-Throughput Nucleotide Sequencing/standards; Humans; Polymorphism, Single Nucleotide; Sequence Analysis, DNA/standards

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Sequence Analysis, DNA / High-Throughput Nucleotide Sequencing / Exome Type of study: Guideline / Prognostic_studies Limits: Humans Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2014 Document type: Article
