A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data.

Chan, Ariel W; Williams, Amy L; Jannink, Jean-Luc

Affiliation

Chan AW; Section of Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, 407 Bradfield Hall, Ithaca, NY, 14853, USA. ac2278@cornell.edu.
Williams AL; Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, 14853, USA.
Jannink JL; RW Holley Center for Agriculture and Health, United States Department of Agriculture -- Agricultural Research Service, School of Integrative Plant Sciences, Cornell University, 258 Emerson Hall, Ithaca, NY, 14853, USA.

BMC Bioinformatics ; 19(1): 478, 2018 Dec 12.

Article in English | MEDLINE | ID: mdl-30541436

ABSTRACT

BACKGROUND:

Researchers typically sequence a given individual multiple times, either re-sequencing the same DNA sample (technical replication), sequencing different DNA samples collected from the same individual (biological replication), or both. Before merging the data from these replicate sequence runs, it is important to verify that no errors, such as DNA contamination or mix-ups, occurred in the data collection pipeline. Methods to detect such errors exist but are often ad hoc and cannot handle missing data, and several require phased data. Because they require some combination of genotype calling, imputation, and haplotype phasing, these methods are unsuitable for error detection in low- to moderate-depth sequence data, where such tasks are difficult to perform accurately. Additionally, because most existing methods compare putative replicates pairwise rather than analyzing them jointly, results may be difficult to interpret.

RESULTS:

We introduce a new method for error detection suitable for shallow-, moderate-, and high-depth sequence data. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates and infer which of the samples originated from an identical genotypic source.
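As a rough illustration of the Bayes' theorem step, the following Python sketch computes a posterior distribution over hypotheses about which putative replicates share a source. It is not BIGRED's code (BIGRED is an R package) and it makes several assumptions that the abstract does not state: each "relation" is modeled as a partition of the samples into groups with a common genotypic source, the prior over partitions is uniform, and `toy_log_likelihood` is a hypothetical placeholder with made-up scores standing in for a real likelihood computed from sequence reads.

```python
import math


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing group in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new group of its own.
        yield [[first]] + partition


def toy_log_likelihood(partition):
    """Hypothetical placeholder: made-up scores, not a read-level model."""
    score = 0.0
    for group in partition:
        if "rep3" in group and len(group) > 1:
            score -= 6.0  # pretend rep3 looks unlike the other samples
    if not any("rep1" in g and "rep2" in g for g in partition):
        score -= 4.0  # pretend rep1 and rep2 look alike
    return score


def posterior(log_likelihoods, log_priors=None):
    """Bayes' theorem in log space: P(h | data) is proportional to P(data | h) P(h)."""
    n = len(log_likelihoods)
    if log_priors is None:
        log_priors = [-math.log(n)] * n  # uniform prior (an assumption)
    log_joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    # Log-sum-exp keeps the normalizing constant numerically stable.
    peak = max(log_joint)
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in log_joint))
    return [math.exp(v - log_evidence) for v in log_joint]


samples = ["rep1", "rep2", "rep3"]
hypotheses = list(set_partitions(samples))
probs = posterior([toy_log_likelihood(h) for h in hypotheses])
best = max(zip(probs, hypotheses))[1]

for p, h in sorted(zip(probs, hypotheses), reverse=True):
    print(f"{p:.4f}  {h}")
```

With these invented scores, the partition that groups rep1 with rep2 and leaves rep3 on its own receives most of the posterior mass. Scoring all hypotheses jointly, rather than comparing samples pair by pair, yields a single normalized distribution that can be read directly.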

CONCLUSIONS:

Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments. It is implemented as an R package called BIGRED (Bayes Inferred Genotype Replicate Error Detector), which is freely available for download at https://github.com/ac2278/BIGRED.

Subject(s)

Databases, Nucleic Acid/standards; Sequence Analysis, DNA/methods; Humans

Keywords

Biological replication; Error detection; Mislabeled samples; Shallow-depth sequence data; Technical replication

Full text: 1 Database: MEDLINE Main subject: Sequence Analysis, DNA / Databases, Nucleic Acid Limit: Humans Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2018 Document type: Article Affiliation country: United States