Family reunion via error correction: an efficient analysis of duplex sequencing data.

Stoler, Nicholas; Arbeithuber, Barbara; Povysil, Gundula; Heinzl, Monika; Salazar, Renato; Makova, Kateryna D; Tiemann-Boege, Irene; Nekrutenko, Anton

Affiliation

Stoler N; Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA, USA.
Arbeithuber B; Department of Biology, The Pennsylvania State University, University Park, PA, USA.
Povysil G; Institut für Biophysik, Johannes Kepler Universität, Linz, Österreich, Austria.
Heinzl M; Present Address: Institute for Genomic Medicine, Columbia University Irving Medical Center, New York, NY, USA.
Salazar R; Institut für Biophysik, Johannes Kepler Universität, Linz, Österreich, Austria.
Makova KD; Institut für Biophysik, Johannes Kepler Universität, Linz, Österreich, Austria.
Tiemann-Boege I; Department of Biology, The Pennsylvania State University, University Park, PA, USA. kdm16@psu.edu.
Nekrutenko A; Institut für Biophysik, Johannes Kepler Universität, Linz, Österreich, Austria. irene.tiemann@jku.at.

BMC Bioinformatics; 21(1): 96, 2020 Mar 04.

Article in English | MEDLINE | ID: mdl-32131723

ABSTRACT

BACKGROUND:

Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which makes it possible to distinguish true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole-genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are discarded.
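
The pooling idea described above can be illustrated with a minimal Python sketch. It is not taken from Du Novo; the function names, the `min_reads` and `threshold` parameters, and the assumption that all reads in a family are already aligned to equal length are hypothetical choices made for illustration only.

```python
from collections import Counter

def consensus(reads, min_reads=3, threshold=0.6):
    """Majority-vote consensus over equal-length reads from one strand.

    Returns None for families smaller than min_reads (hypothetical cutoff);
    emits 'N' at positions where no base reaches the threshold fraction.
    """
    if len(reads) < min_reads:
        return None
    out = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= threshold else "N")
    return "".join(out)

def duplex_consensus(strand_a, strand_b):
    """Keep a base only where the two single-strand consensuses agree."""
    return "".join(a if a == b else "N" for a, b in zip(strand_a, strand_b))
```

An error introduced during PCR or sequencing tends to appear in only a minority of reads, or on only one strand, so it is outvoted or masked. A singleton read has nothing to vote against, which is why such reads are normally unusable.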

RESULTS:

In this paper we demonstrate that a significant fraction of these reads contain PCR or sequencing errors within their duplex tags. Correcting such errors allows "reuniting" these reads with their respective families, increasing the output of the method and making it more cost-effective.
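
As a rough illustration of tag-based reuniting, the sketch below reassigns a singleton tag to an existing family when exactly one family tag lies within a small Hamming distance. This is an assumption-laden toy, not necessarily the algorithm used in Du Novo 2.0; the names `hamming` and `reunite` and the `max_dist` parameter are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def reunite(tags, max_dist=1):
    """Map singleton tags to a nearby family tag (illustrative only).

    tags: one duplex tag string per read.
    Returns {singleton_tag: family_tag} for singletons that have exactly
    one family (tag seen more than once) within max_dist mismatches.
    """
    counts = Counter(tags)
    families = [t for t, n in counts.items() if n > 1]
    corrected = {}
    for tag, n in counts.items():
        if n > 1:
            continue
        near = [f for f in families
                if len(f) == len(tag) and hamming(f, tag) <= max_dist]
        if len(near) == 1:  # skip ambiguous matches
            corrected[tag] = near[0]
    return corrected
```

For example, with three reads tagged `AACCGG` and one tagged `AACCGT`, the singleton would be mapped back to the `AACCGG` family, while an unrelated singleton such as `TTTTTT` would be left alone.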

CONCLUSIONS:

We combine an error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0. It is written in Python, C, AWK, and Bash. It is open source and readily available through Galaxy, Bioconda, and GitHub: https://github.com/galaxyproject/dunovo.

Subject(s)

User-Computer Interface; Algorithms; DNA/chemistry; DNA/metabolism; Humans; Sequence Alignment; Sequence Analysis, DNA

Keywords

Barcodes; Duplex sequence; Error correction; Low frequency variants

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: User-Computer Interface | Limits: Humans | Language: English | Journal: BMC Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2020 | Document type: Article | Affiliation country: United States
