When less is more: 'slicing' sequencing data improves read decoding accuracy and de novo assembly quality.

Lonardi, Stefano; Mirebrahim, Hamid; Wanamaker, Steve; Alpert, Matthew; Ciardo, Gianfranco; Duma, Denisa; Close, Timothy J

Affiliation

Lonardi S; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Mirebrahim H; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Wanamaker S; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Alpert M; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Ciardo G; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Duma D; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Close TJ; Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.

Bioinformatics; 31(18): 2972-80, 2015 Sep 15.

Article in En | MEDLINE | ID: mdl-25995232

ABSTRACT

MOTIVATION:

Since the invention of DNA sequencing in the 1970s, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing.

RESULTS:

We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones (in the context of the combinatorial pooling design we have recently proposed), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on 'divide and conquer': we 'slice' a large dataset into smaller samples of optimal size, decode each slice independently, and then merge the results. Experimental results on over 15 000 barley BACs and over 4000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data.

AVAILABILITY AND IMPLEMENTATION:

Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software Hashfilter can be downloaded from http://goo.gl/MIyZHs.

CONTACT:

stelo@cs.ucr.edu or timothy.close@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
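The 'slice, decode, merge' strategy described in the RESULTS can be sketched roughly as below. This is only an illustrative outline, not the authors' scripts: the function names (`slice_reads`, `merge_decodings`, `slice_decode_merge`) are hypothetical, slices are taken as contiguous chunks rather than samples of tuned size, the decoder is passed in as an arbitrary callable, and conflicts between slices are resolved by a simple majority vote, which is an assumption standing in for whatever conflict-resolution rule the published scripts use.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Sequence


def slice_reads(reads: Sequence[str], slice_size: int) -> List[List[str]]:
    """Split a large read set into consecutive slices of at most slice_size reads.

    Assumption: contiguous chunks; the abstract only says slices have an
    'optimal size', not how they are drawn.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [list(reads[i:i + slice_size]) for i in range(0, len(reads), slice_size)]


def merge_decodings(per_slice: Iterable[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Merge per-slice decodings (sequence -> clone id) by majority vote.

    Assumption: a tie between the two most frequent clones is left
    unresolved and reported as None.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for decoding in per_slice:
        for sequence, clone in decoding.items():
            votes[sequence][clone] += 1

    merged: Dict[str, Optional[str]] = {}
    for sequence, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            merged[sequence] = None  # conflicting slices, no clear winner
        else:
            merged[sequence] = ranked[0][0]
    return merged


def slice_decode_merge(
    reads: Sequence[str],
    slice_size: int,
    decode: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Decode each slice independently with `decode`, then merge the results."""
    return merge_decodings(decode(s) for s in slice_reads(reads, slice_size))
```

In this sketch `decode` is a placeholder for a real decoder that assigns sequences to BAC clones; any function that takes a list of reads and returns a sequence-to-clone mapping can be plugged in.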

Subjects

Algorithms; Computational Biology/methods; Fabaceae/genetics; High-Throughput Nucleotide Sequencing/methods; Hordeum/genetics; Sequence Analysis, DNA/methods; Software; Chromosomes, Artificial, Bacterial; Sequence Alignment

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Hordeum / Algorithms / Software / Sequence Analysis, DNA / Computational Biology / High-Throughput Nucleotide Sequencing / Fabaceae | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Publication year: 2015 | Document type: Article | Affiliation country: United States
