BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer.

Huang, Neng; Nie, Fan; Ni, Peng; Gao, Xin; Luo, Feng; Wang, Jianxin

Affiliation

Huang N; School of Computer Science and Engineering, Central South University, China.
Nie F; School of Computer Science and Engineering, Central South University, China.
Ni P; School of Computer Science and Engineering, Central South University, China.
Gao X; School of Computer Science, King Abdullah University of Science and Technology, Saudi Arabia.
Luo F; School of Computing, Clemson University, USA.
Wang J; School of Computer Science and Engineering, Central South University, China.

Brief Bioinform; 23(1), 2022 Jan 17.

Article in English | MEDLINE | ID: mdl-34619757

ABSTRACT

Long-read sequencing technology has enabled significant progress in de novo genome assembly. However, the high error rate and the wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure that fixes errors in the draft assembly and improves the reliability of genomic analysis. Existing methods treat all regions of the assembly equally, although the error distributions of these regions differ fundamentally, and achieving very high accuracy in genome assembly remains a challenging problem. Motivated by the uneven distribution of errors across regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into low-complexity (trivial) and high-complexity (complex) blocks according to statistics of the aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Because the error rates are distributed differently in trivial and complex blocks, two multitask bidirectional long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In the whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish has a higher polishing accuracy than other state-of-the-art polishers, including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels, and BlockPolish performs well in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce the errors in PacBio assemblies. The source code of BlockPolish is freely available on GitHub (https://github.com/huangnengCSU/BlockPolish).
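
The block-division step described in the abstract can be illustrated with a minimal sketch. This is not the BlockPolish implementation: the abstract does not specify which alignment statistics are used, so the per-column agreement measure, the `min_agreement` threshold of 0.8, and the function names below are all assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Tuple


def column_agreement(column: str) -> float:
    """Fraction of aligned symbols in one pileup column that equal the most common symbol."""
    if not column:
        return 0.0
    return Counter(column).most_common(1)[0][1] / len(column)


def split_into_blocks(
    pileup: List[str], min_agreement: float = 0.8  # hypothetical threshold
) -> List[Tuple[int, int, str]]:
    """Group consecutive columns into half-open (start, end, label) blocks.

    A column is labelled "trivial" when reads mostly agree and "complex" otherwise.
    This labelling rule is an assumption, not the statistic used by the authors.
    """
    blocks: List[Tuple[int, int, str]] = []
    for i, column in enumerate(pileup):
        label = "trivial" if column_agreement(column) >= min_agreement else "complex"
        if blocks and blocks[-1][2] == label:
            # Extend the current block while the label stays the same.
            start, _, _ = blocks[-1]
            blocks[-1] = (start, i + 1, label)
        else:
            blocks.append((i, i + 1, label))
    return blocks


# Toy pileup: one string per contig position, one character per aligned read
# ("-" marks a gap in that read).
pileup = ["AAAAA", "CCCCC", "GG-TG", "T-TA-", "AAAAA"]
blocks = split_into_blocks(pileup)
print(blocks)
```

In a workflow of the kind the abstract describes, the complex blocks would then be realigned and each block type passed to its own consensus model; those steps are outside the scope of this sketch.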

Subjects

High-Throughput Nucleotide Sequencing; Software; High-Throughput Nucleotide Sequencing/methods; Reproducibility of Results; Sequence Alignment; Sequence Analysis, DNA/methods

Keywords

assembly polishing; block divide-and-conquer; long reads; neural network

Full text: 1 | Database: MEDLINE | Main subject: Software / High-Throughput Nucleotide Sequencing | Type of study: Prognostic_studies | Language: English | Publication year: 2022 | Document type: Article
