Identifying and removing haplotypic duplication in primary genome assemblies.

Guan, Dengfeng; McCarthy, Shane A; Wood, Jonathan; Howe, Kerstin; Wang, Yadong; Durbin, Richard

Affiliation

Guan D; Department of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin 150001, China.
McCarthy SA; Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK.
Wood J; Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK.
Howe K; Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK.
Wang Y; Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK.
Durbin R; Department of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin 150001, China.

Bioinformatics; 36(9): 2896-2898, 2020 May 1.

Article in English | MEDLINE | ID: mdl-31971576

ABSTRACT

MOTIVATION: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

RESULTS: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.

AVAILABILITY AND IMPLEMENTATION: The source code is written in C and is available at https://github.com/dfguan/purge_dups.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

High-Throughput Nucleotide Sequencing; Software; Genome; Haplotypes; Sequence Analysis, DNA

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Software / High-Throughput Nucleotide Sequencing | Language: English | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2020 | Document type: Article | Country of affiliation: China
