Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms.

Guiglielmoni, Nadège; Houtain, Antoine; Derzelle, Alessandro; Van Doninck, Karine; Flot, Jean-François

Affiliation

Guiglielmoni N; Service Evolution Biologique et Ecologie, Université libre de Bruxelles (ULB), Avenue Franklin D. Roosevelt 50, 1050, Brussels, Belgium. nadege.guiglielmoni@ulb.be.
Houtain A; Laboratoire d'Ecologie et Génétique Evolutive, Université de Namur, Rue de Bruxelles 61, 5000, Namur, Belgium.
Derzelle A; Laboratoire d'Ecologie et Génétique Evolutive, Université de Namur, Rue de Bruxelles 61, 5000, Namur, Belgium.
Van Doninck K; Laboratoire d'Ecologie et Génétique Evolutive, Université de Namur, Rue de Bruxelles 61, 5000, Namur, Belgium.
Flot JF; Département de Biologie des Organismes, Université libre de Bruxelles (ULB), Avenue Franklin D. Roosevelt 50, 1050, Brussels, Belgium.

BMC Bioinformatics; 22(1): 303, 2021 Jun 05.

Article in English | MEDLINE | ID: mdl-34090340

ABSTRACT

BACKGROUND: Long-read sequencing is revolutionizing genome assembly: as PacBio and Nanopore technologies become more accessible, both technically and in cost, long-read assemblers flourish and are starting to deliver chromosome-level assemblies. However, these long reads are usually error-prone, making the generation of a haploid reference out of a diploid genome a difficult enterprise. Failure to properly collapse haplotypes results in fragmented and structurally incorrect assemblies and wreaks havoc on orthology inference pipelines, yet this serious issue is rarely acknowledged and dealt with in genomic projects, and an independent, comparative benchmark of the capacity of assemblers and post-processing tools to properly collapse or purge haplotypes is still lacking.

RESULTS: We tested different assembly strategies on the genome of the rotifer Adineta vaga, a non-model organism for which high coverages of both PacBio and Nanopore reads were available. The assemblers we tested (Canu, Flye, NextDenovo, Ra, Raven, Shasta and wtdbg2) exhibited strikingly different behaviors when dealing with highly heterozygous regions, resulting in variable amounts of uncollapsed haplotypes. Filtering reads generally improved haploid assemblies, and we also benchmarked three post-processing tools aimed at detecting and purging uncollapsed haplotypes in long-read assemblies: HaploMerger2, purge_haplotigs and purge_dups.

CONCLUSIONS: We provide a thorough evaluation of popular assemblers on a non-model eukaryote genome with variable levels of heterozygosity. Our study highlights several strategies using pre- and post-processing approaches to generate haploid assemblies with high continuity and completeness. This benchmark will help users to improve haploid assemblies of non-model organisms and to evaluate the quality of their own assemblies.

Subject(s)

Genomics; High-Throughput Nucleotide Sequencing; Genome; Haplotypes; Sequence Analysis, DNA

Keywords

Genome assembly; Haplotype collapsing; Long reads

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Genomics / High-Throughput Nucleotide Sequencing | Type of study: Prognostic_studies | Language: En | Journal: BMC Bioinformatics | Journal subject: Medical Informatics | Year: 2021 | Document type: Article | Affiliation country: Belgium
