Comparison and benchmark of structural variants detected from long read and long-read assembly.

Lin, Jiadong; Jia, Peng; Wang, Songbo; Kosters, Walter; Ye, Kai

Affiliation

Lin J; MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Jia P; School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Wang S; Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061 China.
Kosters W; Leiden Institute of Advanced Computer Science, Faculty of Science, Leiden University, Leiden 2311 EZ, The Netherlands.
Ye K; MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.

Brief Bioinform; 24(4), 2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37200087

ABSTRACT

Structural variant (SV) detection is essential for genomic studies, and long-read sequencing technologies have advanced our capacity to detect SVs directly from reads or from de novo assemblies, known as the read-based and assembly-based strategies, respectively. However, to date, no independent study has compared and benchmarked the two strategies. Here, using SVs detected by 20 read-based and eight assembly-based detection pipelines from six datasets of the HG002 genome, we investigated the factors that influence the two strategies and assessed their performance against well-curated SVs. We found that up to 80% of the SVs could be detected by both strategies across different long-read datasets, whereas the variant type, size, and breakpoint detected by the read-based strategy were strongly affected by the aligner. For high-confidence insertions and deletions in non-tandem repeat regions, a large subset (82% of assembly-based calls and 93% of read-based calls), accounting for around 4000 SVs, could be captured by both reads and assemblies. However, discordance between the two strategies was largely caused by complex SVs and inversions, resulting from inconsistent alignment of reads and assemblies at these loci. Finally, when benchmarked with SVs at medically relevant genes, the recall of the read-based strategy reached 77% on 5X coverage data, whereas the assembly-based strategy required 20X coverage data to achieve similar performance. Therefore, integrating SVs from reads and assemblies is suggested for general-purpose detection because complex SVs and inversions are detected inconsistently, whereas the assembly-based strategy is optional for applications with limited resources.

Subjects

Benchmarking; Genome, Human; Humans; Sequence Analysis; Genomics/methods; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods

Keywords

long-read sequencing; sequence assembly; structural variant detection

Full text: 1 | Database: MEDLINE | Main subject: Genome, Human / Benchmarking | Limits: Humans | Language: English | Year of publication: 2023 | Document type: Article
