Halvade somatic: Somatic variant calling with Apache Spark.
Decap, Dries; de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Costanza, Pascal; Herzeel, Charlotte; Wuyts, Roel; Marchal, Kathleen; Fostier, Jan.
Affiliation
  • Decap D; IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium.
  • de Schaetzen van Brienen L; IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium.
  • Larmuseau M; IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium.
  • Costanza P; Intel, Veldkant 31, B-2550 Kontich, Belgium.
  • Herzeel C; imec, Kapeldreef 75, B-3001 Leuven, Belgium.
  • Wuyts R; imec, Kapeldreef 75, B-3001 Leuven, Belgium.
  • Marchal K; IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium.
  • Fostier J; IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium.
GigaScience; 11(1), 2022-01-12.
Article in English | MEDLINE | ID: mdl-35022699
ABSTRACT

BACKGROUND:

The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample.

FINDINGS:

We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud.
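The speedup figures quoted above follow from simple runtime ratios. The short sketch below recomputes them from the rounded runtimes given in the abstract; the function and variable names are illustrative and not part of Halvade Somatic. The parallel efficiency and the overall speedup are derived quantities that the abstract does not state. Because the published runtimes are rounded, the recomputed multi-node ratio comes out near 14.3 rather than the reported 14.4.

```python
def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_hours / improved_hours


# Runtimes quoted in the abstract, in hours (rounded as published).
ORIGINAL_PIPELINE_H = 84.5  # original pipeline, single 36-core node
SINGLE_NODE_H = 19.5        # Halvade Somatic, single 36-core node
SIXTEEN_NODES_H = 1.36      # Halvade Somatic, 16 nodes
NODES = 16

single_node_speedup = speedup(ORIGINAL_PIPELINE_H, SINGLE_NODE_H)
multi_node_speedup = speedup(SINGLE_NODE_H, SIXTEEN_NODES_H)

# Derived values, not stated in the abstract.
overall_speedup = speedup(ORIGINAL_PIPELINE_H, SIXTEEN_NODES_H)
parallel_efficiency = multi_node_speedup / NODES

print(f"single node vs. original pipeline: {single_node_speedup:.1f}x")
print(f"16 nodes vs. single node:          {multi_node_speedup:.1f}x")
print(f"16 nodes vs. original pipeline:    {overall_speedup:.1f}x")
print(f"parallel efficiency on 16 nodes:   {parallel_efficiency:.0%}")
```

Read this way, the multi-node run retains roughly nine tenths of ideal linear scaling when going from 1 to 16 nodes, under the assumption that all nodes are comparable to the single 36-core node.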

CONCLUSIONS:

To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Software / High-Throughput Nucleotide Sequencing | Type of study: Guideline | Language: En | Journal: Gigascience | Year of publication: 2022 | Document type: Article | Country of affiliation: Belgium | Country of publication: United States