Search | Secretaria de Estado da Saúde

A comparison of three programming languages for a full-fledged next-generation sequencing tool.

Costanza, Pascal; Herzeel, Charlotte; Verachtert, Wilfried.

BMC Bioinformatics ; 20(1): 301, 2019 Jun 03.

Article in English | MEDLINE | ID: mdl-31159721

ABSTRACT

BACKGROUND: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. RESULTS: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. CONCLUSIONS: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

Subjects

High-Throughput Nucleotide Sequencing/methods , Programming Languages , Benchmarking , Humans , Software , Time Factors

Halvade: scalable sequence analysis with MapReduce.

Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan.

Bioinformatics ; 31(15): 2482-8, 2015 Aug 01.

Article in English | MEDLINE | ID: mdl-25819078

ABSTRACT

MOTIVATION: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. RESULTS: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

Subjects

DNA Sequence Analysis/methods , Software , Human Genome , Humans

A software package for efficient patient trajectory analysis applied to analyzing bladder cancer development.

Herzeel, Charlotte; D'Hondt, Ellie; Vandeweerd, Valerie; Botermans, Wouter; Akand, Murat; Van der Aa, Frank; Wuyts, Roel; Verachtert, Wilfried.

PLOS Digit Health ; 2(11): e0000384, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37992021

ABSTRACT

We present the Patient Trajectory Analysis Library (PTRA), a software package for explorative analysis of patient development. PTRA provides the tools for extracting statistically relevant trajectories from the medical event histories of a patient population. These trajectories can additionally be clustered for visual inspection and identifying key events in patient progression. The algorithms of PTRA are based on a statistical method developed previously by Jensen et al, but we contribute several modifications and extensions to enable the implementation of a practical tool. This includes a new clustering strategy, filter mechanisms for controlling analysis to specific cohorts and for controlling trajectory output, a parallel implementation that executes on a single server rather than a high-performance computing (HPC) cluster, etc. PTRA is furthermore open source and the code is organized as a framework so researchers can reuse it to analyze new data sets. We illustrate our tool by discussing trajectories extracted from the TriNetX Dataworks database for analyzing bladder cancer development. We show this experiment uncovers medically sound trajectories for bladder cancer.

Halvade somatic: Somatic variant calling with Apache Spark.

Decap, Dries; de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Costanza, Pascal; Herzeel, Charlotte; Wuyts, Roel; Marchal, Kathleen; Fostier, Jan.

Gigascience ; 11(1), 2022 Jan 12.

Article in English | MEDLINE | ID: mdl-35022699

ABSTRACT

BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. FINDINGS: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. 
CONCLUSIONS: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.

Subjects

High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Exome Sequencing , Whole Genome Sequencing
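
The speedup figures in the Halvade Somatic abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration in Go: the function names `Speedup` and `Efficiency` are hypothetical, and the inputs are the rounded runtimes quoted in the abstract (84.5 h, 19.5 h, 1.36 h). With those rounded values the multi-node speedup comes out at about 14.3 rather than the reported 14.4; the difference is presumably due to rounding of the published runtimes, which is an assumption here.

```go
package main

import "fmt"

// Speedup returns how many times faster newHours is than baselineHours.
func Speedup(baselineHours, newHours float64) float64 {
	return baselineHours / newHours
}

// Efficiency returns the speedup divided by the number of workers
// (here: nodes), i.e. the fraction of ideal linear scaling achieved.
func Efficiency(speedup float64, workers int) float64 {
	return speedup / float64(workers)
}

func main() {
	// Runtimes as quoted (rounded) in the abstract.
	single := Speedup(84.5, 19.5) // original pipeline vs. one 36-core node
	multi := Speedup(19.5, 1.36)  // one node vs. 16 nodes

	fmt.Printf("single node vs. original pipeline: %.1fx\n", single)
	fmt.Printf("16 nodes vs. single node: %.1fx\n", multi)
	fmt.Printf("parallel efficiency on 16 nodes: %.2f\n", Efficiency(multi, 16))
}
```

Run as written, this prints 4.3x, 14.3x, and an efficiency of 0.90 across the 16 nodes.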

Multithreaded variant calling in elPrep 5.

Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Wuyts, Roel; Verachtert, Wilfried.

PLoS One ; 16(2): e0244471, 2021.

Article in English | MEDLINE | ID: mdl-33539352

ABSTRACT

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces identical BAM and VCF output as GATK4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the runtime of the variant calling pipeline by a factor 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK4. This makes elPrep 5 a suitable drop-in replacement for GATK4 when faster execution times are needed.

Subjects

Exome , Human Genome , DNA Sequence Analysis/methods , Software , Algorithms , Humans , Exome Sequencing

Comparing Ease of Programming in C++, Go, and Java for Implementing a Next-Generation Sequencing Tool.

Costanza, Pascal; Herzeel, Charlotte; Verachtert, Wilfried.

Evol Bioinform Online ; 15: 1176934319869015, 2019.

Article in English | MEDLINE | ID: mdl-31452597

ABSTRACT

elPrep is an extensible multithreaded software framework for efficiently processing Sequence Alignment/Map (SAM)/Binary Alignment/Map (BAM) files in next-generation sequencing pipelines. Similar to other SAM/BAM tools, a key challenge in elPrep is memory management, as such programs need to manipulate large amounts of data. We therefore investigated 3 programming languages with support for assisted or automated memory management for implementing elPrep, namely C++, Go, and Java. We implemented a nontrivial subset of elPrep in all 3 programming languages and compared them by benchmarking their runtime performance and memory use to determine the best language in terms of computational performance. In a previous article, we motivated why, based on these results, we eventually selected Go as our implementation language. In this article, we discuss the difficulty of achieving the best performance in each language in terms of programming language constructs and standard library support. While benchmarks are easy to objectively measure and evaluate, this is less obvious for assessing ease of programming. However, because we expect elPrep to be regularly modified and extended, this is an equally important aspect. We illustrate representative examples of challenges in all 3 languages, and give our opinion why we think that Go is a reasonable choice also in this light.

elPrep 4: A multithreaded framework for sequence analysis.

Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Verachtert, Wilfried.

PLoS One ; 14(2): e0209523, 2019.

Article in English | MEDLINE | ID: mdl-30759172

ABSTRACT

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

Subjects

Sequence Analysis/methods , Algorithms , Computational Biology/economics , Computational Biology/methods , Costs and Cost Analysis , Exome , Sequence Analysis/economics , Software

Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce.

Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan.

PLoS One ; 12(3): e0174575, 2017.

Article in English | MEDLINE | ID: mdl-28358893

ABSTRACT

Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28 h, Halvade-RNA reduces this runtime to ∼2 h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.

Subjects

High-Throughput Nucleotide Sequencing/methods , RNA/genetics , Software , Transcriptome/genetics , Algorithms , Computational Biology , Genomics , Single Nucleotide Polymorphism , DNA Sequence Analysis
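
The Halvade-RNA abstract describes using the MapReduce programming model to create parallel data streams that are processed concurrently. The sketch below illustrates only the general shape of that model in Go; it is not Halvade's code, which is written in Java on Hadoop. The `Read` type, the keying of reads by fixed-size genomic region, and the functions `MapPhase` and `ReducePhase` are all assumptions made for illustration, and counting reads stands in for the real per-region work.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Read is a hypothetical, minimal stand-in for an aligned read.
type Read struct {
	Contig string
	Pos    int
}

// MapPhase groups reads by a region key (contig plus region index).
// Keying by fixed-size region is an assumption of this sketch.
func MapPhase(reads []Read, regionSize int) map[string][]Read {
	groups := make(map[string][]Read)
	for _, r := range reads {
		key := fmt.Sprintf("%s:%d", r.Contig, r.Pos/regionSize)
		groups[key] = append(groups[key], r)
	}
	return groups
}

// ReducePhase processes every group concurrently, one goroutine per key,
// and collects the per-region results under a mutex.
func ReducePhase(groups map[string][]Read, work func([]Read) int) map[string]int {
	results := make(map[string]int, len(groups))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for key, group := range groups {
		wg.Add(1)
		go func(key string, group []Read) {
			defer wg.Done()
			n := work(group)
			mu.Lock()
			results[key] = n
			mu.Unlock()
		}(key, group)
	}
	wg.Wait()
	return results
}

func main() {
	reads := []Read{{"chr1", 100}, {"chr1", 900}, {"chr1", 1500}, {"chr2", 50}}
	// Counting reads per region stands in for the real per-region analysis.
	counts := ReducePhase(MapPhase(reads, 1000), func(g []Read) int { return len(g) })

	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stable output
	for _, k := range keys {
		fmt.Println(k, counts[k])
	}
}
```

Because each group is independent, the reduce step can be spread across cores or, in a real MapReduce framework, across nodes.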

elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling.

Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Reumers, Joke.

PLoS One ; 10(7): e0132868, 2015.

Article in English | MEDLINE | ID: mdl-26182406

ABSTRACT

elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture that allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1:40 hours, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost.

Subjects

Algorithms , Exome , Human Genome , Sequence Alignment/economics , Software , Benchmarking , Contig Mapping , High-Throughput Nucleotide Sequencing , Humans , Single Nucleotide Polymorphism , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data
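
The elPrep abstract attributes its performance to making a single pass through the data regardless of how many preparation steps the pipeline has. One way to picture this is to merge per-record steps into a single function that is applied once per alignment. The Go sketch below is a minimal illustration under that assumption. It is not elPrep's API: the `Alignment` type, the `Step` signature, and the two example steps are hypothetical. It is also sequential, and it leaves out steps that need to see all reads at once, such as sorting.

```go
package main

import "fmt"

// Alignment is a hypothetical, heavily simplified stand-in for a SAM/BAM record.
type Alignment struct {
	Name   string
	Mapped bool
	Contig string
}

// Step may modify one alignment and reports whether it should be kept.
type Step func(a *Alignment) bool

// Compose merges several steps into one, so that the data only has to be
// traversed once no matter how many steps are in the pipeline.
func Compose(steps ...Step) Step {
	return func(a *Alignment) bool {
		for _, s := range steps {
			if !s(a) {
				return false // dropped by this step; skip the remaining ones
			}
		}
		return true
	}
}

// RunSinglePass applies the merged step in a single loop over the input.
func RunSinglePass(in []Alignment, steps ...Step) []Alignment {
	merged := Compose(steps...)
	out := make([]Alignment, 0, len(in))
	for i := range in {
		a := in[i] // work on a copy so the input slice is left untouched
		if merged(&a) {
			out = append(out, a)
		}
	}
	return out
}

// dropUnmapped is an example filter step.
func dropUnmapped(a *Alignment) bool { return a.Mapped }

// renameContig is an example rewriting step.
func renameContig(a *Alignment) bool {
	if a.Contig == "1" {
		a.Contig = "chr1"
	}
	return true
}

func main() {
	reads := []Alignment{
		{Name: "r1", Mapped: true, Contig: "1"},
		{Name: "r2", Mapped: false, Contig: "*"},
		{Name: "r3", Mapped: true, Contig: "chr2"},
	}
	for _, a := range RunSinglePass(reads, dropUnmapped, renameContig) {
		fmt.Println(a.Name, a.Contig)
	}
}
```

Adding a third or fourth step changes only the argument list of `RunSinglePass`; the number of passes over the data stays at one.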
