Search | VHL Search Portal

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo.

Bioinformatics ; 30(1): 119-20, 2014 Jan 01.

Article in English | MEDLINE | ID: mdl-24149054

ABSTRACT

SUMMARY: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. AVAILABILITY AND IMPLEMENTATION: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

Subject(s)

High-Throughput Screening Assays/methods , Software Design

Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

Niemenmaa, Matti; Kallio, Aleksi; Schumacher, André; Klemelä, Petri; Korpelainen, Eija; Heljanko, Keijo.

Bioinformatics ; 28(6): 876-7, 2012 Mar 15.

Article in English | MEDLINE | ID: mdl-22302568

ABSTRACT

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Genome , User-Computer Interface

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL