Search | BVS Regional Portal

SeQual-Stream: approaching stream processing to quality control of NGS datasets.

Castellanos-Rodríguez, Óscar; Expósito, Roberto R; Touriño, Juan.

BMC Bioinformatics ; 24(1): 403, 2023 Oct 27.

Article in English | MEDLINE | ID: mdl-37891497

ABSTRACT

BACKGROUND: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, needing to have the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in those scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for its parallel processing. RESULTS: In this paper we present SeQual-Stream, a streaming tool that allows performing multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results have shown significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability features. CONCLUSION: Our solution provides a more scalable and higher performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.
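The contrast between batch and stream processing described in this abstract can be illustrated with a small sketch. It is not SeQual-Stream code and does not use Spark or HDFS: it only shows, with plain Python generators, how FASTQ reads can be filtered one by one as they arrive instead of after the whole dataset is available. The function names, the thresholds (`min_mean_quality`, `min_length`) and the Phred+33 encoding are all assumptions chosen for illustration.

```python
import io
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (header, bases, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield well-formed 4-line FASTQ records one at a time from a line iterator."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        bases = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        quality = next(it).rstrip("\n")
        yield header, bases, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def quality_filter(reads: Iterable[Read],
                   min_mean_quality: float = 20.0,
                   min_length: int = 4) -> Iterator[Read]:
    """Pass through only reads that are long enough and of sufficient quality."""
    for header, bases, quality in reads:
        if len(bases) >= min_length and mean_phred(quality) >= min_mean_quality:
            yield header, bases, quality


# Any line iterator works as input: a file, a socket wrapper, or this in-memory stand-in.
stream = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"   # high quality, kept
    "@r2\nACGTACGT\n+\n!!!!!!!!\n"   # quality 0, dropped
    "@r3\nAC\n+\nII\n"               # too short, dropped
)
kept = [header for header, _, _ in quality_filter(parse_fastq(stream))]
print(kept)  # ['@r1']
```

Because both stages are generators, each read is filtered as soon as its four lines have been read, which is the property a streaming tool exploits while a download or copy is still in progress.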

Subjects

Genomics , Software , Genomics/methods , Genome , Base Sequence , Algorithms , High-Throughput Nucleotide Sequencing/methods

SparkEC: speeding up alignment-based DNA error correction tools.

Expósito, Roberto R; Martínez-Sánchez, Marco; Touriño, Juan.

BMC Bioinformatics ; 23(1): 464, 2022 Nov 07.

Article in English | MEDLINE | ID: mdl-36344928

ABSTRACT

BACKGROUND: In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. RESULTS: In this paper, we present SparkEC, a parallel tool capable of fixing those errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework together with the introduction of other enhancements such as the usage of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results have shown significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. CONCLUSION: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC speed can increase with respect to the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).

Subjects

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Algorithms , DNA/genetics

HPC Tools to Deal with Microarray Data.

González-Domínguez, Jorge; Expósito, Roberto R.

Methods Mol Biol ; 1986: 227-243, 2019.

Article in English | MEDLINE | ID: mdl-31115891

ABSTRACT

Parallel and high performance computing has been gaining attention in recent years as a means to accelerate several kinds of computationally expensive applications. This chapter reviews research works and publicly available tools that target the acceleration of microarray data analysis by exploiting high performance computing systems.

Subjects

Computing Methodologies , Microarray Analysis/methods , Cloud Computing , Epistasis, Genetic , Gene Regulatory Networks

HSRA: Hadoop-based spliced read aligner for RNA sequencing data.

Expósito, Roberto R; González-Domínguez, Jorge; Touriño, Juan.

PLoS One ; 13(7): e0201483, 2018.

Article in English | MEDLINE | ID: mdl-30063721

ABSTRACT

Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user's guide are publicly available for download at http://hsra.dec.udc.es.

Subjects

Big Data , High-Throughput Nucleotide Sequencing , RNA Folding , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software

ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems.

González-Domínguez, Jorge; Expósito, Roberto R.

PLoS One ; 13(4): e0194361, 2018.

Article in English | MEDLINE | ID: mdl-29608567

ABSTRACT

Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.
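The phrase "grouping the binary information into patterns" can be made concrete with a simplified, sequential sketch. It is written in the spirit of pattern-based binary biclustering and is not the BiBit or ParBiBit implementation: the abstract does not give the algorithm's details, so the pairwise-AND strategy, the function names and the `min_rows`/`min_cols` thresholds below are assumptions for illustration only.

```python
from itertools import combinations


def encode_rows(matrix):
    """Pack each binary row into a Python int, one bit per column (leftmost column = highest bit)."""
    return [int("".join(str(v) for v in row), 2) for row in matrix]


def pattern_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (row_indices, col_indices) pairs whose submatrix is all ones.

    Assumed strategy: the bitwise AND of every pair of rows is a candidate
    pattern; every row that contains the pattern joins the bicluster.
    """
    rows = encode_rows(matrix)
    n_cols = len(matrix[0])
    seen = set()
    result = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue  # already processed, or too few shared columns
        seen.add(pattern)
        members = [r for r, bits in enumerate(rows) if (bits & pattern) == pattern]
        if len(members) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> (n_cols - 1 - c)) & 1]
            result.append((members, cols))
    return result


data = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = pattern_biclusters(data)
for members, cols in biclusters:
    print("rows", members, "cols", cols)
```

The outer loop over row pairs is independent per pair, which is the kind of work that a threaded or MPI implementation could distribute across cores and nodes.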

Subjects

Software , Cluster Analysis , Programming Languages , Web Browser

MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud.

Expósito, Roberto R; Veiga, Jorge; González-Domínguez, Jorge; Touriño, Juan.

Bioinformatics ; 33(17): 2762-2764, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28475668

ABSTRACT

SUMMARY: This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool. AVAILABILITY AND IMPLEMENTATION: Source code in Java and Hadoop as well as a user's guide are freely available under the GNU GPLv3 license at http://mardre.des.udc.es . CONTACT: rreye@udc.es.
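To show how duplicate and near-duplicate removal fits the MapReduce model mentioned in this summary, here is a single-process sketch of the map, shuffle and reduce phases. It is not MarDRe code and does not use Hadoop. Keying reads by a fixed-length prefix, comparing reads by Hamming distance, and the parameters `prefix_len` and `max_mismatches` are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(reads, prefix_len=4):
    """Map: emit (key, read) pairs, using the read prefix as the grouping key (assumed scheme)."""
    for read in reads:
        yield read[:prefix_len], read


def shuffle(pairs):
    """Shuffle: gather all values that share a key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def reduce_phase(group, max_mismatches=1):
    """Reduce: keep a read only if it is not within max_mismatches of a read already kept."""
    kept = []
    for read in group:
        is_duplicate = any(
            len(read) == len(k) and hamming(read, k) <= max_mismatches
            for k in kept
        )
        if not is_duplicate:
            kept.append(read)
    return kept


reads = ["ACGTAAAA", "ACGTAAAA", "ACGTAAAT", "ACGTCCCC", "TTTTGGGG"]
unique = [
    read
    for group in shuffle(map_phase(reads)).values()
    for read in reduce_phase(group)
]
print(unique)  # ['ACGTAAAA', 'ACGTCCCC', 'TTTTGGGG']
```

Each key's group can be reduced independently of the others, so on a cluster the reduce calls could run in parallel on different nodes.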

Subjects

Sequence Analysis, DNA/methods , Software , Algorithms
