Search | BVS (Virtual Health Library) Search Portal

A big data approach to metagenomics for all-food-sequencing.

Kobus, Robin; Abuín, José M; Müller, André; Hellmann, Sören Lukas; Pichel, Juan C; Pena, Tomás F; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.

Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
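To make the idea of alignment-free, k-mer based detection and quantification concrete, here is a minimal sketch. It is not the AFS-MetaCache implementation: the exact k-mer lookup, the function names (`build_index`, `classify`, `quantify`), the `min_hits` threshold, and the toy sequences are all assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def classify(read, index, k, min_hits=2):
    """Return the reference with the most k-mer hits.

    Returns None when there are too few hits or the top two references tie
    (hypothetical rule, meant to limit false positives).
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    ranked = votes.most_common(2)
    if not ranked or ranked[0][1] < min_hits:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def quantify(reads, index, k):
    """Estimate composition as the fraction of classified reads per reference."""
    counts = Counter(classify(read, index, k) for read in reads)
    counts.pop(None, None)  # drop unclassified reads
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}

# Toy data (made up for this example, not real genomes).
K = 4
references = {
    "species_a": "ACGTACGTGGCCTTAA",
    "species_b": "TTGACCAGTCAGGATC",
}
index = build_index(references, K)
reads = ["ACGTGGCC", "GTACGTGG", "CCAGTCAG", "GGGGGGGG"]
composition = quantify(reads, index, K)
```

No read is ever aligned to a genome: each read is assigned by counting shared k-mers, which is why this family of methods can be much faster than alignment-based pipelines. Unclassified reads are left out of the composition estimate rather than forced onto a reference.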

Subjects

Big Data, Food Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Whole Genome Sequencing/methods, Biosurveillance, Bacterial Genome, Metagenome, Microbiota/genetics, Software

PASTASpark: multiple sequence alignment meets Big Data.

Abuín, José M; Pena, Tomás F; Pichel, Juan C.

Bioinformatics ; 33(18): 2948-2950, 2017 Sep 15.

Article in English | MEDLINE | ID: mdl-28582480

ABSTRACT

MOTIVATION: One basic step in many bioinformatics analyses is multiple sequence alignment. One of the state-of-the-art tools for multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading, but it is limited to processing datasets on shared-memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most time-consuming task. RESULTS: Speedups of up to 10× with respect to single-threaded PASTA were observed, which makes it possible to process an ultra-large dataset of 200 000 sequences within the 24-h limit. AVAILABILITY AND IMPLEMENTATION: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark. CONTACT: josemanuel.abuin@usc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Computational Biology/methods, Sequence Alignment/methods, Software, Algorithms

BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

Bioinformatics ; 31(24): 4003-5, 2015 Dec 15.

Article in English | MEDLINE | ID: mdl-26323715

ABSTRACT

BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Substantial reductions in execution time were observed when using this tool. In addition, BigBWA is fault tolerant and does not require any modification of the original BWA source code. AVAILABILITY AND IMPLEMENTATION: BigBWA is available at the project GitHub repository: https://github.com/citiususc/BigBWA.

Subjects

Sequence Alignment/methods, Software, Algorithms, Genomics

Big Data in metagenomics: Apache Spark vs MPI.

Abuín, José M; Lopes, Nuno; Ferreira, Luís; Pena, Tomás F; Schmidt, Bertil.

PLoS One ; 15(10): e0239741, 2020.

Article in English | MEDLINE | ID: mdl-33022000

ABSTRACT

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information on distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem itself and ignore low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also highly important when dealing with growing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters which is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
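As a reading aid, the reported ratios can be converted into relative savings. The short sketch below uses only the four figures quoted in the abstract (1.65×, 3.11×, 48.12%, 55.56%); the helper names are hypothetical, and no absolute run times or memory sizes are implied.

```python
def time_saved(speedup):
    """Fraction of the baseline run time saved by a given speedup factor."""
    return 1.0 - 1.0 / speedup

def memory_saved(percent_of_baseline):
    """Fraction of the baseline memory saved, given usage as a percentage of it."""
    return 1.0 - percent_of_baseline / 100.0

# Figures reported for 32 processes, MetaCache-MPI relative to MetaCacheSpark.
build_time_saved = time_saved(1.65)     # database build
query_time_saved = time_saved(3.11)     # database query
build_mem_saved = memory_saved(48.12)
query_mem_saved = memory_saved(55.56)

for label, value in [
    ("build time saved", build_time_saved),
    ("query time saved", query_time_saved),
    ("build memory saved", build_mem_saved),
    ("query memory saved", query_mem_saved),
]:
    print(f"{label}: {value:.1%}")
```

In other words, a 1.65× speedup corresponds to roughly 39% less build time and a 3.11× speedup to roughly 68% less query time, alongside memory reductions of about 52% and 44% respectively.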

Subjects

Big Data, Bacterial Genome/genetics, Metagenome/genetics, Metagenomics, Algorithms, Computing Methodologies, DNA/genetics, Software

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

PLoS One ; 11(5): e0155461, 2016.

Article in English | MEDLINE | ID: mdl-27182962

ABSTRACT

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This has a major impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of the big data technology Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, under a GPL3 license.

Subjects

Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing, Software, Humans, Reproducibility of Results, DNA Sequence Analysis/methods, Web Browser, Workflow
