Search | VHL Regional Portal

Big Data in metagenomics: Apache Spark vs MPI.

Abuín, José M; Lopes, Nuno; Ferreira, Luís; Pena, Tomás F; Schmidt, Bertil.

PLoS One ; 15(10): e0239741, 2020.

Article in English | MEDLINE | ID: mdl-33022000

ABSTRACT

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
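
The ratios reported in this abstract can be restated as relative run time and memory savings. The sketch below is only a reading aid: it treats MetaCacheSpark as the baseline and uses nothing but the four figures quoted above. The function names are hypothetical, and no absolute run times or memory sizes are implied.

```python
def relative_time(speedup: float) -> float:
    """Fraction of the baseline run time needed by a version that is `speedup` times faster."""
    return 1.0 / speedup


def memory_reduction_factor(percent_of_baseline: float) -> float:
    """How many times less memory is used when a version needs `percent_of_baseline` % of the baseline."""
    return 100.0 / percent_of_baseline


# Figures quoted in the abstract (32 processes, MetaCacheSpark as baseline).
reported = {
    "build": {"speedup": 1.65, "memory_percent": 48.12},
    "query": {"speedup": 3.11, "memory_percent": 55.56},
}

for phase, figures in reported.items():
    t = relative_time(figures["speedup"])
    m = memory_reduction_factor(figures["memory_percent"])
    print(f"{phase}: {t:.0%} of Spark's run time, {m:.2f}x less memory")
```

Read this way, the build phase takes roughly three fifths of the Spark run time with about half the memory, and the query phase takes roughly one third of the Spark run time.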

Subject(s)

Big Data , Genome, Bacterial/genetics , Metagenome/genetics , Metagenomics , Algorithms , Computing Methodologies , DNA/genetics , Software

Very Fast Tree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies.

Piñeiro, César; Abuín, José M; Pichel, Juan C.

Bioinformatics ; 36(17): 4658-4659, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32573652

ABSTRACT

MOTIVATION: FastTree-2 is one of the most successful tools for inferring large phylogenies. With speed at the core of its design, there are still important issues in the FastTree-2 implementation that harm its performance and scalability. To deal with these limitations, we introduce VeryFastTree, a highly tuned implementation of the FastTree-2 tool that takes advantage of parallelization and vectorization strategies to boost performance. RESULTS: VeryFastTree is able to construct a tree on a standard server using double-precision arithmetic from an ultra-large 330k alignment in only 4.5 h, which is 7.8× and 3.5× faster than the sequential and best parallel FastTree-2 times, respectively. AVAILABILITY AND IMPLEMENTATION: VeryFastTree is available at the GitHub repository: https://github.com/citiususc/veryfasttree. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Trees , Algorithms , Computers , Phylogeny , Sequence Alignment

A big data approach to metagenomics for all-food-sequencing.

Kobus, Robin; Abuín, José M; Müller, André; Hellmann, Sören Lukas; Pichel, Juan C; Pena, Tomás F; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.

Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
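
As a rough illustration of what "alignment-free k-mer based" classification means in general, the toy sketch below indexes the k-mers of each reference sequence and assigns a read to the reference that shares the most k-mers with it. This is an assumption-laden teaching example, not the AFS-MetaCache or MetaCacheSpark algorithm: the abstract does not describe their internals, and the sequences, names, and value of k here are invented.

```python
from collections import Counter, defaultdict


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(references: dict, k: int) -> dict:
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, sequence in references.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index


def classify(read: str, index: dict, k: int):
    """Return the reference with the most k-mer hits, or None if nothing matches."""
    hits = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


def quantify(reads, index: dict, k: int) -> Counter:
    """Count how many reads were assigned to each reference (None = unclassified)."""
    return Counter(classify(read, index, k) for read in reads)


# Invented toy data; real reference genomes are many orders of magnitude larger.
K = 5
references = {
    "species_a": "ACGTACGTTAGCCGATAGGCTA",
    "species_b": "TTGACCATGGCAATCGGATTCA",
}
reads = ["CGTTAGCCGA", "CATGGCAATC", "GGGGGGGGGG"]

index = build_index(references, K)
composition = quantify(reads, index, K)
print(composition)
```

No alignment is computed at any point: classification reduces to hash-table lookups, which is why such methods can be much faster than alignment-based pipelines. An index like this also grows with the size of the reference collection, which gives some intuition for why the abstract emphasizes database partitioning.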

Subject(s)

Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Genome, Bacterial , Metagenome , Microbiota/genetics , Software

PASTASpark: multiple sequence alignment meets Big Data.

Abuín, José M; Pena, Tomás F; Pichel, Juan C.

Bioinformatics ; 33(18): 2948-2950, 2017 Sep 15.

Article in English | MEDLINE | ID: mdl-28582480

ABSTRACT

MOTIVATION: One basic step in many bioinformatics analyses is the multiple sequence alignment. One of the state-of-the-art tools to perform multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading but it is limited to processing datasets on shared memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most expensive task in terms of time consumption. RESULTS: Speedups of up to 10× with respect to single-threaded PASTA were observed, which allows an ultra-large dataset of 200 000 sequences to be processed within the 24-h limit. AVAILABILITY AND IMPLEMENTATION: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark. CONTACT: josemanuel.abuin@usc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

PLoS One ; 11(5): e0155461, 2016.

Article in English | MEDLINE | ID: mdl-27182962

ABSTRACT

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

Subject(s)

Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Humans , Reproducibility of Results , Sequence Analysis, DNA/methods , Web Browser , Workflow

BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

Bioinformatics ; 31(24): 4003-5, 2015 Dec 15.

Article in English | MEDLINE | ID: mdl-26323715

ABSTRACT

UNLABELLED: BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Important reductions in the execution times were observed when using this tool. In addition, BigBWA is fault tolerant and it does not require any modification of the original BWA source code. AVAILABILITY AND IMPLEMENTATION: BigBWA is available at the project GitHub repository: https://github.com/citiususc/BigBWA.

Subject(s)

Sequence Alignment/methods , Software , Algorithms , Genomics

Gender-related differences in the burden of non-motor symptoms in Parkinson's disease.

Martinez-Martin, Pablo; Falup Pecurariu, Cristian; Odin, Per; van Hilten, Jacobus J; Antonini, Angelo; Rojo-Abuin, Jose M; Borges, Vanderci; Trenkwalder, Claudia; Aarsland, Dag; Brooks, David J; Ray Chaudhuri, Kallol.

J Neurol ; 259(8): 1639-47, 2012 Aug.

Article in English | MEDLINE | ID: mdl-22237822

ABSTRACT

Differences in the expression of non-motor symptoms (NMS) by Parkinson's disease (PD) patients may have important implications for their management and prognosis. Gender is a basic epidemiological variable that could influence such expression. The present study evaluated the prevalence and severity of NMS by gender in an international sample of 951 PD patients, 62.63% of them male, using the non-motor symptoms scale (NMSS). Assessments for motor impairment and complications, global severity, and health state were also applied. All disease stages were included. No significant gender differences were found for demographic and clinical characteristics. For the entire sample, the most prevalent symptoms were Nocturia (64.88%) and Fatigue (62.78%) and the most prevalent affected domains were Sleep/Fatigue (84.02%) and Miscellaneous (82.44%). Fatigue, feelings of nervousness, feelings of sadness, constipation, restless legs, and pain were more common and severe in women. Conversely, daytime sleepiness, dribbling saliva, interest in sex, and problems having sex were more prevalent and severe in men. Regarding the NMSS domains, Mood/Apathy and Miscellaneous problems (pain, loss of taste or smell, weight change, and excessive sweating) were predominantly affected in women and Sexual dysfunction in men. No other significant differences by gender were observed. To conclude, this study found significant differences between men and women in the prevalence and severity of fatigue, mood, sexual and digestive problems, pain, restless legs, and daytime sleepiness. Gender-related patterns of NMS involvement may be relevant for clinical trials in PD.

Subject(s)

Cost of Illness , Parkinson Disease/diagnosis , Parkinson Disease/physiopathology , Sex Characteristics , Aged , Cross-Sectional Studies , Fatigue/diagnosis , Fatigue/epidemiology , Fatigue/physiopathology , Female , Humans , Longitudinal Studies , Male , Mental Disorders/diagnosis , Mental Disorders/epidemiology , Mental Disorders/physiopathology , Middle Aged , Parkinson Disease/epidemiology

Cost-effectiveness of ambulatory blood pressure monitoring in the follow-up of hypertension.

Rodriguez-Roca, Gustavo C; Alonso-Moreno, Francisco J; Garcia-Jimenez, Almudena; Hidalgo-Vega, Alvaro; Llisterri-Caro, Jose L; Barrios-Alonso, Vivencio; Segura-Fragoso, Antonio; Clemente-Lirola, Elvira; Estepa-Jorge, Susana; Delgado-Cejudo, Yolanda; Lopez-Abuin, Jose M.

Blood Press ; 15(1): 27-36, 2006.

Article in English | MEDLINE | ID: mdl-16492613

ABSTRACT

AIMS: To study the cost of the follow-up of hypertension in primary care (PC) using clinical blood pressure (CBP) and ambulatory blood pressure monitoring (ABPM), and to analyse the cost-effectiveness (CE) of both methods. MAJOR FINDINGS AND PRINCIPAL CONCLUSION: Good control of hypertension was achieved in 8.3% with CBP (95% CI 4.8-11.8) and in 55.6% with ABPM (95% CI 49.3-61.9). The cost of one patient with good control of hypertension is almost four times higher with CBP than with ABPM (Euro 940 vs Euro 238). Reaching the gold standard (ABPM) involved an after-cost of Euro 115 per patient. The results for a 5% discount rate showed a saving of Euro 68,883 if ABPM was performed in all the patients included in the study (n = 241, Euro 285 per patient). A sensitivity analysis varying the discount rate and life expectancy indicated that ABPM provides a better CE ratio and a lower global cost. ABPM is more cost-effective than CBP. However, if we include the new treatment cost of poorly monitored patients, it is less cost-effective. Excellent control of hypertension is still an important challenge for all healthcare professionals, especially for those working in PC, where most monitoring of hypertensive patients takes place.
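
The two headline comparisons in this abstract can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the underlying cost model of the study is not reproduced.

```python
# Figures as reported in the abstract (all amounts in Euro).
cost_per_controlled_cbp = 940    # cost of one well-controlled patient with CBP
cost_per_controlled_abpm = 238   # cost of one well-controlled patient with ABPM
total_saving = 68_883            # saving at a 5% discount rate
patients = 241                   # patients included in the study

# "Almost four times higher with CBP than with ABPM"
ratio = cost_per_controlled_cbp / cost_per_controlled_abpm

# Saving per patient if ABPM were performed in all included patients
saving_per_patient = total_saving / patients

print(f"CBP/ABPM cost ratio: {ratio:.2f}")
print(f"Saving per patient: Euro {saving_per_patient:.2f}")
```

The ratio comes out just under 4, matching the "almost four times" wording, and the per-patient saving falls between Euro 285 and Euro 286, in line with the quoted Euro 285.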

Subject(s)

Blood Pressure Monitoring, Ambulatory/economics , Hypertension/economics , Adolescent , Adult , Aged , Aged, 80 and over , Blood Pressure Monitoring, Ambulatory/trends , Cost-Benefit Analysis/methods , Cross-Sectional Studies , Female , Humans , Male , Middle Aged

Proposals for improvement of emergency rural health care.

Lopez-Abuin, Jose M; Garcia-Criado, Emilio I; Chacon-Manzano, Coral M.

Rural Remote Health ; 5(1): 323, 2005.

Article in English | MEDLINE | ID: mdl-15865472

ABSTRACT

Universal healthcare coverage is a right, and that includes emergency health care. The community expects such requirements to be within their reach, including all human and technological resources necessary for rapid and high-quality health assistance in an emergency. Access to and delivery of emergency care in rural areas is recognized as more difficult than that in urban areas. In this report, following the EURIPA meeting in June 2004, the authors determine the problems of dealing with emergencies in the rural healthcare context, and also make proposals for improvement.

Subject(s)

Emergency Medical Services/standards , Guidelines as Topic , Health Services Accessibility/standards , Rural Health Services/standards , European Union , Humans , Quality Assurance, Health Care
