Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies.
Wheeler, Nicholas R; Benchek, Penelope; Kunkle, Brian W; Hamilton-Nelson, Kara L; Warfe, Mike; Fondran, Jeremy R; Haines, Jonathan L; Bush, William S.
Affiliation
  • Wheeler NR; Cleveland Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Wolstein Research Building, 2103 Cornell Road, Cleveland OH 44106, USA, nrw16@case.edu.
Pac Symp Biocomput; 25: 523-534, 2020.
Article in En | MEDLINE | ID: mdl-31797624
ABSTRACT
Modern genomic studies are rapidly growing in scale, and the analytical approaches applied to genomic data are increasing in complexity. Genomic data management poses logistical and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer's Disease Sequencing Project. With this framework and analyses designed in Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies.
Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Computational Biology / Genomics / High-Throughput Nucleotide Sequencing Type of study: Prognostic_studies Limits: Humans Language: En Journal: Pac Symp Biocomput Journal subject: BIOTECNOLOGIA / INFORMATICA MEDICA Year: 2020 Document type: Article