A hybrid computational strategy to address WGS variant analysis in >5000 samples.
Huang, Zhuoyi; Rustagi, Navin; Veeraraghavan, Narayanan; Carroll, Andrew; Gibbs, Richard; Boerwinkle, Eric; Venkata, Manjunath Gorentla; Yu, Fuli.
Affiliation
  • Huang Z; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  • Rustagi N; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  • Veeraraghavan N; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  • Carroll A; DNAnexus, Mountain View, CA, USA.
  • Gibbs R; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  • Boerwinkle E; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  • Venkata MG; Human Genetics Center, University of Texas Health Science Center, Houston, TX, USA.
  • Yu F; Oak Ridge National Laboratory, Oak Ridge, TN, USA.
BMC Bioinformatics ; 17(1): 361, 2016 Sep 10.
Article in English | MEDLINE | ID: mdl-27612449
BACKGROUND: The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing (WGS) data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling.
RESULTS: We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing infrastructure. We present a novel binning approach for large-scale joint variant calling and imputation that can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of an analysis of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers: SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. For the computation we used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, the IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL). AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred only 6 TB of data in total across the platforms.
CONCLUSIONS: Even as whole genome datasets grow in size, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms, without compromising the quality of the variants.
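The abstract does not describe how the binning approach is implemented. As a rough illustration only, the general idea of splitting a genome into fixed-size regions so that joint calling can run independently per region might be sketched as follows. The function names, the bin size, the toy contig lengths and the round-robin assignment are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, Iterator, List, Tuple

# A bin is (contig, start, end) with 1-based inclusive coordinates.
Bin = Tuple[str, int, int]


def make_bins(contig_lengths: Dict[str, int], bin_size: int) -> Iterator[Bin]:
    """Yield consecutive, non-overlapping bins that cover every contig.

    Hypothetical helper: the paper's actual bin size and boundaries
    are not given in the abstract.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, bin_size):
            # The last bin on a contig is truncated at the contig end.
            yield contig, start, min(start + bin_size - 1, length)


def assign_round_robin(bins: List[Bin], n_workers: int) -> List[List[Bin]]:
    """Distribute bins across workers in round-robin order (an assumed policy)."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    batches: List[List[Bin]] = [[] for _ in range(n_workers)]
    for i, b in enumerate(bins):
        batches[i % n_workers].append(b)
    return batches


if __name__ == "__main__":
    # Toy contig lengths, not real chromosome sizes.
    toy_lengths = {"contigA": 2500, "contigB": 1000}
    bins = list(make_bins(toy_lengths, bin_size=1000))
    for worker_id, batch in enumerate(assign_round_robin(bins, n_workers=2)):
        print(worker_id, batch)
```

Each bin could then be handed to a caller as a region restriction, so that all samples are processed jointly within a bin while different bins run on different nodes or platforms. Whether the authors' framework works this way is not stated in the abstract.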

Full text: 1 Database: MEDLINE Main subject: Genome, Human / Genomics / High-Throughput Nucleotide Sequencing Limit: Humans Language: En Publication year: 2016 Document type: Article
