Cloud-based interactive analytics for terabytes of genomic variants data.

Pan, Cuiping; McInnes, Gregory; Deflaux, Nicole; Snyder, Michael; Bingham, Jonathan; Datta, Somalee; Tsao, Philip S

Affiliations

Pan C; VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA 94304, USA.
McInnes G; Department of Genetics.
Deflaux N; VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA 94304, USA.
Snyder M; Stanford Center for Genomics and Personalized Medicine, Stanford University, CA 94305, USA.
Bingham J; Google, Mountain View, CA 94043, USA.
Datta S; Verily Life Sciences, South San Francisco, CA 94080, USA.
Tsao PS; Department of Genetics.

Bioinformatics; 33(23): 3709-3715, 2017 Dec 01.

Article in English | MEDLINE | ID: mdl-28961771

ABSTRACT

MOTIVATION: Large-scale genomic sequencing is now widely used to answer questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. Given the quantity and diversity of these data, a robust and scalable data handling and analysis solution is needed.

RESULTS: We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate that such Big Data computing paradigms can provide orders-of-magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information.

AVAILABILITY AND IMPLEMENTATION: Our analysis framework is implemented in Google Cloud Platform and BigQuery. Code is available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs.

CONTACT: cuiping@stanford.edu or ptsao@stanford.edu.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
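
To make two of the quality-control quantities named in the abstract concrete, the following is a minimal sketch of how call rate and allele frequency can be computed at a single biallelic site. It is plain Python on invented toy data, not the authors' BigQuery queries; the sample names, the genotype encoding, and the function names `call_rate` and `allele_frequencies` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data (not from the study): genotype calls per sample
# at one biallelic site. 0 = reference allele, 1 = alternate allele,
# None = no-call.
genotypes = {
    "sample_a": (0, 1),
    "sample_b": (1, 1),
    "sample_c": (None, None),
    "sample_d": (0, 0),
}


def call_rate(calls):
    """Return the fraction of samples whose genotype is fully called."""
    called = sum(1 for gt in calls.values() if None not in gt)
    return called / len(calls)


def allele_frequencies(calls):
    """Return each allele's frequency among called alleles only."""
    counts = Counter(a for gt in calls.values() for a in gt if a is not None)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    print(call_rate(genotypes))
    print(allele_frequencies(genotypes))
```

In a columnar database the same aggregation would typically be expressed as a grouped count over variant rows; the sketch only shows the arithmetic involved.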

Subjects

Genetic Variation; Genomics/methods; Data Compression; Databases, Nucleic Acid; Gene Frequency; Genome, Human; Genotype; Humans; Software; Web Browser

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Genetic Variation / Genomics | Limits: Humans | Language: English | Year of publication: 2017 | Document type: Article