Search | BVS Enfermagem Research Portal

GenAp: a distributed SQL interface for genomic data.

Kozanitis, Christos; Patterson, David A.

BMC Bioinformatics ; 17: 63, 2016 Feb 04.

Article in English | MEDLINE | ID: mdl-26846841

ABSTRACT

BACKGROUND: The impressively low cost and improved quality of genome sequencing give researchers of genetic diseases, such as cancer, a powerful tool to better understand the underlying genetic mechanisms of those diseases and to treat them with effective targeted therapies. As a result, a number of projects today sequence the DNA of large patient populations, each of which produces at least hundreds of terabytes of data. The challenge now is to provide the produced data on demand to interested parties. RESULTS: In this paper, we show that the response to this challenge is a modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50× faster than its existing brute-force approach and 8× faster than similar distributed implementations. Thus, Spark SQL can replace existing practices for retrieving genomic data and, as we show, lets users reduce by an order of magnitude the number of lines of code they must write to query such data.

Subjects

Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Algorithms, Chromosome Mapping, Nucleic Acid Databases, Human Genome, Humans, User-Computer Interface
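
The GenAp abstract above contrasts a brute-force join on genomic intervals with a faster one. The following is a minimal single-machine sketch of that contrast, not the paper's distributed algorithm or any Spark SQL API: a nested-loop overlap join next to a sort-and-sweep join. The function names, the half-open `(start, end)` tuple layout, and the assumption that every interval is non-empty are all illustrative choices made here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), assumed non-empty (start < end)
Pair = Tuple[Interval, Interval]


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def brute_force_join(left: List[Interval], right: List[Interval]) -> List[Pair]:
    """Compare every left interval with every right interval."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


def sweep_join(left: List[Interval], right: List[Interval]) -> List[Pair]:
    """Sort both sides by start, then sweep once while tracking open intervals."""
    left_sorted, right_sorted = sorted(left), sorted(right)
    active_left: List[Interval] = []
    active_right: List[Interval] = []
    out: List[Pair] = []
    i = j = 0
    while i < len(left_sorted) or j < len(right_sorted):
        # Pick whichever side has the next-smallest start (left wins ties).
        take_left = j >= len(right_sorted) or (
            i < len(left_sorted) and left_sorted[i][0] <= right_sorted[j][0]
        )
        if take_left:
            cur = left_sorted[i]
            i += 1
            # Drop right intervals that ended at or before the current start;
            # starts only grow, so they cannot overlap anything later.
            active_right = [r for r in active_right if r[1] > cur[0]]
            out.extend((cur, r) for r in active_right)
            active_left.append(cur)
        else:
            cur = right_sorted[j]
            j += 1
            active_left = [l for l in active_left if l[1] > cur[0]]
            out.extend((l, cur) for l in active_left)
            active_right.append(cur)
    return out
```

Both functions return the same set of overlapping pairs, possibly in a different order. The sweep version only compares an interval with those still open at its start position, which is the kind of saving a sorted or indexed interval join aims for.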

Using Genome Query Language to uncover genetic variation.

Kozanitis, Christos; Heiberg, Andrew; Varghese, George; Bafna, Vineet.

Bioinformatics ; 30(1): 1-8, 2014 Jan 01.

Article in English | MEDLINE | ID: mdl-23751181

ABSTRACT

MOTIVATION: With high-throughput DNA sequencing costs dropping below $1000 for human genomes, data storage, retrieval and analysis are the major bottlenecks in biological studies. To address the large-data challenges, we advocate a clean separation between the evidence collection and the inference in variant calling. We define and implement a Genome Query Language (GQL) that allows for the rapid collection of evidence needed for calling variants. RESULTS: We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of high-level code and search large datasets (100 GB) in minutes. We also demonstrate its complementarity with other variant calling tools. Popular variant calling tools can achieve one order of magnitude speed-up by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple datasets. By separating evidence from inference in variant calling, GQL frees variant detection tools from data-intensive evidence collection so that they can focus on statistical inference. AVAILABILITY: GQL can be downloaded from http://cseweb.ucsd.edu/~ckozanit/gql.

Subjects

Genetic Variation, Human Genome, Algorithms, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis, Software

Abstractions for Genomics.

Bafna, Vineet; Kozanitis, Christos; Deutsch, Alin; Ohno-Machado, Lucila; Heiberg, Andrew; Varghese, George.

Commun ACM ; 56(1): 83-93, 2013 Jan.

Article in English | MEDLINE | ID: mdl-25284821

Compressing genomic sequence fragments using SlimGene.

Kozanitis, Christos; Saunders, Chris; Kruglyak, Semyon; Bafna, Vineet; Varghese, George.

J Comput Biol ; 18(3): 401-13, 2011 Mar.

Article in English | MEDLINE | ID: mdl-21385043

ABSTRACT

With the advent of next-generation sequencing technologies, the cost of sequencing whole genomes is poised to go below $1000 per human individual in a few years. As more and more genomes are sequenced, analysis methods are undergoing rapid development, making it tempting to store sequencing data for long periods of time so that the data can be re-analyzed with the latest techniques. The challenging open research problems, huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data. Compression can be achieved at many levels, including trace level (compressing image data), sequence level (compressing a genomic sequence), and fragment level (compressing a set of short, redundant fragment reads, along with quality values on the base calls). We focus on fragment-level compression, which is the pressing need today. Our article makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain-specific lossless compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factor obtained with and without quality values, we initiate the study of using "lossy" quality values. Specifically, we show that a lossy quality value quantization results in 14× compression but has minimal impact on downstream applications like SNP calling that use the quality values. Discrepancies between SNP calls made from the lossy and lossless versions of the data are limited to low-coverage areas where even the SNP calls made from the lossless version are marginal.

Subjects

Algorithms, Data Compression/methods, Genomics/methods, Human Genome, Humans, Single Nucleotide Polymorphism, DNA Sequence Analysis/methods
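
The SlimGene abstract above mentions lossy quantization of quality values. As a rough illustration of the general idea only, the sketch below maps Phred quality scores to the lower edge of fixed-width bins, which shrinks the alphabet a downstream compressor has to encode. The bin width, the function names, and the uniform binning scheme are assumptions made for this example. The abstract does not specify SlimGene's actual quantization, and this is not it.

```python
from typing import List


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def quantize_quality(scores: List[int], bin_width: int = 8) -> List[int]:
    """Lossy step: replace each score with the lower edge of its bin.

    bin_width is a hypothetical tuning knob; larger bins discard more detail.
    """
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")
    return [(q // bin_width) * bin_width for q in scores]


def distinct_symbols(scores: List[int]) -> int:
    """Number of different values left for an entropy coder to represent."""
    return len(set(scores))
```

For example, `quantize_quality([0, 7, 8, 39, 40], bin_width=8)` yields `[0, 0, 8, 32, 40]`: five distinct scores collapse to four, and across a full run of reads the reduction in distinct symbols is what makes the quantized stream easier to compress.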
