A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.
Wang, WeiBo; Sun, Wei; Wang, Wei; Szatkiewicz, Jin.
Affiliation
  • Wang W; Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175, USA.
  • Sun W; Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024, USA.
  • Wang W; Department of Computer Science, University of California, Los Angeles, 580 Portola Plaza, Los Angeles, 90095-1596, USA.
  • Szatkiewicz J; Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, Chapel Hill, 27599-7264, USA. jin_szatkiewicz@med.unc.edu.
BMC Bioinformatics; 19(1): 74, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29490610
ABSTRACT

BACKGROUND:

The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to correct bias and detect signals simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach by detecting copy number variants (CNVs) in a real example.

RESULTS:

We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
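
The idea of estimating GLM+NB coefficients from a sample of windows can be illustrated with a minimal sketch. It is not the authors' RGE or GENSENG implementation: the abstract does not specify the sampling scheme, so uniform sampling without replacement is assumed here, the NB dispersion `alpha` is treated as known and fixed, and the function names (`fit_nb_glm`, `randomized_fit`), the two covariates, and the simulated data are all hypothetical.

```python
import numpy as np


def fit_nb_glm(X, y, alpha=0.1, n_iter=100, tol=1e-8):
    """Fit a log-link NB regression by IRLS, with dispersion alpha held fixed.

    Assumes column 0 of X is an intercept and Var(y) = mu + alpha * mu**2.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the overall mean count
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu / (1.0 + alpha * mu)   # IRLS weights for NB with log link
        z = eta + (y - mu) / mu       # working response
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


def randomized_fit(X, y, fraction=0.1, alpha=0.1, seed=0):
    """Estimate coefficients from a uniform random subset of windows.

    Uniform sampling is an assumption made for this illustration.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(p + 1, int(fraction * n))
    idx = rng.choice(n, size=m, replace=False)
    return fit_nb_glm(X[idx], y[idx], alpha=alpha)


# Simulated windowed read counts with two hypothetical bias covariates.
rng = np.random.default_rng(42)
n_windows = 20000
X = np.column_stack([
    np.ones(n_windows),
    rng.normal(size=n_windows),
    rng.normal(size=n_windows),
])
beta_true = np.array([3.0, 0.5, -0.3])
alpha = 0.1
mu = np.exp(X @ beta_true)
r = 1.0 / alpha
y = rng.negative_binomial(r, r / (r + mu))  # NB counts with mean mu

beta_full = fit_nb_glm(X, y, alpha=alpha)
beta_sub = randomized_fit(X, y, fraction=0.1, alpha=alpha, seed=1)
print("full data :", beta_full)
print("10% sample:", beta_sub)
```

In this toy setting the sampled fit uses one tenth of the windows, and its coefficients can be compared directly with the full-data fit. The speed and accuracy figures reported in the abstract refer to R-GENSENG, not to this sketch.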

CONCLUSIONS:

Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving their analytic power.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Genomics / DNA Copy Number Variations Type of study: Clinical_trials / Diagnostic_studies / Risk_factors_studies Limits: Humans Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2018 Document type: Article Affiliation country: United States