Enhanced regulatory sequence prediction using gapped k-mer features.

Ghandi, Mahmoud; Lee, Dongwon; Mohammad-Noori, Morteza; Beer, Michael A

Affiliation

Ghandi M; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
Lee D; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
Mohammad-Noori M; School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
Beer MA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America.

PLoS Comput Biol; 10(7): e1003711, 2014 Jul.

Article in English | MEDLINE | ID: mdl-25033408

ABSTRACT

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
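The contrast between contiguous and gapped k-mer features can be illustrated with a short Python sketch. It is not the authors' gkm-SVM implementation: it assumes a gapped k-mer is a window of length `l` in which `k` positions are kept and the rest are masked, it enumerates features by brute force rather than with the tree data structure mentioned in the abstract, and the function names `kmer_counts`, `gapped_kmer_counts`, and `gapped_kernel` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k, gap="-"):
    """Count gapped features: each length-l window with k positions kept.

    Assumption for illustration: the remaining l - k positions are
    replaced by a gap symbol, so windows that differ only at masked
    positions map to the same feature.
    """
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for kept in combinations(range(l), k):
            kept_set = set(kept)
            feature = "".join(
                c if j in kept_set else gap for j, c in enumerate(window)
            )
            counts[feature] += 1
    return counts


def gapped_kernel(a, b, l, k):
    """Inner product of the gapped feature count vectors of a and b."""
    ca = gapped_kmer_counts(a, l, k)
    cb = gapped_kmer_counts(b, l, k)
    return sum(v * cb[f] for f, v in ca.items())


if __name__ == "__main__":
    # Two 4-mers that differ at one position share no contiguous 4-mer,
    # but they do share the gapped feature "AC-T".
    shared_contiguous = set(kmer_counts("ACGT", 4)) & set(kmer_counts("ACCT", 4))
    print(len(shared_contiguous))
    print(gapped_kernel("ACGT", "ACCT", l=4, k=3))
```

In this toy case a single mismatch removes all overlap between the contiguous 4-mer features, while the gapped features still register partial similarity, which is the kind of robustness to sparse counts that the abstract describes.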

Subjects

Computational Biology/methods; Models, Genetic; Regulatory Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Base Sequence; Bayes Theorem; Chromatin Immunoprecipitation; Oligonucleotides/genetics; Organ Specificity/genetics; Support Vector Machine

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Regulatory Sequences, Nucleic Acid / Sequence Analysis, DNA / Computational Biology / Models, Genetic | Type of study: Prognostic_studies / Risk_factors_studies | Language: En | Journal: PLoS Comput Biol | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year: 2014 | Document type: Article | Affiliation country: United States
