Efficient motif finding algorithms for large-alphabet inputs.

Kuksa, Pavel P; Pavlovic, Vladimir

Affiliation

Kuksa PP; Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA. pkuksa@cs.rutgers.edu

BMC Bioinformatics ; 11 Suppl 8: S1, 2010 Oct 26.

Article in English | MEDLINE | ID: mdl-21034426

ABSTRACT

BACKGROUND:

We consider the problem of identifying motifs (recurring or conserved patterns) in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.
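To make the problem statement concrete, the following is a minimal brute-force sketch of the planted motif task, assuming that "inexact instance" means a substring within Hamming distance d of a length-l pattern and that the pattern must occur in all input strings. It is a naive baseline for illustration only, not the algorithm proposed in the paper; the function names (`hamming`, `occurs_within`, `neighbors`, `naive_motif_search`) and the parameters `l`, `d`, and `alphabet` are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(pattern, text, d):
    """True if some substring of text is within Hamming distance d of pattern."""
    l = len(pattern)
    return any(hamming(pattern, text[i:i + l]) <= d
               for i in range(len(text) - l + 1))


def neighbors(kmer, d, alphabet):
    """All strings over alphabet within Hamming distance d of kmer."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(kmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def naive_motif_search(sequences, l, d, alphabet="ACGT"):
    """Return every length-l pattern with an instance (<= d mismatches)
    in all sequences. Candidates are drawn from the d-neighborhoods of
    the l-mers of the first sequence (illustrative baseline only)."""
    first = sequences[0]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighbors(first[i:i + l], d, alphabet)
    return sorted(m for m in candidates
                  if all(occurs_within(m, s, d) for s in sequences))


if __name__ == "__main__":
    # Toy input: ACGT is planted exactly in the first string and with
    # one mismatch in each of the other two.
    seqs = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]
    print(naive_motif_search(seqs, l=4, d=1))
```

The neighborhood of a single l-mer contains the sum over k from 0 to d of C(l, k)(|Σ| − 1)^k strings, so this kind of enumeration grows quickly with the alphabet size |Σ|, which is why large-alphabet inputs such as protein sequences are the difficult case.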

RESULTS:

The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem, our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA), we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin and Zinc metallopeptidase, and supersecondary structure motifs for the Cadherin and Immunoglobulin families.

CONCLUSIONS:

Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in the important and difficult cases of large-alphabet sequences.

Subjects

Algorithms; Binding Sites; Computational Biology/methods; Pattern Recognition, Automated/methods; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Amino Acid Motifs; Artificial Intelligence; DNA/chemistry; Databases, Genetic; Nucleic Acid Conformation; Protein Conformation; Software; Transcription Factors/chemistry

Full text: 1 Database: MEDLINE Main subject: Binding Sites / Algorithms / Pattern Recognition, Automated / Sequence Analysis, DNA / Computational Biology / Sequence Analysis, Protein Type of study: Diagnostic_studies / Prognostic_studies Language: En Year of publication: 2010 Document type: Article
