Discrete profile comparison using information bottleneck.

O'Rourke, Sean; Chechik, Gal; Friedman, Robin; Eskin, Eleazar

Affiliation

O'Rourke S; Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Dr., San Diego, CA 92093, USA. seano@cs.ucsd.edu

BMC Bioinformatics; 7 Suppl 1: S8, 2006 Mar 20.

Article in En | MEDLINE | ID: mdl-16723011

ABSTRACT

Sequence homologs are an important source of information about proteins. Amino acid profiles, which represent the position-specific mutation probabilities found in homologs, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck (IB) approach. We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to 78% for the consensus sequence. Distant homolog search with IB sequences is 88% as sensitive as with profiles, compared to 61% with consensus sequences (AUC scores of 0.73 for IB sequences, 0.83 for profiles, and 0.51 for consensus sequences), and, like simple sequence comparison, it is 30 times faster than profile comparison. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.
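The idea of mapping profile columns to a small alphabet while retaining information can be illustrated with a minimal sketch. It is not the authors' procedure: it swaps in a simple KL-divergence k-means style hard clustering in place of a full Information Bottleneck optimization, assumes every profile column is equally weighted, and uses a hypothetical 4-letter toy alphabet rather than the 20 amino acids. The names `discretize` and `fraction_preserved` are illustrative only. The ratio I(T;Y)/I(X;Y) is one plausible reading of "fraction of amino acid occurrence information captured", where X is the profile column, T its discrete letter, and Y the amino acid.

```python
import math
import random


def kl(p, q):
    """KL divergence D(p || q) in bits; infinite if q lacks support where p has it."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def mean(dists):
    """Element-wise average of equally weighted distributions."""
    n = len(dists)
    return [sum(vals) / n for vals in zip(*dists)]


def mutual_information(groups):
    """I(T;Y) in bits, where each non-empty group of columns is one letter T."""
    all_cols = [c for g in groups for c in g]
    marginal = mean(all_cols)
    n = len(all_cols)
    return sum(len(g) / n * kl(mean(g), marginal) for g in groups if g)


def discretize(columns, k, iters=20, seed=0):
    """Assign each column to one of k letters by KL-based hard clustering (assumed stand-in for IB)."""
    rng = random.Random(seed)
    centroids = rng.sample(columns, k)
    labels = [0] * len(columns)
    for _ in range(iters):
        # Assign each column to the centroid it diverges from least.
        labels = [min(range(k), key=lambda t: kl(c, centroids[t])) for c in columns]
        # Recompute each centroid as the mean of its members.
        for t in range(k):
            members = [c for c, lab in zip(columns, labels) if lab == t]
            if members:
                centroids[t] = mean(members)
    return labels


def fraction_preserved(columns, labels, k):
    """Ratio I(T;Y) / I(X;Y); assumes the columns are not all identical."""
    groups = [[c for c, lab in zip(columns, labels) if lab == t] for t in range(k)]
    full = mutual_information([[c] for c in columns])
    return mutual_information(groups) / full


# Hypothetical toy profile: six columns over a 4-letter alphabet.
toy_columns = [
    [0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.2, 0.6],
]
toy_labels = discretize(toy_columns, k=3)
print(toy_labels, fraction_preserved(toy_columns, toy_labels, k=3))
```

Once every column carries a single letter, a profile becomes an ordinary string, which is why ordinary sequence comparison tools can then be applied to it.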

Subjects

Computational Biology/methods; Sequence Analysis, Protein/methods; Algorithms; Amino Acid Sequence; Cluster Analysis; Databases, Protein; Models, Biological; Models, Statistical; Molecular Sequence Data; Probability; Protein Folding; Sequence Alignment; Sequence Homology, Amino Acid; Software

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Computational Biology / Sequence Analysis, Protein Type of study: Prognostic_studies / Risk_factors_studies Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2006 Document type: Article Affiliation country: United States
