A new family of powerful multivariate statistical sequence analysis techniques.
J Mol Biol; 220(4): 877-87, 1991 Aug 20.
Article in English | MEDLINE | ID: mdl-1880802
A novel multivariate statistical approach is presented for extracting and exploiting the intrinsic information present in our ever-growing sequence data banks. The information extraction avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. A typical invariant function of this kind is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach, an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, which already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin, which in turn is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems, such as assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project.
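The 20 x 20 pair histogram mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how pairs are formed or how histograms are compared, so several choices here are assumptions made for illustration only: pairs are taken as residues a fixed `gap` apart (adjacent by default), residues outside the 20 standard one-letter codes are skipped, and two histograms are compared by a plain Euclidean distance between their normalized counts. The function names `pair_histogram` and `histogram_distance` are hypothetical and do not come from the paper.

```python
import math

# The 20 standard amino acids in one-letter code (ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_histogram(sequence, gap=1):
    """Count ordered residue pairs (x, y) where y follows x by `gap` positions.

    Assumption: the abstract only says "amino acid pairs"; adjacent ordered
    pairs are one plausible reading, not necessarily the authors' definition.
    """
    hist = [[0] * 20 for _ in range(20)]
    seq = sequence.upper()
    for first, second in zip(seq, seq[gap:]):
        # Skip pairs containing non-standard symbols such as X or B.
        if first in INDEX and second in INDEX:
            hist[INDEX[first]][INDEX[second]] += 1
    return hist


def _normalized(hist):
    """Flatten a histogram and scale it so the entries sum to 1."""
    flat = [count for row in hist for count in row]
    total = sum(flat)
    return [count / total for count in flat] if total else [0.0] * len(flat)


def histogram_distance(hist_a, hist_b):
    """Euclidean distance between normalized histograms.

    Assumption: this metric is a stand-in chosen for simplicity; the source
    does not state which distance the MSSA framework uses.
    """
    pa, pb = _normalized(hist_a), _normalized(hist_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

Because each sequence is reduced to a fixed-size vector of 400 counts regardless of its length, sequences can be compared without aligning them first, which is the property the abstract emphasizes. For example, `histogram_distance(pair_histogram(seq_a), pair_histogram(seq_b))` gives a single number for any two sequences `seq_a` and `seq_b`.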
Collections: 01-internacional
Database: MEDLINE
Main subject: Multivariate Analysis / Sequence Alignment / Amino Acid Sequence
Type of study: Prognostic_studies
Limits: Animals / Humans
Language: English
Journal: J Mol Biol
Publication year: 1991
Document type: Article
Country of affiliation: Germany