
Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments.

Sierk, Michael L; Smoot, Michael E; Bass, Ellen J; Pearson, William R.

BMC Bioinformatics ; 11: 146, 2010 Mar 22.

Article in English | MEDLINE | ID: mdl-20307279

ABSTRACT

BACKGROUND: While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure - pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.

RESULTS: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.

CONCLUSIONS: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.
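As a rough illustration of two ideas in the abstract above, the sketch below trims a semi-global alignment to the boundaries of a local alignment and scores an edge with a three-variable logistic model. It is not the authors' code. The function names, the representation of an alignment as a list of (i, j) residue-index pairs, and the coefficient tuple are all assumptions made for this example; the abstract reports no coefficient values, so the caller must supply them.

```python
import math

def trim_to_local(semi_global_edges, local_edges):
    """Keep only semi-global edges inside the span of the local alignment.

    Both arguments are lists of (i, j) residue-index pairs (a hypothetical
    representation of edges in the alignment path graph).
    """
    if not local_edges:
        return []
    i_lo = min(i for i, _ in local_edges)
    i_hi = max(i for i, _ in local_edges)
    j_lo = min(j for _, j in local_edges)
    j_hi = max(j for _, j in local_edges)
    return [(i, j) for i, j in semi_global_edges
            if i_lo <= i <= i_hi and j_lo <= j <= j_hi]

def edge_probability(robustness, edge_frequency, max_bits_per_position, coef):
    """Logistic model over the three inputs named in the abstract.

    coef = (intercept, b_robustness, b_frequency, b_bits) is a placeholder;
    real values would have to come from fitting to structural alignments.
    """
    b0, b1, b2, b3 = coef
    z = (b0 + b1 * robustness + b2 * edge_frequency
         + b3 * max_bits_per_position)
    return 1.0 / (1.0 + math.exp(-z))
```

With fitted coefficients, edges whose probability falls below a chosen cutoff could be dropped, which corresponds to the probability-based trimming the abstract mentions as an alternative to boundary trimming.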

Subject(s)

Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein

The limits of protein sequence comparison?

Pearson, William R; Sierk, Michael L.

Curr Opin Struct Biol ; 15(3): 254-60, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15919194

ABSTRACT

Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.

Subject(s)

Algorithms , Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis , Proteins/classification , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Sequence Homology, Amino Acid , Structure-Activity Relationship

Déjà vu all over again: finding and analyzing protein structure similarities.

Sierk, Michael L; Kleywegt, Gerard J.

Structure ; 12(12): 2103-11, 2004 Dec.

Article in English | MEDLINE | ID: mdl-15576025

ABSTRACT

Structure comparison is a crucial aspect of structural biology today. The field is advancing rapidly, with new algorithms, similarity scores, and statistical scores. The predicted large increase of experimental structures and structural models made possible by high-throughput efforts means that structural comparison and searching of structural databases using automated methods will become increasingly common. This Ways & Means article is meant to guide the structural biologist in the basics of structural alignment, and to provide an overview of the available software tools. The main purpose is to encourage users to gain some understanding of the strengths and limitations of structural alignment, and to take these factors into account when interpreting the results of different programs.

Subject(s)

Proteins/genetics , Proteins/physiology , Sequence Analysis, Protein , Sequence Homology , Algorithms , Databases, Protein , Models, Molecular , Protein Structure, Secondary , Sequence Alignment

Sensitivity and selectivity in protein structure comparison.

Sierk, Michael L; Pearson, William R.

Protein Sci ; 13(3): 773-85, 2004 Mar.

Article in English | MEDLINE | ID: mdl-14978311

ABSTRACT

Seven protein structure comparison methods and two sequence comparison programs were evaluated on their ability to detect either protein homologs or domains with the same topology (fold) as defined by the CATH structure database. The structure alignment programs Dali, Structal, Combinatorial Extension (CE), VAST, and Matras were tested along with SGM and PRIDE, which calculate a structural distance between two domains without aligning them. We also tested two sequence alignment programs, SSEARCH and PSI-BLAST. Depending upon the level of selectivity and error model, structure alignment programs can detect roughly twice as many homologous domains in CATH as sequence alignment programs. Dali finds the most homologs, 321-533 of 1120 possible true positives (28.7%-45.7%), at an error rate of 0.1 errors per query (EPQ), whereas PSI-BLAST finds 365 true positives (32.6%), regardless of the error model. At an EPQ of 1.0, Dali finds 42%-70% of possible homologs, whereas Matras finds 49%-57%; PSI-BLAST finds 36.9%. However, Dali achieves >84% coverage before the first error for half of the families tested. Dali and PSI-BLAST find 9.2% and 5.2%, respectively, of the 7056 possible topology pairs at an EPQ of 0.1, and 19.5% and 5.9% at an EPQ of 1.0. Most statistical significance estimates reported by the structural alignment programs overestimate the significance of an alignment by orders of magnitude when compared with the actual distribution of errors. These results help quantify the statistical distinction between analogous and homologous structures, and provide a benchmark for structure comparison statistics.
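The coverage figures above are quoted at a fixed number of errors per query (EPQ). One simple way to compute such a figure is sketched below, assuming that all hits from all queries are pooled into a single ranked list and that the error budget is EPQ times the number of queries. This pooled scheme is an assumption for illustration only; the abstract notes that results depend on the error model, and the function name and input format are hypothetical.

```python
def coverage_at_epq(hits, n_queries, n_possible, epq):
    """Fraction of possible true positives found within an error budget.

    hits: iterable of (score, is_true_positive); higher score = better hit.
    n_queries: number of queries that produced the pooled hits.
    n_possible: total number of true positives that could be found.
    epq: allowed errors per query (e.g. 0.1 or 1.0).
    """
    max_errors = epq * n_queries  # total false positives allowed
    true_pos = 0
    false_pos = 0
    for _, is_tp in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_tp:
            true_pos += 1
        else:
            if false_pos + 1 > max_errors:
                break  # the next error would exceed the budget
            false_pos += 1
    return true_pos / n_possible
```

For example, 365 true positives out of 1120 possible gives 365 / 1120, about 32.6%, matching the PSI-BLAST figure quoted in the abstract.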

Subject(s)

Databases, Protein , Sequence Alignment/methods , Software/standards , Structural Homology, Protein , Computational Biology/methods , Data Interpretation, Statistical , Proteins/chemistry , ROC Curve , Reproducibility of Results , Sequence Alignment/statistics & numerical data , Software/statistics & numerical data , Software Validation
