ABSTRACT
We present the Paragon Algorithm, a novel database search engine for identifying peptides from tandem mass spectrometry data. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree to which an MS/MS spectrum implicates each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than with discrete user-controlled settings that either consider or ignore a feature. Using feature probabilities in conjunction with Sequence Temperature Values allows a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement: control settings are presented in the language of the laboratory and translated to optimal algorithmic settings. To validate the algorithm, we compare it with Mascot in a series of analogous searches that explore the relative impact of increasing search space. With Mascot, the search space was enlarged by relaxing the tryptic digestion conformance requirement from trypsin to semitrypsin to no enzyme; with the Paragon Algorithm, its Rapid mode and Thorough mode were used with and without tryptic specificity. Although the two engines performed similarly for small search space, dramatic differences were observed in large search space. With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2-5 times that of a conventional search of small search space. Despite this large increase in effective search space, there is none of the drastic loss of discrimination that typically accompanies the exploration of large search space.
Subject(s)
Computational Biology/methods; Mass Spectrometry/methods; Proteomics/methods; Algorithms; Amino Acid Sequence; Animals; Cattle; Computers; Humans; Models, Statistical; Molecular Sequence Data; Peptides/chemistry; Probability; Temperature; Trypsin/chemistry
ABSTRACT
We present an MS/MS database search algorithm with two novel features: (1) a protein database structure containing extensive preindexing and (2) zone modification searching, which enables the rapid discovery of protein modifications of both known (i.e., user-specified) and unanticipated delta masses. Both features are implemented in Interrogator, the search engine that runs behind the Pro ID, Pro ICAT, and Pro QUANT software products. Speed benchmarks demonstrate that our modification-tolerant database search algorithm is 100-fold faster than traditional database search algorithms when used for comprehensive searches for a broad variety of modification species. The ability to rapidly search for a large variety of known as well as unanticipated modifications allows a significantly greater percentage of MS/MS scans to be identified. We demonstrate this with an example in which, of 473 identified MS/MS scans, 315 correspond to unmodified peptides and 158 correspond to a wide variety of modified peptides. In addition, we provide specific examples in which the ability to search for unanticipated modifications allows the scientist to discover unexpected modifications that have biological significance; amino acid mutations; salt-adducted peptides in a sample that has nominally been desalted; peptides arising from nontryptic cleavage in a sample that has nominally been digested using trypsin; and other unintended consequences of sample handling procedures.
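The abstract does not say how Interrogator implements zone modification searching, so the following is only a minimal sketch of the general delta-mass idea it refers to, not the authors' method. It compares an observed peptide mass with the theoretical mass of a candidate sequence and labels the difference as unmodified, a known modification, or an unanticipated delta mass. The function names, the short `KNOWN_DELTAS` table, and the 0.02 Da tolerance are all assumptions chosen for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the peptide termini

# Hypothetical, deliberately tiny table of user-specified delta masses (Da).
KNOWN_DELTAS = {
    "oxidation": 15.99491,
    "phosphorylation": 79.96633,
    "sodium adduct": 21.98194,
}


def peptide_mass(seq):
    """Theoretical neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def classify_delta(seq, observed_mass, tol=0.02):
    """Label the gap between observed and theoretical mass.

    Returns (label, delta), where label is "unmodified", the name of a
    known delta mass, or "unanticipated" when no known delta matches.
    """
    delta = observed_mass - peptide_mass(seq)
    if abs(delta) <= tol:
        return ("unmodified", delta)
    for name, mass in KNOWN_DELTAS.items():
        if abs(delta - mass) <= tol:
            return (name, delta)
    return ("unanticipated", delta)
```

For example, `classify_delta("PEPTIDE", peptide_mass("PEPTIDE") + 15.99491)` is labeled as oxidation, while an offset that matches nothing in the table is reported as unanticipated together with its delta mass. A real search engine would also have to localize the delta to a residue or zone using fragment ions, which this sketch does not attempt.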