ABSTRACT
Understanding the molecular mechanisms that govern protein-protein interactions and accurately modeling protein-protein docking both rely on accurate identification and prediction of protein-binding partners and protein-binding residues. We review over 40 methods that predict protein-protein interactions from protein sequences, including methods that predict interacting protein pairs, protein-binding residues for a pair of interacting sequences and protein-binding residues in a single protein chain. We focus on the last group of methods, which provide residue-level annotations and can be broadly applied to all protein sequences. We compare their architectures, inputs and outputs, and we discuss aspects related to their assessment and availability. We also perform a first-of-its-kind comprehensive empirical comparison of representative predictors of protein-binding residues using a novel and high-quality benchmark data set. We show that the selected predictors accurately discriminate protein-binding and non-binding residues and that newer methods outperform older designs. However, these methods are unable to accurately separate residues that bind other molecules, such as DNA, RNA and small ligands, from the protein-binding residues. This cross-prediction, defined as the incorrect prediction of nucleic-acid- and small-ligand-binding residues as protein binding, is substantial for all evaluated methods and is not driven by the proximity to the native protein-binding residues. We discuss reasons for this drawback and offer several recommendations. In particular, we postulate the need for a new generation of more accurate predictors and data sets, inclusion of a comprehensive assessment of the cross-predictions in future studies and higher standards of availability of the published methods.
Subjects
Protein Binding/genetics , Amino Acid Sequence , Binding Sites/genetics , Computational Biology/methods , Databases, Protein/statistics & numerical data , Ligands , Nucleic Acids/metabolism , Protein Interaction Domains and Motifs/genetics , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Sequence Analysis, Protein/statistics & numerical data , Software , Structural Homology, Protein
ABSTRACT
With the advent of high-throughput proteomics, the type and amount of data pose a significant challenge to statistical approaches used to validate current quantitative analysis. Whereas many studies focus on the analysis at the protein level, the analysis of peptide-level data provides insight into changes at the sub-protein level, including splice variants, isoforms and a range of post-translational modifications. Statistical evaluation of liquid chromatography-mass spectrometry/mass spectrometry peptide-based label-free differential data is most commonly performed using a t-test or analysis of variance, often after the application of data imputation to reduce the number of missing values. In high-throughput proteomics, statistical analysis methods and imputation techniques are difficult to evaluate, given the lack of gold standard data sets. Here, we use experimental and resampled data to evaluate the performance of four statistical analysis methods and the added value of imputation, for different numbers of biological replicates. We find that three or four replicates are the minimum requirement for high-throughput data analysis and confident assignment of significant changes. Data imputation does increase sensitivity in some cases, but leads to a much higher actual false discovery rate. Additionally, we find that the empirical Bayes method (limma) achieves the highest sensitivity, and we thus recommend its use for performing differential expression analysis at the peptide level.
Subjects
Peptides/genetics , Peptides/metabolism , Proteomics/methods , Bayes Theorem , Chromatography, Liquid , Computational Biology/methods , Computer Simulation , Data Interpretation, Statistical , Databases, Protein/statistics & numerical data , Humans , Protein Array Analysis/statistics & numerical data , Proteomics/statistics & numerical data , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/statistics & numerical data , Tandem Mass Spectrometry
ABSTRACT
While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments now offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has rarely been evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation of the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired on different instrument types in collision-induced dissociation and higher-energy collisional dissociation (HCD) fragmentation modes using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in running time, with speedups by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. Finally, we discuss the potential for de novo sequencing to become more widely used in the field.
Subjects
Algorithms , Proteomics/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Computational Biology/methods , Computer Simulation , Databases, Protein/statistics & numerical data , Humans , Mice , Peptides/chemistry , Proteomics/statistics & numerical data , Pyrococcus furiosus/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, Protein/statistics & numerical data , Sequence Tagged Sites , Software , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data
ABSTRACT
Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.
Subjects
Binding Sites/physiology , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Models, Molecular , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
ABSTRACT
Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 14% increase in the number of significant BLAST hits and a 25% increase in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to standard Pfam domains in these respects. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence.
Subjects
Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology , Databases, Protein , Plasmodium falciparum/chemistry , Plasmodium falciparum/genetics , Protein Domains , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
ABSTRACT
We generalize chaos game representation (CGR) to higher-dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove the asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity of sequences that share identical subsequences at distinct positions is lowered exponentially with the length of the identical subsequence; this effect has been at work without researchers being aware of it. By highlighting it, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF and 50 beta-globin proteins. The results are consistent with previous studies, which supports the significance of the method. Finally, using support vector machines, we train an anticancer peptide prediction model with both GCGR-Centroid and GCGR-Variance and achieve significantly higher prediction performance on 3 well-studied anticancer peptide datasets.
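The contraction behind this kind of exponential decay can be illustrated with the classic two-dimensional CGR for nucleotide sequences. The sketch below is not the GCGR method of the abstract; the corner assignment, the starting point and the function names are assumptions chosen for illustration only.

```python
# Classic 2D chaos game representation (CGR) for nucleotide sequences.
# The corner assignment below is one common convention, assumed here;
# it is not taken from the abstract.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return one CGR point per symbol of seq.

    Each step moves the current point halfway toward the corner
    assigned to the next symbol.
    """
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def final_distance(seq_a, seq_b):
    """Largest coordinate difference between the last CGR points."""
    xa, ya = cgr_points(seq_a)[-1]
    xb, yb = cgr_points(seq_b)[-1]
    return max(abs(xa - xb), abs(ya - yb))


if __name__ == "__main__":
    # The two sequences differ at the start but share the suffix "TTGCA".
    print(final_distance("ACGTTGCA", "TTTTTGCA"))
```

Because every step halves the distance to the chosen corner, two sequences that share their last k symbols end within 2^-k of each other in each coordinate, whatever came before the shared suffix.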
Subjects
Game Theory , Tumor Suppressor Proteins/genetics , Amino Acid Sequence , Animals , Base Sequence , Computational Biology , Databases, Protein/statistics & numerical data , Electron Transport Complex I/genetics , Humans , Mathematical Concepts , Mitochondrial Proteins/genetics , Models, Biological , NADH Dehydrogenase/genetics , Nonlinear Dynamics , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Sequence Homology, Amino Acid , Support Vector Machine , Transferrin/genetics , Tumor Suppressor Proteins/classification , Tumor Suppressor Proteins/physiology , beta-Globins/genetics
ABSTRACT
De novo sequencing offers an alternative to database search methods for peptide identification from mass spectra. Since it does not rely on a predetermined database of expected or potential sequences in the sample, de novo sequencing is particularly appropriate for samples lacking a well-defined or comprehensive reference database. However, the low accuracy of many de novo sequence predictions has prevented the widespread use of the variety of sequencing tools currently available. Here, we present a new open-source tool, Postnovo, that postprocesses de novo sequence predictions to find high-accuracy results. Postnovo uses a predictive model to rescore and rerank candidate sequences in a manner akin to database search postprocessing tools such as Percolator. Postnovo leverages the output from multiple de novo sequencing tools in its own analyses, producing many times the length of amino acid sequence information (including both full- and partial-length peptide sequences) at an equivalent false discovery rate (FDR) compared to any individual tool. We present a methodology to reliably screen the sequence predictions to a desired FDR given the Postnovo sequence score. We validate Postnovo with multiple data sets and demonstrate its ability to identify proteins that are missed by database search even in samples with paired reference databases.
Subjects
Algorithms , Peptides/isolation & purification , Proteins/chemistry , Sequence Analysis, Protein/statistics & numerical data , Software , Animals , Bacillus subtilis/chemistry , Bees/chemistry , Desulfovibrio vulgaris/chemistry , Drosophila melanogaster/chemistry , Embryo, Nonmammalian/chemistry , Escherichia coli K12/chemistry , Humans , Solanum lycopersicum/chemistry , Methanosarcina/chemistry , Mice , Peptides/chemistry , Peptides/classification , Proteolysis , Rhodopseudomonas/chemistry , Synechococcus/chemistry
ABSTRACT
Over the past decade, a suite of new mass-spectrometry-based proteomics methods has been developed that now enables the conformational properties of proteins and protein-ligand complexes to be studied in complex biological mixtures, from cell lysates to intact cells. Highlighted here are seven of the techniques in this new toolbox. These techniques include chemical cross-linking (XL-MS), hydroxyl radical footprinting (HRF), Drug Affinity Responsive Target Stability (DARTS), Limited Proteolysis (LiP), Pulse Proteolysis (PP), Stability of Proteins from Rates of Oxidation (SPROX), and Thermal Proteome Profiling (TPP). The above techniques all rely on conventional bottom-up proteomics strategies for peptide sequencing and protein identification. However, they have required the development of unconventional proteomic data analysis strategies. Discussed here are the current technical challenges associated with these different data analysis strategies as well as the relative analytical capabilities of the different techniques. The new biophysical capabilities that the above techniques bring to bear on proteomic research are also highlighted in the context of several different application areas in which these techniques have been used, including the study of protein ligand binding interactions (e.g., protein target discovery studies and protein interaction network analyses) and the characterization of biological states.
Subjects
Mass Spectrometry/methods , Protein Processing, Post-Translational , Proteins/chemistry , Proteome/chemistry , Proteomics/trends , Animals , Cross-Linking Reagents/chemistry , Databases, Protein , Deuterium Exchange Measurement/methods , Humans , Isotope Labeling/methods , Ligands , Mass Spectrometry/instrumentation , Protein Binding , Protein Folding , Protein Stability , Proteins/metabolism , Proteins/ultrastructure , Proteolysis , Proteome/ultrastructure , Proteomics/instrumentation , Proteomics/methods , Sequence Analysis, Protein/instrumentation , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/statistics & numerical data , Thermodynamics
ABSTRACT
Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both "smooth" conformational changes and "catastrophic" conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence-structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.
Subjects
Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Evolution, Molecular , Models, Genetic , Models, Molecular , Protein Conformation , Protein Structural Elements/genetics , Proteins/metabolism , Sequence Analysis, Protein/statistics & numerical data
ABSTRACT
The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow databases of reduced complexity to rapidly perform homology extension. This server also gives the possibility to use transmembrane proteins (TMPs) reference databases to allow even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown this approach outperforms the most accurate alignment methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee.
Subjects
Algorithms , Membrane Proteins/chemistry , Sequence Analysis, Protein/statistics & numerical data , User-Computer Interface , Amino Acid Sequence , Computer Graphics , Databases, Protein , Information Storage and Retrieval , Internet , Membrane Proteins/genetics , Protein Domains , Protein Structure, Secondary , Sequence Alignment , Sequence Homology, Amino Acid
ABSTRACT
The MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de) is an open, interactive web service for comprehensive and collaborative protein bioinformatic analysis. It offers a wide array of interconnected, state-of-the-art bioinformatics tools to experts and non-experts alike, developed both externally (e.g. BLAST+, HMMER3, MUSCLE) and internally (e.g. HHpred, HHblits, PCOILS). While a beta version of the Toolkit was released 10 years ago, the current production-level release has been available since 2008 and has serviced more than 1.6 million external user queries. The usage of the Toolkit has continued to increase linearly over the years, reaching more than 400 000 queries in 2015. In fact, through the breadth of its tools and their tight interconnection, the Toolkit has become an excellent platform for experimental scientists as well as a useful resource for teaching bioinformatic inquiry to students in the life sciences. In this article, we report on the evolution of the Toolkit over the last ten years, focusing on the expansion of the tool repertoire (e.g. CS-BLAST, HHblits) and on infrastructural work needed to remain operative in a changing web environment.
Subjects
Computational Biology/methods , Internet , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Computational Biology/education , Computational Biology/trends , Molecular Sequence Annotation , Protein Domains , Proteins/classification , Sequence Analysis, Protein/statistics & numerical data , Sequence Analysis, Protein/trends , Software/trends , Teaching
ABSTRACT
Protein sequence data arise more and more often in vaccine and infectious disease research. These types of data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance-covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of testing, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use these proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.
Subjects
Logistic Models , Sequence Analysis, Protein/statistics & numerical data , AIDS Vaccines/genetics , AIDS Vaccines/immunology , Antibodies, Neutralizing/immunology , Biostatistics , Computer Simulation , HIV Antibodies/immunology , HIV-1/genetics , HIV-1/immunology , Humans , Immunoglobulin A/immunology , Immunoglobulin G/immunology , Markov Chains , Models, Statistical , env Gene Products, Human Immunodeficiency Virus/genetics , env Gene Products, Human Immunodeficiency Virus/immunology
ABSTRACT
De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, a more recently emerged top-down technology, now gaining more and more popularity, opens new perspectives for protein analysis and characterization, implying a need for efficient algorithms to process this kind of MS/MS data. Here, we describe a method that allows for the retrieval, from a set of top-down MS/MS spectra, of long and accurate sequence fragments of the proteins contained in the sample. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting to the case of tags the notion of an A-Bruijn graph widely used in genomics. The output of the proposed approach represents the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down data sets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.
Subjects
Algorithms , Peptides/isolation & purification , Proteomics/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Tandem Mass Spectrometry/statistics & numerical data , Alemtuzumab , Amino Acid Sequence , Animals , Antibodies, Monoclonal, Humanized/chemistry , Carbonic Anhydrase II/chemistry , Cattle , Databases, Protein , Humans , Immunoglobulin Fab Fragments/chemistry , Molecular Sequence Data , Peptides/chemistry , Proteomics/methods , Staining and Labeling/methods
ABSTRACT
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
Subjects
Algorithms , Computational Biology/methods , Peptide Fragments/analysis , Proteins/analysis , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Armoracia/genetics , Cattle , Computational Biology/statistics & numerical data , Escherichia coli/genetics , Horses/genetics , Humans , Mice , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Sequence Analysis, Protein/statistics & numerical data , Tandem Mass Spectrometry/standards
ABSTRACT
The goal of dimensionality reduction is to embed high-dimensional data in a low-dimensional space while preserving structure in the data relevant to exploratory data analysis such as clusters. However, existing dimensionality reduction methods often either fail to separate clusters due to the crowding problem or can only separate clusters at a single resolution. We develop a new approach to dimensionality reduction: tree preserving embedding. Our approach uses the topological notion of connectedness to separate clusters at all resolutions. We provide a formal guarantee of cluster separation for our approach that holds for finite samples. Our approach requires no parameters and can handle general types of data, making it easy to use in practice and suggesting new strategies for robust data visualization.
Subjects
Data Interpretation, Statistical , Algorithms , Cluster Analysis , Handwriting , Models, Statistical , Radar , Sequence Analysis, Protein/statistics & numerical data
ABSTRACT
Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single-amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. To include them in database searches, however, may not improve overall peptide identification because of the possible sensitivity reduction from search space expansion. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared to those of search strategies that assign equal Bonferroni correction factors to all qualified peptides.
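The tiering idea can be sketched as follows. This is a minimal illustration, not the scheme from the paper: the cleavage rule (after K or R, ignoring the proline exception), the function names and the use of a per-tier count of qualified peptides as the Bonferroni factor are all assumptions made for the example.

```python
# Hypothetical sketch: classify a candidate peptide by how many of its
# termini are consistent with trypsin cleavage, then apply a tier-specific
# Bonferroni factor to its p-value.
TRYPSIN_SITES = "KR"  # simplified rule; the proline exception is ignored


def tier(protein, start, end):
    """Classify protein[start:end] as tryptic, semitryptic or nontryptic."""
    # N-terminus is consistent if it is the protein start or follows K/R.
    left_ok = start == 0 or protein[start - 1] in TRYPSIN_SITES
    # C-terminus is consistent if it is the protein end or ends in K/R.
    right_ok = end == len(protein) or protein[end - 1] in TRYPSIN_SITES
    consistent_termini = int(left_ok) + int(right_ok)
    return ("nontryptic", "semitryptic", "tryptic")[consistent_termini]


def corrected_e_value(p_value, tier_name, tier_factors):
    """Multiply the p-value by the correction factor of the peptide's tier.

    tier_factors maps tier name to a correction factor, assumed here to
    be the number of qualified peptides in that tier.
    """
    return p_value * tier_factors[tier_name]


if __name__ == "__main__":
    protein = "MKAAARGGGK"
    factors = {"tryptic": 1000, "semitryptic": 20000, "nontryptic": 300000}
    for start, end in [(2, 6), (3, 6), (3, 5)]:
        name = tier(protein, start, end)
        print(protein[start:end], name, corrected_e_value(1e-6, name, factors))
```

With a single correction factor for all qualified peptides, a tryptic match would be penalized by the much larger nontryptic search space; separate factors keep the penalty proportional to each tier's own size.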
Subjects
Algorithms , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Peptide Fragments/isolation & purification , Sequence Analysis, Protein/statistics & numerical data , Animals , Humans , Proteolysis , Proteomics , Sensitivity and Specificity , Tandem Mass Spectrometry , Trypsin/chemistry
ABSTRACT
BACKGROUND: Protein structures are comprised of modular elements known as domains. These units are used and re-used over and over in nature, and usually serve some particular function in the structure. Thus it is useful to be able to break up a protein of interest into its component domains, prior to similarity searching for example. Numerous computational methods exist for doing so, but most operate only on a single protein chain and many are limited to making a series of cuts to the sequence, while domains can and do span multiple chains. RESULTS: This study presents a novel clustering-based approach to domain identification, which works equally well on individual chains or entire complexes. The method is simple and fast, taking only a few milliseconds to run, and works by clustering either vectors representing secondary structure elements, or buried alpha-carbon positions, using average-linkage clustering. Each resulting cluster corresponds to a domain of the structure. The method is competitive with others, achieving 70% agreement with SCOP on a large non-redundant data set, and 80% on a set more heavily weighted in multi-domain proteins on which both SCOP and CATH agree. CONCLUSIONS: It is encouraging that a basic method such as this performs nearly as well or better than some far more complex approaches. This suggests that protein domains are indeed for the most part simply compact regions of structure with a higher density of buried contacts within themselves than between each other. By representing the structure as a set of points or vectors in space, it allows us to break free of any artificial limitations that other approaches may depend upon.
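Average-linkage clustering of points in space, as described in the RESULTS, can be sketched in a few lines. This is an illustrative toy, not the published implementation: the distance cutoff used as a stopping rule, the function name and the example coordinates are assumptions.

```python
import math
from itertools import combinations


def average_linkage(points, cutoff):
    """Agglomerative clustering with average linkage.

    points: list of coordinate tuples (e.g. alpha-carbon positions).
    cutoff: stop merging once the closest pair of clusters is farther
            apart than this average distance (an assumed stopping rule).
    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg_dist(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


if __name__ == "__main__":
    coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (20, 0, 0), (21, 0, 0)]
    print(average_linkage(coords, cutoff=5.0))
```

Each resulting cluster would correspond to one putative domain. Since the points carry no chain information, the same procedure applies unchanged to a single chain or to a whole complex.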
Subjects
Algorithms, Computational Biology/methods, Tertiary Protein Structure, Proteins/chemistry, Protein Sequence Analysis/methods, Protein Sequence Analysis/statistics & numerical data, Cluster Analysis, Secondary Protein Structure
ABSTRACT
BACKGROUND: Molecular recognition in proteins occurs due to appropriate arrangements of physical, chemical, and geometric properties of an atomic surface. Similar surface regions should create similar binding interfaces. Effective methods for comparing surface regions can be used to identify similar regions and to predict interactions without regard to the underlying structural scaffold that creates the surface. RESULTS: We present a new descriptor for protein functional surfaces, together with algorithms that use these descriptors to compare protein surface regions and identify ligand binding interfaces. The approach uses descriptors of local regions of the surface and assembles collections of matches to compare larger regions. It draws on a variety of physical, chemical, and geometric properties, adaptively weighting them as appropriate for different regions of the interface, and builds a classifier from a training corpus of example binding sites of the target ligand. The constructed classifiers can be applied to a query protein to provide, for each position on the protein, a probability that the position is part of a binding interface. We demonstrate the effectiveness of the approach on a number of benchmarks, where its performance is comparable to the state of the art while the approach is more general than prior methods. CONCLUSIONS: Local functional descriptors offer a new method for protein surface comparison that is sufficiently flexible to serve in a variety of applications.
Subjects
Binding Sites, Proteins/metabolism, Protein Sequence Analysis/methods, Protein Sequence Analysis/statistics & numerical data, Algorithms, Ligands, Molecular Models, Tertiary Protein Structure, Proteins/chemistry, Software
ABSTRACT
BACKGROUND: Methods of weakening and attenuating pathogens' abilities to infect and propagate in a host, thus allowing the natural immune system to more easily decimate invaders, have gained attention as alternatives to broad-spectrum targeting approaches. This work describes a technique for identifying proteins involved in virulence that relies on latent information computationally gathered across biological repositories and is applicable to both generic and specific virulence categories. RESULTS: A lightweight method for data integration is used, which links information regarding a protein via a path-based query graph. A weighting method is then applied to the query graphs so that they can serve as input to various statistical classification methods for discrimination, and the combined use of data integration and learning methods is tested on both generalized and specific virulence function prediction. CONCLUSIONS: This approach improves coverage of functional data for a protein. Moreover, although it depends largely on noisy and potentially non-curated data from public sources, we find that it outperforms other techniques for identifying general virulence factors and baseline remote homology detection methods for specific virulence categories.
Subjects
Proteins/classification, Protein Sequence Analysis/methods, Protein Sequence Analysis/statistics & numerical data, Virulence Factors/classification, Statistical Data Interpretation, Protein Databases, Proteins/chemistry, Virulence, Virulence Factors/chemistry
ABSTRACT
BACKGROUND: Influenza is one of the oldest and deadliest infectious diseases known to man. Reassorted strains of the virus pose the greatest risk to both human and animal health and have been associated with all pandemics of the past century, with the possible exception of the 1918 pandemic, resulting in tens of millions of deaths. We have developed and tested new computer algorithms, FluShuffle and FluResort, which enable reassorted viruses to be identified by the most rapid and direct means possible. These algorithms enable reassorted influenza viruses, and other viruses, to be rapidly identified so that prevention strategies and treatments can be implemented more efficiently. RESULTS: The FluShuffle and FluResort algorithms were tested with both experimental and simulated mass spectra of whole virus digests. FluShuffle considers different combinations of viral protein identities that match the mass spectral data using a Gibbs sampling algorithm employing a mixed protein Markov chain Monte Carlo (MCMC) method. FluResort utilizes those identities to calculate the weighted distance of each across two or more different phylogenetic trees constructed through viral protein sequence alignments. Each weighted mean distance value is normalized by conversion to a Z-score to establish whether a strain is reassorted. CONCLUSIONS: The new FluShuffle and FluResort algorithms can correctly identify the origins of influenza viral proteins and the number of reassortment events required to produce the strains from the high-resolution mass spectral data of whole virus proteolytic digestions. This has been demonstrated in the case of constructed vaccine strains as well as common human seasonal strains of the virus. The algorithms significantly improve the capability of the proteotyping approach to identify reassorted viruses that pose the greatest pandemic risk.
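The Z-score normalization mentioned in this abstract can be illustrated with a small sketch. It does not reproduce FluResort: the distance values, the use of the population standard deviation, the threshold of 1.5, and the names `z_scores` and `flag_outliers` are assumptions chosen only to show how normalized distances might single out a protein whose phylogenetic placement differs from the rest.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw values to Z-scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no value stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def flag_outliers(distances, threshold):
    """Return the names whose Z-score exceeds `threshold` (hypothetical cutoff)."""
    names = list(distances)
    scores = z_scores([distances[n] for n in names])
    return [n for n, z in zip(names, scores) if z > threshold]


# Made-up weighted mean tree distances for five influenza proteins.
weighted_distances = {"HA": 0.9, "NA": 0.1, "M1": 0.1, "NP": 0.1, "NS1": 0.1}
scores = z_scores(list(weighted_distances.values()))
flagged = flag_outliers(weighted_distances, threshold=1.5)
print(flagged)
```

In this toy input only one protein lies far from the others, so it alone exceeds the cutoff; a real analysis would need a threshold justified by the data rather than the arbitrary value used here.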