ABSTRACT
Recent years have witnessed increased interest in intrinsically disordered proteins and regions. These proteins and regions are abundant and possess unique structural features and a broad functional repertoire that complements ordered proteins. However, modern studies on the abundance and functions of intrinsically disordered proteins and regions are relatively limited in size and scope of their analysis. To fill this gap, we performed a broad and detailed computational analysis of over 6 million proteins from 59 archaeal, 471 bacterial, 110 eukaryotic and 325 viral proteomes. We used arguably more accurate consensus-based disorder predictions, and for the first time comprehensively characterized intrinsic disorder at proteomic and protein levels from all significant perspectives, including abundance, cellular localization, functional roles, evolution, and impact on structural coverage. We show that intrinsic disorder is more abundant and has a unique profile in eukaryotes. We map disorder into archaeal, bacterial and eukaryotic cells, and demonstrate that it is preferentially located in some cellular compartments. Functional analysis that considers over 1,200 annotations shows that certain functions are exclusively implemented by intrinsically disordered proteins and regions, and that some of them are specific to certain domains of life. We reveal that disordered regions are often targets for various post-translational modifications, but primarily in eukaryotes and viruses. Using a phylogenetic tree for 14 eukaryotic and 112 bacterial species, we analyzed relations between disorder, sequence conservation and evolutionary speed. We provide a complete analysis that clearly shows that intrinsic disorder is exceptionally and uniquely abundant in each domain of life.
Subject(s)
Computational Biology , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Protein Folding , Proteome/analysis , Algorithms , Animals , Archaea/metabolism , Bacteria/metabolism , Databases, Protein , Eukaryota/metabolism , Evolution, Molecular , Humans , Models, Molecular , Viruses/metabolism
ABSTRACT
Cyclic proteins (CPs) have circular chains with a continuous cycle of peptide bonds. Their unique structural traits result in greater stability and resistance to degradation when compared to their acyclic counterparts. They are also promising targets for pharmaceutical/therapeutic applications. To date, only a few hundred CPs are known, although recent studies suggest that their numbers might be substantially higher. Here we developed a first-of-its-kind, accurate and high-throughput method called CyPred that predicts whether a given protein chain is cyclic. CyPred considers currently well-represented CP families: cyclotides, cyclic defensins, bacteriocins, and trypsin inhibitors. Empirical tests demonstrate that CyPred outperforms commonly used alignment methods. We used CyPred to estimate the incidence of CPs and found ~3500 putative CPs among 5.7+ million chains from 642 fully sequenced proteomes from archaea, bacteria, and eukaryotes. The median number of putative CPs per species ranges from three for archaea proteomes to two for eukaryotes/bacteria, with 7% of archaea, 11% of bacterial, and 16% of eukaryotic proteomes having 10+ CPs. The differences in the estimated fractions of CPs per proteome are as large as three orders of magnitude. Among eukaryotes, animals have higher ratios of CPs compared to fungi, while plants have the largest spread of the ratios. We also show that proteomes enriched in cyclic proteins evolve more slowly than proteomes with fewer cyclic chains. Our results suggest that further research is needed to fully uncover the scope and potential of cyclic proteins. A list of putative CPs and the CyPred method are available at http://biomine.ece.ualberta.ca/CyPred/. This article is part of a Special Issue entitled: Computational Proteomics, Systems Biology & Clinical Implications. Guest Editor: Yudong Cai.
Subject(s)
Computational Biology/methods , Databases, Protein , Proteins/chemistry , Amino Acid Sequence , Animals , Archaea/chemistry , Protein Structure, Tertiary , Sequence Alignment
ABSTRACT
MOTIVATION: Off-target interactions of a popular immunosuppressant Cyclosporine A (CSA) with several proteins besides its molecular target, cyclophilin A, are implicated in the activation of signaling pathways that lead to numerous side effects of this drug. RESULTS: Using structural human proteome and a novel algorithm for inverse ligand binding prediction, ILbind, we determined a comprehensive set of 100+ putative partners of CSA. We empirically show that predictive quality of ILbind is better compared with other available predictors for this compound. We linked the putative target proteins, which include many new partners of CSA, with cellular functions, canonical pathways and toxicities that are typical for patients who take this drug. We used complementary approaches (molecular docking, molecular dynamics, surface plasmon resonance binding analysis and enzymatic assays) to validate and characterize three novel CSA targets: calpain 2, caspase 3 and p38 MAP kinase 14. The three targets are involved in the apoptotic pathways, are interconnected and are implicated in nephrotoxicity.
Subject(s)
Cyclosporine/chemistry , Immunosuppressive Agents/chemistry , Proteomics/methods , Algorithms , Calpain/chemistry , Calpain/metabolism , Caspase 3/chemistry , Caspase 3/metabolism , Cyclosporine/metabolism , Humans , Immunosuppressive Agents/metabolism , Mitogen-Activated Protein Kinase 14/chemistry , Mitogen-Activated Protein Kinase 14/metabolism , Molecular Docking Simulation , Proteome/chemistry , Signal Transduction , Surface Plasmon Resonance
ABSTRACT
Intrinsic disorder (i.e., lack of a unique 3-D structure) is a common phenomenon, and many biologically active proteins are disordered as a whole, or contain long disordered regions. These intrinsically disordered proteins/regions constitute a significant part of all proteomes, and their functional repertoire is complementary to functions of ordered proteins. In fact, intrinsic disorder represents an important driving force for many specific functions. An illustrative example of such disorder-centric functional class is RNA-binding proteins. In this study, we present the results of comprehensive bioinformatics analyses of the abundance and roles of intrinsic disorder in 3,411 ribosomal proteins from 32 species. We show that many ribosomal proteins are intrinsically disordered or hybrid proteins that contain ordered and disordered domains. Predicted globular domains of many ribosomal proteins contain noticeable regions of intrinsic disorder. We also show that disorder in ribosomal proteins has different characteristics compared to other proteins that interact with RNA and DNA including overall abundance, evolutionary conservation, and involvement in protein-protein interactions. Furthermore, intrinsic disorder is not only abundant in the ribosomal proteins, but we demonstrate that it is absolutely necessary for their various functions.
Subject(s)
Intrinsically Disordered Proteins/genetics , Intrinsically Disordered Proteins/metabolism , Models, Molecular , Protein Conformation , RNA-Binding Proteins/metabolism , Ribosomal Proteins/metabolism , Ribosomes/metabolism , Amino Acids/analysis , Archaea/genetics , Bacteria/genetics , Computational Biology , Conserved Sequence/genetics , Databases, Protein , Eukaryota/genetics , Evolution, Molecular , Protein Structure, Tertiary/genetics , RNA-Binding Proteins/genetics , Ribosomal Proteins/genetics , Species Specificity
ABSTRACT
We present the Database of Disordered Protein Prediction (D(2)P(2)), available at http://d2p2.pro (including website source code). A battery of disorder predictors and their variants, VL-XT, VSL2b, PrDOS, PV2, Espritz and IUPred, were run on all protein sequences from 1765 complete proteomes (to be updated as more genomes are completed). Integrated with these results are all of the predicted (mostly structured) SCOP domains using the SUPERFAMILY predictor. These disorder/structure annotations together enable comparison of the disorder predictors with each other and examination of the overlap between disordered predictions and SCOP domains on a large scale. D(2)P(2) will increase our understanding of the interplay between disorder and structure, the genomic distribution of disorder, and its evolutionary history. The parsed data are made available in a unified format for download as flat files or SQL tables either by genome, by predictor, or for the complete set. An interactive website provides a graphical view of each protein annotated with the SCOP domains and disordered regions from all predictors overlaid (or shown as a consensus). There are statistics and tools for browsing and comparing genomes and their disorder within the context of their position on the tree of life.
Subject(s)
Databases, Protein , Protein Conformation , Genome , Internet , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein
ABSTRACT
Recent research on protein intrinsic disorder was stimulated by the availability of accurate computational predictors. However, most of these methods are relatively slow, especially considering proteome-scale applications, and were shown to produce relatively large errors when estimating disorder at the protein- (in contrast to residue-) level, which is defined by the fraction/content of disordered residues. To this end, we propose a novel support vector Regression-based Accurate Predictor of Intrinsic Disorder (RAPID). Key advantages of RAPID are speed (prediction of an average-size eukaryotic proteome takes <1 h on a modern desktop computer); sophisticated design (multiple, complementary information sources that are aggregated over an input chain are combined using feature selection); and high-quality and robust predictive performance. Empirical tests on two diverse benchmark datasets reveal that RAPID's predictive performance compares favorably to a comprehensive set of state-of-the-art disorder and disorder content predictors. Drawing on high speed and good predictive quality, RAPID was used to perform large-scale characterization of disorder in 200+ fully sequenced eukaryotic proteomes. Our analysis reveals interesting relations of disorder with structural coverage and chain length, and an unusual distribution of fully disordered chains. We also performed a comprehensive (using 56000+ annotated chains, which doubles the scope of previous studies) investigation of cellular functions and localizations that are enriched in disorder in the human proteome. RAPID, which allows for batch (proteome-wide) predictions, is available as a web server at http://biomine.ece.ualberta.ca/RAPID/.
Subject(s)
Computational Biology , Proteins/chemistry , Proteomics , Software , Databases, Protein , Humans , Sequence Alignment
ABSTRACT
Proteins with long disordered regions (LDRs), defined as having 30 or more consecutive disordered residues, are abundant in eukaryotes, and these regions are recognized as a distinct class of biologically functional domains. LDRs facilitate various cellular functions and are important for target selection in structural genomics. Motivated by the lack of methods that directly predict proteins with LDRs, we designed Super-fast predictor of proteins with Long Intrinsically DisordERed regions (SLIDER). SLIDER utilizes logistic regression that takes an empirically chosen set of numerical features, which consider selected physicochemical properties of amino acids, sequence complexity, and amino acid composition, as its inputs. Empirical tests show that SLIDER offers competitive predictive performance combined with low computational cost. It outperforms, by at least a modest margin, a comprehensive set of modern disorder predictors (that can indirectly predict LDRs) and is 16 times faster compared to the best currently available disorder predictor. Utilizing our time-efficient predictor, we characterized abundance and functional roles of proteins with LDRs over 110 eukaryotic proteomes. Similar to related studies, we found that eukaryotes have many (on average 30.3%) proteins with LDRs, with the majority of proteomes having between 25 and 40%, where higher abundance is characteristic of proteomes that have larger proteins. Our first-of-its-kind large-scale functional analysis shows that these proteins are enriched in a number of cellular functions and processes including certain binding events, regulation of catalytic activities, cellular component organization, biogenesis, biological regulation, and some metabolic and developmental processes. A webserver that implements SLIDER is available at http://biomine.ece.ualberta.ca/SLIDER/.
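The LDR definition given in this abstract (30 or more consecutive disordered residues) can be illustrated with a minimal Python sketch. This is not SLIDER, which predicts LDR-containing proteins directly from sequence with logistic regression; the sketch only applies the definition to per-residue disorder annotations that are assumed to be available already. The function names and the dictionary-based proteome layout are hypothetical.

```python
from itertools import groupby


def has_long_disordered_region(disorder_flags, min_length=30):
    """Return True if the chain has a run of at least `min_length`
    consecutive disordered residues (truthy flags)."""
    return any(
        flag and sum(1 for _ in run) >= min_length
        for flag, run in groupby(disorder_flags)
    )


def fraction_with_ldr(proteome, min_length=30):
    """Fraction of chains in `proteome` (hypothetical layout: dict mapping
    a protein id to its per-residue flags) that contain an LDR."""
    if not proteome:
        return 0.0
    hits = sum(
        has_long_disordered_region(flags, min_length)
        for flags in proteome.values()
    )
    return hits / len(proteome)
```

Under these assumptions, `fraction_with_ldr` applied to a whole proteome gives the kind of per-proteome percentage that the abstract reports, although the published values come from SLIDER's own predictions rather than from this rule.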
Subject(s)
Algorithms , Protein Conformation , Proteins/chemistry , Proteins/genetics , Proteomics/methods , Software , Amino Acids/genetics , Databases, Protein , Internet , Logistic Models
ABSTRACT
Structural genomics programs have developed and applied structure-determination pipelines to a wide range of protein targets, facilitating the visualization of macromolecular interactions and the understanding of their molecular and biochemical functions. The fundamental question of whether three-dimensional structures of all proteins and all functional annotations can be determined using X-ray crystallography is investigated. A first-of-its-kind large-scale analysis of crystallization propensity for all proteins encoded in 1953 fully sequenced genomes was performed. It is shown that current X-ray crystallographic knowhow combined with homology modeling can provide structures for 25% of modeling families (protein clusters for which structural models can be obtained through homology modeling), with at least one structural model produced for each Gene Ontology functional annotation. The coverage varies between superkingdoms, with 19% for eukaryotes, 35% for bacteria and 49% for archaea, and with those of viruses following the coverage values of their hosts. It is shown that the crystallization propensities of proteomes from the taxonomic superkingdoms are distinct. The use of knowledge-based target selection is shown to substantially increase the ability to produce X-ray structures. It is demonstrated that the human proteome has one of the highest attainable coverage values among eukaryotes, and GPCR membrane proteins suitable for X-ray structure determination were determined.
Subject(s)
Crystallography, X-Ray/methods , Proteome/chemistry , Proteomics/methods , Animals , Databases, Protein , Humans , Protein Conformation , Proteins/chemistry , Receptors, G-Protein-Coupled/chemistry , Structural Homology, Protein
ABSTRACT
OBJECTIVE: To compare the osteoclastogenic capacity of peripheral blood mononuclear cells (PBMCs) from patients with osteoarthritis (OA) to that of PBMCs from self-reported normal individuals. METHODS: PBMCs from 140 patients with OA and 45 healthy donors were assayed for CD14+ expression and induced to differentiate into osteoclasts over 3 weeks in vitro. We assessed the number of osteoclasts, their resorptive activity, osteoclast apoptosis, and expression of the following cytokine receptors: RANK, interleukin-1 receptor type I (IL-1RI), and IL-1RII. A ridge logistic regression classifier was developed to discriminate OA patients from controls. RESULTS: PBMCs from OA patients gave rise to more osteoclasts that resorbed more bone surface than did PBMCs from controls. The number of CD14+ precursors was comparable in both groups, but there was less apoptosis in osteoclasts obtained from OA patients. Although no correlation was found between osteoclastogenic capacity and clinical or radiographic scores, levels of IL-1RI were significantly lower in cultures from patients with OA than in cultures from controls. Osteoclast apoptosis and expression levels of IL-1RI and IL-1RII were used to build a multivariate predictive model for OA. CONCLUSION: During 3 weeks of culture under identical conditions, monocytes from patients with OA display enhanced capacity to generate osteoclasts compared to cells from controls. Enhanced osteoclastogenesis is accompanied by increased resorptive activity, reduced osteoclast apoptosis, and diminished IL-1RI expression. These findings support the possibility that generalized changes in bone metabolism affecting osteoclasts participate in the pathophysiology of OA.
Subject(s)
Apoptosis/immunology , Bone Resorption/immunology , Cytokines/metabolism , Monocytes/cytology , Osteoarthritis/immunology , Osteoclasts/cytology , Aged , Aged, 80 and over , Bone Resorption/metabolism , Bone Resorption/physiopathology , Cell Culture Techniques , Female , Humans , Immunoblotting , Lipopolysaccharide Receptors , Male , Middle Aged , Monocytes/immunology , Monocytes/metabolism , Osteoarthritis/metabolism , Osteoclasts/metabolism , Osteoclasts/physiology , Reverse Transcriptase Polymerase Chain Reaction
ABSTRACT
Sequence-based prediction of protein secondary structure (SS) enjoys widespread and increasing use for the analysis and prediction of numerous structural and functional characteristics of proteins. The lack of a recent comprehensive and large-scale comparison of the numerous prediction methods results in an often arbitrary selection of an SS predictor. To address this void, we compare and analyze 12 popular, standalone and high-throughput predictors on a large set of 1975 proteins to provide in-depth, novel and practical insights. We show that there is no universally best predictor and thus detailed comparative studies are needed to support informed selection of SS predictors for a given application. Our study shows that the three-state accuracy (Q3) and segment overlap (SOV3) of the SS prediction currently reach 82% and 81%, respectively. We demonstrate that carefully designed consensus-based predictors improve the Q3 by an additional 2% and that homology modeling-based methods are significantly better by 1.5% Q3 than ab initio approaches. Our empirical analysis reveals that solvent exposed and flexible coils are predicted with a higher quality than the buried and rigid coils, while the inverse is true for the strands and helices. We also show that longer helices are easier to predict, which is in contrast to longer strands that are harder to find. The current methods confuse 1-6% of strand residues with helical residues and vice versa and they perform poorly for residues in the β-bridge and 3(10)-helix conformations. Finally, we compare predictions of the standalone implementations of four well-performing methods with their corresponding web servers.
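As an illustration of the three-state accuracy (Q3) quoted in this abstract, the following Python sketch computes the percentage of residues whose predicted state matches the observed state. It assumes that both inputs are equal-length strings over a three-letter alphabet such as H (helix), E (strand) and C (coil); the function name is hypothetical, and the segment overlap score (SOV3) is not covered here.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose predicted three-state secondary
    structure (e.g. 'H', 'E', 'C') equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Example: 8 of 10 residues agree.
print(q3_accuracy("HHHHCCEEEE", "HHHCCCEEEC"))  # prints 80.0
```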
Subject(s)
Algorithms , Protein Structure, Secondary , Proteins/chemistry , Databases, Protein , Models, Molecular , Solvents/chemistry
ABSTRACT
MOTIVATION: Nucleotides are multifunctional molecules that are essential for numerous biological processes. They serve as sources for chemical energy, participate in the cellular signaling and they are involved in the enzymatic reactions. The knowledge of the nucleotide-protein interactions helps with annotation of protein functions and finds applications in drug design. RESULTS: We propose a novel ensemble of accurate high-throughput predictors of binding residues from the protein sequence for ATP, ADP, AMP, GTP and GDP. Empirical tests show that our NsitePred method significantly outperforms existing predictors and approaches based on sequence alignment and residue conservation scoring. The NsitePred accurately finds more binding residues and binding sites and it performs particularly well for the sites with residues that are clustered close together in the sequence. The high predictive quality stems from the usage of novel, comprehensive and custom-designed inputs that utilize information extracted from the sequence, evolutionary profiles, several sequence-predicted structural descriptors and sequence alignment. Analysis of the predictive model reveals several sequence-derived hallmarks of nucleotide-binding residues; they are usually conserved and flanked by less conserved residues, and they are associated with certain arrangements of secondary structures and amino acid pairs in the specific neighboring positions in the sequence. AVAILABILITY: http://biomine.ece.ualberta.ca/nSITEpred/ CONTACT: lkurgan@ece.ualberta.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Nucleotides/metabolism , Proteins/chemistry , Software , Binding Sites , Nucleotides/chemistry , Protein Structure, Secondary , Sequence Alignment/methods , Structural Homology, Protein
ABSTRACT
MOTIVATION: Molecular recognition features (MoRFs) are short binding regions located within longer intrinsically disordered regions that bind to protein partners via disorder-to-order transitions. MoRFs are implicated in important processes including signaling and regulation. However, only a limited number of experimentally validated MoRFs is known, which motivates development of computational methods that predict MoRFs from protein chains. RESULTS: We introduce a new MoRF predictor, MoRFpred, which identifies all MoRF types (α, β, coil and complex). We develop a comprehensive dataset of annotated MoRFs to build and empirically compare our method. MoRFpred utilizes a novel design in which annotations generated by sequence alignment are fused with predictions generated by a Support Vector Machine (SVM), which uses a custom designed set of sequence-derived features. The features provide information about evolutionary profiles, selected physicochemical properties of amino acids, and predicted disorder, solvent accessibility and B-factors. Empirical evaluation on several datasets shows that MoRFpred outperforms related methods: α-MoRF-Pred that predicts α-MoRFs and ANCHOR which finds disordered regions that become ordered when bound to a globular partner. We show that our predicted (new) MoRF regions have non-random sequence similarity with native MoRFs. We use this observation along with the fact that predictions with higher probability are more accurate to identify putative MoRF regions. We also identify a few sequence-derived hallmarks of MoRFs. They are characterized by dips in the disorder predictions and higher hydrophobicity and stability when compared to adjacent (in the chain) residues. AVAILABILITY: http://biomine.ece.ualberta.ca/MoRFpred/; http://biomine.ece.ualberta.ca/MoRFpred/Supplement.pdf.
Subject(s)
Computational Biology/methods , Proteins/analysis , Sequence Alignment , Amino Acids , Binding Sites , Hydrophobic and Hydrophilic Interactions , Molecular Sequence Annotation , Protein Structure, Secondary , Support Vector Machine
ABSTRACT
Many proteins and protein regions are disordered in their native, biologically active states. These proteins/regions are abundant in different organisms and carry out important biological functions that complement the functional repertoire of ordered proteins. Viruses, with their highly compact genomes, small proteomes, and high adaptability for fast change in their biological and physical environment utilize many of the advantages of intrinsic disorder. In fact, viral proteins are generally rich in intrinsic disorder, and intrinsically disordered regions are commonly used by viruses to invade the host organisms, to hijack various host systems, and to help viruses in accommodation to their hostile habitats and to manage their economic usage of genetic material. In this review, we focus on the structural peculiarities of HIV-1 proteins, on the abundance of intrinsic disorder in viral proteins, and on the role of intrinsic disorder in their functions.
Subject(s)
HIV-1/chemistry , Retroviridae Proteins/chemistry , HIV-1/enzymology , HIV-1/metabolism , Models, Molecular , Protein Conformation , Retroviridae Proteins/metabolism
ABSTRACT
MOTIVATION: X-ray crystallography-based protein structure determination, which accounts for the majority of solved structures, is characterized by relatively low success rates. One solution is to build tools which support selection of targets that are more likely to crystallize. Several in silico methods that predict propensity of diffraction-quality crystallization from protein chains were developed. We show that the quality of their predictions drops when applied to more recent crystallization trials, which calls for new solutions. We propose a novel approach that alleviates drawbacks of the existing methods by using a recent dataset and improved protocol to annotate progress along the crystallization process, by predicting the success of the entire process and steps which result in the failed attempts, and by utilizing a compact and comprehensive set of sequence-derived inputs to generate accurate predictions. RESULTS: The proposed PPCpred (predictor of protein Production, Purification and Crystallization) predicts propensity for production of diffraction-quality crystals, production of crystals, purification and production of the protein material. PPCpred utilizes a comprehensive set of inputs based on energy and hydrophobicity indices, composition of certain amino acid types, predicted disorder, secondary structure and solvent accessibility, and content of certain buried and exposed residues. Our method significantly outperforms alignment-based predictions and several modern crystallization propensity predictors. Receiver operating characteristic (ROC) curves show that PPCpred is particularly useful for users who desire high true positive (TP) rates, i.e. low rate of mispredictions for solvable chains. Our model reveals several intuitive factors that influence the success of individual steps and the entire crystallization process, including the content of Cys, buried His and Ser, hydrophobic/hydrophilic segments and the number of predicted disordered segments. AVAILABILITY: http://biomine.ece.ualberta.ca/PPCpred/. CONTACT: lkurgan@ece.ualberta.ca.
Subject(s)
Crystallography, X-Ray/methods , Proteins/chemistry , Amino Acids/analysis , Hydrophobic and Hydrophilic Interactions , ROC Curve
ABSTRACT
BACKGROUND: Intrinsically disordered proteins play important roles in various cellular activities and their prevalence was implicated in a number of human diseases. The knowledge of the content of the intrinsic disorder in proteins is useful for a variety of studies including estimation of the abundance of disorder in protein families, classes, and complete proteomes, and for the analysis of disorder-related protein functions. The above investigations currently utilize the disorder content derived from the per-residue disorder predictions. We show that these predictions may over- or under-predict the overall amount of disorder, which motivates development of novel tools for direct and accurate sequence-based prediction of the disorder content. RESULTS: We hypothesize that sequence-level aggregation of input information may provide more accurate content prediction when compared with the content extracted from the local window-based residue-level disorder predictors. We propose a novel predictor, DisCon, that takes advantage of a small set of 29 custom-designed descriptors that aggregate and hybridize information concerning sequence, evolutionary profiles, and predicted secondary structure, solvent accessibility, flexibility, and annotation of globular domains. Using these descriptors and a ridge regression model, DisCon predicts the content with low, 0.05, mean squared error and high, 0.68, Pearson correlation. This is a statistically significant improvement over the content computed from outputs of ten modern disorder predictors on a test dataset with proteins that share low sequence identity with the training sequences. The proposed predictive model is analyzed to discuss factors related to the prediction of the disorder content. CONCLUSIONS: DisCon is a high-quality alternative for high-throughput annotation of the disorder content. We also empirically demonstrate that DisCon's predictions can be used to improve binary annotations of the disordered residues from the real-value disorder propensities generated by current residue-level disorder predictors. The web server that implements DisCon is available at http://biomine.ece.ualberta.ca/DisCon/.
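The quantities named in this abstract can be sketched in a few lines of Python: the disorder content of a chain (the fraction of its residues annotated as disordered), and the mean squared error and Pearson correlation between true and predicted content values. This is only an illustration of the evaluation measures, not DisCon's ridge regression model; the function names are hypothetical, and inputs are assumed to be non-empty, equal-length sequences with non-zero variance for the correlation.

```python
from math import sqrt


def disorder_content(flags):
    """Fraction of residues flagged as disordered (1/True) in one chain."""
    return sum(flags) / len(flags)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between true and predicted content."""
    return sum(
        (t - p) ** 2 for t, p in zip(true_values, predicted_values)
    ) / len(true_values)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

Deriving the content from thresholded per-residue predictions with `disorder_content` corresponds to the indirect route that the abstract compares against; a direct content predictor would instead output one value per chain.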
Subject(s)
Protein Folding , Proteins/chemistry , Amino Acid Sequence , Cell Physiological Phenomena , Humans , Protein Structure, Secondary , Proteins/metabolism
ABSTRACT
Membrane proteins (MPs) are difficult to identify in genomes and to crystallize, making it hard to determine their tertiary structures. MPs can be categorized into α-helical (AMP) and outer membrane proteins which mostly include beta barrel folds (OMBBs). The AMPs are relatively easy to predict from a protein sequence because they usually include several long membrane-spanning hydrophobic α-helices. The OMBBs play important roles in cell biology, they are targeted by multiple drugs, and they are more challenging to identify as they have shorter membrane-spanning regions which lack a folding pattern that is as consistent as in the case of the AMPs. Hence, accurate in silico methods for prediction of OMBBs from their primary sequences are needed. We present an accurate sequence-based predictor of OMBBs, called OMBBpred, which utilizes a Support Vector Machine classifier and a custom-designed set of 34 novel numerical descriptors derived from predicted secondary structures, hydrophobicity, and evolutionary information. Our method outperforms modern existing OMBB predictors and achieves accuracy of above 98% when tested on two existing benchmark datasets and 96% on a new large dataset. OMBBpred reduces the error rates of the second best method, depending on the dataset used, by between 13 and 65%, and generates predictions with high specificity of above 96%. Our solution is a useful tool for high-throughput discovery of the OMBBs on a genome scale and can be found at http://biomine.ece.ualberta.ca/OMBBpred/OMBBpred.htm.
Subject(s)
Membrane Proteins/chemistry , Models, Molecular , Amino Acid Sequence , Bacterial Outer Membrane Proteins/chemistry , Chloroplasts/chemistry , Computer Simulation , Evolution, Molecular , Hydrophobic and Hydrophilic Interactions , Membrane Proteins/classification , Mitochondrial Proteins/chemistry , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Homology, Amino Acid
ABSTRACT
MOTIVATION: Intrinsically disordered proteins play a crucial role in numerous regulatory processes. Their abundance and ubiquity combined with a relatively low quantity of their annotations motivate research toward the development of computational models that predict disordered regions from protein sequences. Although the prediction quality of these methods continues to rise, novel and improved predictors are urgently needed. RESULTS: We propose a novel method, named MFDp (Multilayered Fusion-based Disorder predictor), that aims to improve over the current disorder predictors. MFDp is an ensemble of 3 Support Vector Machines specialized for the prediction of short, long and generic disordered regions. It combines three complementary disorder predictors, sequence, sequence profiles, predicted secondary structure, solvent accessibility, backbone dihedral torsion angles, residue flexibility and B-factors. Our method utilizes a custom-designed set of features that are based on raw predictions and aggregated raw values and recognizes various types of disorder. The MFDp is compared at the residue level on two datasets against eight recent disorder predictors and top-performing methods from the most recent CASP8 experiment. In spite of using training chains with
Subject(s)
Protein Structure, Tertiary , Proteins/chemistry , Software , Algorithms , Amino Acid Sequence , Base Sequence , Databases, Protein , Protein Structure, Secondary
ABSTRACT
BACKGROUND: ATP is a ubiquitous nucleotide that provides energy for cellular activities, catalyzes chemical reactions, and is involved in cellular signalling. The knowledge of the ATP-protein interactions helps with annotation of protein functions and finds applications in drug design. The sequence to structure annotation gap motivates development of high-throughput sequence-based predictors of the ATP-binding residues. Moreover, our empirical tests show that the only existing predictor, ATPint, is characterized by relatively low predictive quality. METHODS: We propose a novel, high-throughput machine learning-based predictor, ATPsite, which identifies ATP-binding residues from protein sequences. Our predictor utilizes a Support Vector Machine classifier and a comprehensive set of input features that are based on the sequence, evolutionary profiles, and the sequence-predicted structural descriptors including secondary structure, solvent accessibility, and dihedral angles. RESULTS: The ATPsite achieves significantly higher Matthews Correlation Coefficient (MCC) and Area Under the ROC Curve (AUC) values when compared with the existing methods including the ATPint, conservation-based rate4site, and alignment-based BLAST predictors. We also assessed the effectiveness of individual input types. The PSSM profile, the conservation scores, and certain features based on amino acid groups are shown to be more effective in predicting the ATP-binding residues than the remaining feature groups. CONCLUSIONS: Statistical tests show that ATPsite significantly outperforms existing solutions. The consensus of the ATPsite with the sequence-alignment based predictor is shown to give further improvements.
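The Matthews Correlation Coefficient mentioned in this abstract can be sketched in Python for per-residue binary labels (1 for binding, 0 for non-binding). The code below uses the standard confusion-matrix formula; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption, and the function names are hypothetical rather than taken from the ATPsite evaluation.

```python
from math import sqrt


def confusion_counts(true_labels, predicted_labels):
    """Count TP, TN, FP, FN for binary labels (1 = binding residue)."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (assumed convention)."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

MCC is informative for this kind of task because binding residues are typically a small minority of all residues, so plain accuracy can look high even for a predictor that labels almost everything as non-binding.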
ABSTRACT
BACKGROUND: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high-identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. RESULTS: The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes, including conservation of residues and the arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on the two largest datasets. The proposed predictor provides predictions at 58% accuracy for the membrane protein class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
CONCLUSIONS: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
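The error rate reductions quoted above are relative quantities. A minimal sketch of the arithmetic follows, assuming the usual definition (drop in error divided by the baseline error). The function name and the accuracies in the usage lines are hypothetical and are not values reported for MODAS or its competitors.

```python
def error_rate_reduction(baseline_accuracy, new_accuracy):
    """Relative reduction in error rate when moving from a baseline to a new predictor.

    Both arguments are accuracies expressed as fractions in [0, 1].
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors; relative reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies, for illustration only.
reduction = error_rate_reduction(baseline_accuracy=0.80, new_accuracy=0.84)
print(f"relative error rate reduction: {reduction:.1%}")
```

Note that a few points of accuracy gained at an already high baseline correspond to a large relative reduction in error, which is why both figures are worth reporting.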
Subject(s)
Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein , Databases, Protein , Protein Conformation , Protein Folding , Sequence Alignment
ABSTRACT
Production of high-quality crystals is one of the main bottlenecks in X-ray crystallography-driven protein structure determination. Availability of structure determination data repositories, such as TargetDB and PepcDB, and flexibility in target selection in structural genomics motivate development of methods that predict crystallization propensity from a given protein sequence. We introduce a novel linear model tree-based meta-predictor, MetaPPCP, which takes advantage of the complementarity of state-of-the-art protein crystallization propensity predictors to provide predictions with about 80% accuracy. Our method combines predictions of XtalPred and CRYSTALP2 with information concerning isoelectric point, hydropathy and number of solved structures for similar sequences. Empirical comparison shows that MetaPPCP outperforms current predictors including OB-Score, XtalPred, ParCrys, and CRYSTALP2. MetaPPCP obtains over 92% accuracy for over half of its predictions, namely those that have probability (propensity to be predicted as crystallizable or crystallization resistant) above 0.8. The proposed method could provide useful input for target selection procedures of current structural genomics efforts.
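The high-confidence result above pairs a coverage figure (the fraction of predictions above the probability cutoff) with the accuracy on that subset. A minimal sketch of that calculation follows. It assumes the predictor outputs a probability of the crystallizable class and takes the confidence of a prediction as the larger of p and 1 - p. The function name, this reading of "probability", and the toy data are assumptions, not MetaPPCP's actual procedure or results.

```python
def confident_subset_accuracy(probs, labels, threshold=0.8):
    """Return (coverage, accuracy) for predictions whose confidence exceeds threshold.

    probs  : predicted probability of the positive class (assumed: crystallizable)
    labels : true classes, 1 = crystallizable, 0 = crystallization resistant
    """
    kept = 0
    correct = 0
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)  # propensity toward whichever class is predicted
        if confidence > threshold:
            kept += 1
            predicted = 1 if p >= 0.5 else 0
            correct += int(predicted == y)
    coverage = kept / len(probs) if probs else 0.0
    accuracy = correct / kept if kept else float("nan")
    return coverage, accuracy


# Hypothetical predictions, for illustration only.
probs = [0.95, 0.10, 0.60, 0.85, 0.30]
labels = [1, 0, 1, 0, 0]
coverage, accuracy = confident_subset_accuracy(probs, labels)
```

Reporting both numbers matters: raising the threshold usually increases accuracy on the retained subset while shrinking the share of targets for which a confident call is available.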