Results 1 - 20 of 25
1.
J Org Chem ; 85(14): 9367-9374, 2020 Jul 17.
Article in English | MEDLINE | ID: mdl-32578986

ABSTRACT

The dearomatizing spirocyclization of phenolic biarylic ketones using PhI(OCOCF3)2 as oxidant is presented. The reaction affords various cyclohexadienones through C-C bond cleavage under mild conditions. Mechanistic investigations reveal that an exocyclic enol ether acts as the key intermediate in the transformation.

2.
Mol Genet Genomics ; 289(3): 489-99, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24448651

ABSTRACT

Protein-DNA interactions play important roles in many biological processes. To understand the molecular mechanisms of protein-DNA interaction, it is necessary to identify the DNA-binding sites in DNA-binding proteins. In the last decade, computational approaches have been developed to predict protein-DNA-binding sites based solely on protein sequences. In this study, we developed a novel predictor based on support vector machine algorithm coupled with the maximum relevance minimum redundancy method followed by incremental feature selection. We incorporated not only features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure, solvent accessibility, but also five three-dimensional (3D) structural features calculated from PDB data to predict the protein-DNA interaction sites. Feature analysis showed that 3D structural features indeed contributed to the prediction of DNA-binding site and it was demonstrated that the prediction performance was better with 3D structural features than without them. It was also shown via analysis of features from each site that the features of DNA-binding site itself contribute the most to the prediction. Our prediction method may become a useful tool for identifying the DNA-binding sites and the feature analysis described in this paper may provide useful insights for in-depth investigations into the mechanisms of protein-DNA interaction.


Subject(s)
Binding Sites , Computational Biology/methods , DNA-Binding Proteins/chemistry , DNA/chemistry , Support Vector Machine , Algorithms , DNA/metabolism , DNA-Binding Proteins/metabolism , Molecular Conformation , Protein Binding , Reproducibility of Results
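
The maximum relevance minimum redundancy (mRMR) ranking named in the abstract above can be sketched as a greedy loop: at each step, pick the feature whose mutual information with the class label is high and whose average mutual information with the already chosen features is low. This is a minimal sketch only. It assumes discrete feature values and the difference form of the criterion, and the names `mutual_information` and `mrmr_rank` are hypothetical, not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels):
    """Greedy mRMR ranking (assumed difference criterion).

    features: dict mapping a feature name to its list of discrete values
    labels:   list of class labels, same length as every feature column
    Returns the feature names ordered from most to least preferred.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    remaining = list(features)
    ranked = []

    def score(name):
        # Relevance minus mean redundancy with the features chosen so far.
        if not ranked:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[chosen])
            for chosen in ranked
        ) / len(ranked)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In a toy case where one feature is an exact copy of another, the copy is pushed to the end of the ranking because its redundancy cancels its relevance.
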
3.
J Biomol Struct Dyn ; 33(11): 2479-90, 2015.
Article in English | MEDLINE | ID: mdl-25616595

ABSTRACT

Lysine acetylation and ubiquitination are two primary post-translational modifications (PTMs) in most eukaryotic proteins. Lysine residues are targets for both types of PTMs, resulting in different cellular roles. With the increasing availability of protein sequences and PTM data, it is challenging to distinguish the two types of PTMs on lysine residues. Experimental approaches are often laborious and time consuming. There is an urgent need for computational tools to distinguish between lysine acetylation and ubiquitination. In this study, we developed a novel method, called DAUFSA (distinguish between lysine acetylation and lysine ubiquitination with feature selection and analysis), to discriminate ubiquitinated and acetylated lysine residues. The method incorporated several types of features: PSSM (position-specific scoring matrix) conservation scores, amino acid factors, secondary structures, solvent accessibilities, and disorder scores. By using the mRMR (maximum relevance minimum redundancy) method and the IFS (incremental feature selection) method, an optimal feature set containing 290 features was selected from all incorporated features. A dagging-based classifier built from the optimal features achieved a classification accuracy of 69.53%, with an MCC of 0.3853. An optimal feature set analysis showed that the PSSM conservation score features and the amino acid factor features were the most important attributes, suggesting differences between acetylation and ubiquitination. Our study results also supported previous findings that different motifs were employed by acetylation and ubiquitination. The feature differences between the two modifications revealed in this study are worthy of experimental validation and further investigation.


Subject(s)
Lysine/chemistry , Lysine/metabolism , Acetylation , Amino Acid Sequence , Computational Biology/methods , Conserved Sequence , Databases, Genetic , Position-Specific Scoring Matrices , Protein Conformation , Protein Processing, Post-Translational , Ubiquitination
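
The abstract above reports both accuracy and the Matthews correlation coefficient (MCC). For readers unfamiliar with the two measures, the sketch below computes them from the four cells of a binary confusion matrix. The function names are hypothetical, and returning 0.0 when the MCC denominator vanishes is a common convention assumed here, not something stated in the source.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an empty row or column yields 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, the MCC stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside accuracy on imbalanced datasets.
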
4.
PLoS One ; 9(2): e88300, 2014.
Article in English | MEDLINE | ID: mdl-24505469

ABSTRACT

Lung cancer is one of the leading causes of cancer mortality worldwide, and non-small cell lung cancer (NSCLC) accounts for most cases. NSCLC can be further divided into adenocarcinoma (ACA) and squamous cell carcinoma (SCC). It is of great value to distinguish these two subgroups clinically. In this study, we compared the genome-wide copy number alteration (CNA) patterns of 208 early stage ACA and 93 early stage SCC tumor samples. As a result, 266 CNA probes stood out for better discrimination of ACA and SCC. It was revealed that the genes corresponding to these 266 probes were enriched in lung cancer related pathways and in the chromosome regions where CNAs usually occur in lung cancer. This study sheds light on the CNA study of NSCLC and provides some insights into the epigenetics of NSCLC.


Subject(s)
Adenocarcinoma/genetics , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Squamous Cell/genetics , DNA Copy Number Variations , Lung Neoplasms/genetics , Adenocarcinoma/classification , Adenocarcinoma/pathology , Carcinoma, Non-Small-Cell Lung/classification , Carcinoma, Non-Small-Cell Lung/pathology , Carcinoma, Squamous Cell/classification , Carcinoma, Squamous Cell/pathology , Gene Dosage , Humans , Lung/metabolism , Lung/pathology , Lung Neoplasms/classification , Lung Neoplasms/pathology
5.
PLoS One ; 9(9): e107464, 2014.
Article in English | MEDLINE | ID: mdl-25222670

ABSTRACT

Post-translational modifications (PTMs) are crucial steps in protein synthesis and are important factors contributing to protein diversity. PTMs play important roles in the regulation of gene expression, protein stability and metabolism. Lysine residues in protein sequences have been found to be targeted for both types of PTMs: sumoylations and acetylations; however, each PTM has a different cellular role. As experimental approaches are often laborious and time consuming, it is challenging to distinguish the two types of PTMs on lysine residues using computational methods. In this study, we developed a method to discriminate between sumoylated lysine residues and acetylated residues. The method incorporated several features: PSSM conservation scores, amino acid factors, secondary structures, solvent accessibilities and disorder scores. By using the mRMR (Maximum Relevance Minimum Redundancy) method and the IFS (Incremental Feature Selection) method, an optimal feature set was selected from all of the incorporated features, with which the classifier achieved 92.14% accuracy with an MCC value of 0.7322. Analysis of the optimal feature set revealed some differences between acetylation and sumoylation. The results from our study also supported the previous finding that there exist different consensus motifs for the two types of PTMs. The results could suggest possible dominant factors governing the acetylation and sumoylation of lysine residues, shedding some light on the modification dynamics and molecular mechanisms of the two types of PTMs, and provide guidelines for experimental validations.


Subject(s)
Lysine/metabolism , Acetylation , Sumoylation
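
Incremental feature selection (IFS), as used in the abstract above, takes a ranked feature list and evaluates the nested subsets made of the top 1, top 2, and so on up to all features, keeping the subset with the best score. The sketch below illustrates the idea under stated assumptions: the evaluator is a leave-one-out 1-nearest-neighbour accuracy chosen only to keep the example self-contained (the paper's classifier and scoring may differ), and all function names are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `cols`."""
    correct = 0
    for i, row in enumerate(rows):
        # Nearest other sample by squared Euclidean distance on the chosen columns.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def incremental_feature_selection(rows, labels, ranked_cols,
                                  evaluate=loo_1nn_accuracy):
    """Score ranked_cols[:1], ranked_cols[:2], ... and return the best subset.

    Returns (selected columns, best score, list of scores per subset size).
    """
    best_k, best_score, curve = 0, float("-inf"), []
    for k in range(1, len(ranked_cols) + 1):
        score = evaluate(rows, labels, ranked_cols[:k])
        curve.append(score)
        if score > best_score:  # strict '>' keeps the smallest subset on ties
            best_k, best_score = k, score
    return ranked_cols[:best_k], best_score, curve
```

The list of scores per subset size is what is usually plotted as an IFS curve; its peak marks the size of the optimal feature set.
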
6.
PLoS One ; 9(1): e86729, 2014.
Article in English | MEDLINE | ID: mdl-24466214

ABSTRACT

Aptamers are oligonucleic acid or peptide molecules that bind to specific target molecules. As a novel and powerful class of ligands, aptamers are thought to have excellent potential for applications in the fields of biosensing, diagnostics and therapeutics. In this study, a new method for predicting aptamer-target interacting pairs was proposed by integrating features derived from both aptamers and their targets. Features of nucleotide composition and traditional amino acid composition as well as pseudo amino acid were utilized to represent aptamers and targets, respectively. The predictor was constructed based on Random Forest and the optimal features were selected by using the maximum relevance minimum redundancy (mRMR) method and the incremental feature selection (IFS) method. As a result, 81.34% accuracy and 0.4612 MCC were obtained for the training dataset, and 77.41% accuracy and 0.3717 MCC were achieved for the testing dataset. An optimal set of 220 features was selected; these were considered the features that contributed significantly to the interacting aptamer-target pair predictions. Analysis of the optimal feature set indicated several important factors in determining aptamer-target interactions. It is anticipated that our prediction method may become a useful tool for identifying aptamer-target pairs and the features selected and analyzed in this study may provide useful insights into the mechanism of interactions between aptamers and targets.


Subject(s)
Aptamers, Nucleotide/chemistry , Aptamers, Peptide/chemistry , Computational Biology/methods , Models, Genetic , Algorithms , Amino Acids/analysis , Artificial Intelligence , Base Composition , Ligands , Structure-Activity Relationship
7.
Biomed Res Int ; 2014: 438341, 2014.
Article in English | MEDLINE | ID: mdl-25184139

ABSTRACT

Protein S-nitrosylation plays a very important role in a wide variety of cellular biological activities. Accurate prediction of S-nitrosylation sites remains a great challenge. In this paper, we presented a framework to computationally predict S-nitrosylation sites based on kernel sparse representation classification and the minimum Redundancy Maximum Relevance algorithm. As many as 666 features derived from five categories of amino acid properties and one protein structure feature are used for numerical representation of proteins. A total of 529 protein sequences collected from open-access databases and the published literature are used to train and test our predictor. Computational results show that our predictor achieves Matthews' correlation coefficients of 0.1634 and 0.2919 for the training set and the testing set, respectively, which are better than those of the k-nearest neighbor algorithm, the random forest algorithm, and the sparse representation classification algorithm. The experimental results also indicate that 134 optimal features can better represent the peptides of protein S-nitrosylation than the original 666 redundant features. Furthermore, we constructed an independent testing set of 113 protein sequences to evaluate the robustness of our predictor. Experimental results showed that our predictor also yielded good performance on the independent testing set, with a Matthews' correlation coefficient of 0.2239.


Subject(s)
Algorithms , Computational Biology , Protein Processing, Post-Translational , Proteins/chemistry , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/genetics , Databases, Protein , Protein Structure, Tertiary , Proteins/genetics , Proteins/metabolism , Software
8.
Protein Pept Lett ; 20(3): 352-63, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22591477

ABSTRACT

Colorectal cancer (CRC) is one of the most malignant cancers. A growing number of studies have shown that both genetics and epigenetics play important roles in the etiology of CRC. Both microRNA (miRNA) and DNA methylation fall within the scope of epigenetics, and there are complex regulatory mechanisms within miRNA and DNA methylation. We compiled 71 CRC related genes and 134 CRC related miRNAs. Then we identified 417 feed forward loops (FFLs) and 37 feedback loops (FBLs) among these genes, miRNAs and transcription factors (TFs). We constructed a network of miRNAs and TFs mediation for CRC utilizing these FFLs and FBLs. Statistical tests proved that these FFLs were significantly enriched in CRC compared with esophageal cancer, breast cancer and randomly selected CRCmiRNA-gene pairs. Analysis of the network singled out 3 core genes, 2 core miRNAs and 5 core TFs. The KEGG enrichment and GO enrichment for the 2 core miRNA target genes indicated that they were significantly enriched in CRC related pathways (e.g., the MARK pathway, the TGFβ pathway and the cell cycle). Through the investigation of methylation, we found that most of the CRC related genes and miRNAs were prone to be regulated by methylation. This study sheds light on the regulatory mechanisms in CRC and provides some insights into the epigenetics of CRC.


Subject(s)
Colorectal Neoplasms/genetics , DNA Methylation/genetics , Gene Regulatory Networks , MicroRNAs/genetics , Colorectal Neoplasms/metabolism , Colorectal Neoplasms/pathology , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Humans , Metabolic Networks and Pathways , MicroRNAs/metabolism , Transcription Factors/genetics
9.
Biomed Res Int ; 2013: 304029, 2013.
Article in English | MEDLINE | ID: mdl-23998122

ABSTRACT

One of the most important and challenging problems in biomedicine is how to predict cancer related genes. Retinoblastoma (RB) is the most common primary intraocular malignancy, usually occurring in childhood. Early detection of RB could reduce morbidity and increase the probability of disease-free survival. Therefore, it is of great importance to identify RB genes. In this study, we developed a computational method to predict RB related genes based on Dagging, with the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). 119 RB genes were compiled from two previous RB related studies, while 5,500 non-RB genes were randomly selected from Ensembl genes. Ten datasets were constructed based on all these RB and non-RB genes. Each gene was encoded with a 13,126-dimensional vector including 12,887 Gene Ontology enrichment scores and 239 KEGG enrichment scores. Finally, an optimal feature set including 1061 GO terms and 8 KEGG pathways was obtained. Analysis showed that these features were closely related to RB. It is anticipated that the method can be applied to predict other cancer related genes as well.


Subject(s)
Databases, Genetic , Gene Ontology , Genes, Neoplasm/genetics , Genetic Markers/genetics , Models, Genetic , Neoplasm Proteins/genetics , Retinoblastoma/genetics , Computer Simulation , Data Mining/methods , Humans
10.
Mol Biosyst ; 9(11): 2729-40, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24056952

ABSTRACT

Protein carbamylation is one of the important post-translational modifications, which plays a pivotal role in a number of biological conditions, such as diseases, chronic renal failure and atherosclerosis. Therefore, recognition and identification of protein carbamylated sites are essential for disease treatment and prevention. Yet the mechanism of action of carbamylated lysine sites is still not understood, and uncovering it, whether experimentally or theoretically, remains a largely unsolved challenge. To address this problem, we have presented a computational framework for theoretically predicting and analyzing carbamylated lysine sites based on both the one-class k-nearest neighbor method and two-stage feature selection. The one-class k-nearest neighbor method requires no negative samples in training. Experimental results showed that by using 280 optimal features the presented method achieved promising performances of SN=82.50% for the jackknife test on the training set, and SN=66.67%, SP=100.00% and MCC=0.8097 for the independent test on the testing set. Further analysis of the optimal features provided insights into the mechanism of action of carbamylated lysine sites. It is anticipated that our method could be a potentially useful and essential tool for biologists to theoretically investigate carbamylated lysine sites.


Subject(s)
Computational Biology/methods , Lysine/metabolism , Protein Processing, Post-Translational , Proteins/metabolism , Acetylation , Algorithms , Databases, Protein , Position-Specific Scoring Matrices , Proteins/chemistry , ROC Curve , Reproducibility of Results , Sensitivity and Specificity , Sumoylation , Ubiquitination
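
The abstract above notes that a one-class k-nearest neighbor method needs no negative samples in training. The sketch below shows one simple way such a classifier can work: it learns a distance threshold from the positive samples alone and accepts a query only if its mean distance to its k nearest positives stays within that threshold. The abstract does not specify which variant the authors used, so the thresholding rule, the `quantile` parameter and the class name `OneClassKNN` are all assumptions made for illustration.

```python
from math import dist


def knn_mean_distance(point, samples, k):
    """Mean Euclidean distance from `point` to its k nearest `samples`."""
    distances = sorted(dist(point, s) for s in samples)
    return sum(distances[:k]) / k


class OneClassKNN:
    """Hypothetical one-class k-NN trained on positive samples only."""

    def __init__(self, k=3, quantile=0.9):
        self.k = k
        self.quantile = quantile

    def fit(self, positives):
        self.positives = [tuple(p) for p in positives]
        # Score each positive against the others (leave-one-out) and take
        # a high quantile of those scores as the acceptance threshold.
        scores = sorted(
            knn_mean_distance(
                p, self.positives[:i] + self.positives[i + 1:], self.k
            )
            for i, p in enumerate(self.positives)
        )
        index = min(len(scores) - 1, int(self.quantile * len(scores)))
        self.threshold = scores[index]
        return self

    def predict(self, point):
        """True if `point` looks like the positive class, else False."""
        return knn_mean_distance(point, self.positives, self.k) <= self.threshold
```

Because only positives are used, the threshold controls the trade-off between sensitivity and specificity; a lower quantile makes the model stricter.
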
11.
Mol Biosyst ; 9(1): 61-9, 2013 Jan 27.
Article in English | MEDLINE | ID: mdl-23117653

ABSTRACT

Identification of catalytic residues plays a key role in understanding how enzymes work. Although numerous computational methods have been developed to predict catalytic residues and active sites, the prediction accuracy remains relatively low with high false positives. In this work, we developed a novel predictor based on the Random Forest algorithm (RF) aided by the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility to predict active sites of enzymes and achieved an overall accuracy of 0.885687 and MCC of 0.689226 on an independent test dataset. Feature analysis showed that every category of the features except disorder contributed to the identification of active sites. It was also shown via the site-specific feature analysis that the features derived from the active site itself contributed most to the active site determination. Our prediction method may become a useful tool for identifying the active sites and the key features identified by the paper may provide valuable insights into the mechanism of catalysis.


Subject(s)
Computational Biology/methods , Enzymes/chemistry , Enzymes/metabolism , Models, Chemical , Catalytic Domain , Chemical Phenomena , Conserved Sequence , Databases, Protein , Decision Trees , Protein Structure, Secondary , Sequence Analysis, Protein , Structure-Activity Relationship , Support Vector Machine
12.
PLoS One ; 8(5): e63494, 2013.
Article in English | MEDLINE | ID: mdl-23658834

ABSTRACT

Colorectal cancer can be grouped into Dukes A, B, C, and D stages based on its development. Generally speaking, patients at more advanced stages have a poorer prognosis. To integrate progression stage prediction systems with recurrence prediction systems, we proposed an ensemble prognostic model for colorectal cancer. In this model, each patient was assigned a most probable stage and a most probable recurrence status. If a patient was predicted to be a recurrence patient in an advanced stage, the patient was classified into the high risk group. The ensemble model considered both progression stages and recurrence status. High risk patients and low risk patients predicted by the ensemble model had significantly different disease free survival (log-rank test p-value, 0.0016) and disease specific survival (log-rank test p-value, 0.0041). The ensemble model can better distinguish the high risk and low risk patients than the stage prediction model and the recurrence prediction model alone. This method could be applied to the studies of other diseases and it could significantly improve the prediction performance by ensembling heterogeneous information.


Subject(s)
Colorectal Neoplasms/diagnosis , Models, Statistical , Colorectal Neoplasms/pathology , Disease Progression , Disease-Free Survival , Humans , Neoplasm Staging , Recurrence , Risk Assessment , Survival Rate
13.
Biomed Res Int ; 2013: 723780, 2013.
Article in English | MEDLINE | ID: mdl-24083237

ABSTRACT

Drug combinatorial therapy could be more effective in treating some complex diseases than single agents due to better efficacy and reduced side effects. Although some drug combinations are being used, their underlying molecular mechanisms are still poorly understood. Therefore, it is of great interest to deduce novel drug combinations from their molecular mechanisms in a robust and rigorous way. This paper attempts to predict effective drug combinations by a combined consideration of: (1) chemical interaction between drugs, (2) protein interactions between drugs' targets, and (3) target enrichment of KEGG pathways. A benchmark dataset was constructed, consisting of 121 confirmed effective combinations and 605 random combinations. Each drug combination was represented by 465 features derived from the aforementioned three properties. Some feature selection techniques, including Minimum Redundancy Maximum Relevance and Incremental Feature Selection, were adopted to extract the key features. A random forest model was built and its performance evaluated by 5-fold cross-validation. As a result, 55 key features providing the best prediction result were selected. These important features may help to gain insights into the mechanisms of drug combinations, and the proposed prediction model could become a useful tool for screening possible drug combinations.


Subject(s)
Computational Biology/methods , Drug Combinations , Drug Interactions , Pharmaceutical Preparations/metabolism , Proteins/metabolism , Signal Transduction , Algorithms , ROC Curve
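
The 5-fold cross-validation mentioned in the abstract above splits the dataset into five disjoint folds, and each fold serves once as the test set while the other four form the training set. The sketch below generates such index splits for a dataset of 726 samples (121 + 605, the benchmark size given in the abstract). The function name, the fixed seed and the plain (non-stratified) shuffling are assumptions; the paper may have partitioned its data differently.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Usage with the benchmark size from the abstract: 121 + 605 = 726 samples.
splits = list(k_fold_indices(121 + 605, k=5))
```

Averaging a performance measure over the five test folds gives the cross-validated estimate; every sample is tested exactly once.
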
14.
Biomed Res Int ; 2013: 414327, 2013.
Article in English | MEDLINE | ID: mdl-23710446

ABSTRACT

With a large number of disordered proteins and their important functions discovered, it is highly desirable to develop effective methods to computationally predict protein disordered regions. In this study, based on Random Forest (RF), Maximum Relevancy Minimum Redundancy (mRMR), and Incremental Feature Selection (IFS), we developed a new method to predict disordered regions in proteins. The mRMR criterion was used to rank the importance of all candidate features. Finally, the top 128 features were selected from the ranked feature list to build the optimal model, including 92 Position Specific Scoring Matrix (PSSM) conservation score features and 36 secondary structure features. As a result, a Matthews correlation coefficient (MCC) of 0.3895 was achieved on the training set by 10-fold cross-validation. On the basis of the prediction results for each query sequence, we used the scanning and modification strategy to improve the performance. The accuracy (ACC) and MCC were increased by 4% and almost 0.2%, respectively, compared with three other popular predictors: DISOPRED, DISOclust, and OnD-CRF. The selected features may shed some light on the understanding of the formation mechanism of disordered structures, providing guidelines for experimental validation.


Subject(s)
Algorithms , Computational Biology , Proteins/chemistry , Sequence Analysis, Protein , Position-Specific Scoring Matrices , Protein Structure, Secondary , Protein Structure, Tertiary
15.
Biomed Res Int ; 2013: 267375, 2013.
Article in English | MEDLINE | ID: mdl-23762832

ABSTRACT

Lung cancer is one of the leading causes of cancer mortality worldwide. The main types of lung cancer are small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). In this work, a computational method was proposed for identifying lung-cancer-related genes with a shortest path approach in a protein-protein interaction (PPI) network. Based on the PPI data from STRING, a weighted PPI network was constructed. 54 NSCLC- and 84 SCLC-related genes were retrieved from associated KEGG pathways. Then the shortest paths between each pair of these 54 NSCLC genes and 84 SCLC genes were obtained with Dijkstra's algorithm. Finally, all the genes on the shortest paths were extracted, and 25 and 38 shortest path genes with a permutation P value less than 0.05 were selected for NSCLC and SCLC, respectively, for further analysis. Some of the shortest path genes have been reported to be related to lung cancer. Intriguingly, the candidate genes we identified from the PPI network contained more cancer genes than those identified from the gene expression profiles. Furthermore, these genes possessed more functional similarity with the known cancer genes than those identified from the gene expression profiles. This study proved the efficiency of the proposed method and showed promising results.


Subject(s)
Gene Expression Profiling/methods , Genes, Neoplasm/genetics , Lung Neoplasms/genetics , Protein Interaction Maps/genetics , Carcinoma, Non-Small-Cell Lung/genetics , Gene Expression Regulation, Neoplastic , Genetic Association Studies , Humans , Small Cell Lung Carcinoma/genetics
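
The shortest path approach in the abstract above can be illustrated with a small sketch: run Dijkstra's algorithm between every pair of known disease genes in a weighted network and collect the intermediate genes that lie on those paths as candidates. The gene names and edge weights used in the usage example are hypothetical, the function names are not from the paper, and the permutation test that the authors applied afterwards is not reproduced here.

```python
import heapq
from itertools import combinations


def dijkstra_path(graph, source, target):
    """Shortest path from source to target, or None if unreachable.

    graph: dict mapping node -> dict of neighbour -> non-negative weight
    """
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


def shortest_path_genes(graph, seeds):
    """Genes that lie strictly inside a shortest path between two seed genes."""
    found = set()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            found.update(path[1:-1])
    return found - set(seeds)


def undirected(edges):
    """Build a symmetric adjacency dict from (node, node, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


# Hypothetical toy network: G1..G3 are known disease genes, X/Y/Z are not.
toy = undirected([
    ("G1", "X", 0.1), ("X", "G2", 0.1), ("G1", "G2", 0.9),
    ("G2", "Y", 0.2), ("Y", "G3", 0.2), ("G1", "G3", 0.3),
    ("G3", "Z", 0.5),
])
candidates = shortest_path_genes(toy, {"G1", "G2", "G3"})
```

In the toy network, X and Y are returned because they sit on the cheapest routes between seed genes, while Z is ignored because no shortest path between seeds passes through it.
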
16.
PLoS One ; 8(6): e66678, 2013.
Article in English | MEDLINE | ID: mdl-23805260

ABSTRACT

Most pyruvoyl-dependent proteins observed in prokaryotes and eukaryotes are critical regulatory enzymes, which are primary targets of inhibitors for anti-cancer and anti-parasitic therapy. These proteins undergo an autocatalytic, intramolecular self-cleavage reaction in which a covalently bound pyruvoyl group is generated on a conserved serine residue. Traditional detection of the modified serine sites is performed by experimental approaches, which are often labor-intensive and time-consuming. In this study, we made an initial attempt at the computational prediction of such serine sites with Feature Selection based on a Random Forest. Since only a small number of experimentally verified pyruvoyl-modified proteins are collected in the current version of the protein database, we only used a small dataset in this study. After removing proteins with sequence identities >60%, a non-redundant dataset was generated and used, which contained only 46 proteins, with one pyruvoyl serine site for each protein. Several types of features were considered in our method including PSSM conservation scores, disorders, secondary structures, solvent accessibilities, amino acid factors and amino acid occurrence frequencies. As a result, a good performance was achieved on our dataset. The best 100.00% accuracy and 1.0000 MCC value were obtained from the training dataset, and 93.75% accuracy and 0.8441 MCC value from the testing dataset. The optimal feature set contained 9 features. Analysis of the optimal feature set indicated the important roles of some specific features in determining the pyruvoyl-group-serine sites, which were consistent with several results of earlier experimental studies. These selected features may shed some light on the in-depth understanding of the mechanism of the post-translational self-maturation process, providing guidelines for experimental validation.
Further work should be done as more pyruvoyl-modified proteins are found, and the method should be evaluated on larger datasets. Finally, the prediction software can be downloaded from http://www.nkbiox.com/sub/pyrupred/index.html.


Subject(s)
Computational Biology/methods , Proteins/metabolism , Serine/metabolism , Algorithms , Area Under Curve , Databases, Protein , ROC Curve
17.
PLoS One ; 8(6): e65207, 2013.
Article in English | MEDLINE | ID: mdl-23762317

ABSTRACT

Acquired immune deficiency syndrome (AIDS) is a severe infectious disease that causes a large number of deaths every year. Traditional anti-AIDS drugs directly targeting the HIV-1 encoded enzymes including reverse transcriptase (RT), protease (PR) and integrase (IN) usually suffer from drug resistance after a period of treatment and serious side effects. In recent years, the emergence of a large amount of useful information on protein-protein interactions (PPI) in the HIV life cycle and related inhibitors has made PPI a new avenue for antiviral drug intervention. In this study, we identified 26 core human proteins involved in PPI between HIV-1 and host that have great potential for HIV therapy. In addition, 280 chemicals that interact with three HIV drugs targeting human proteins can also interact with these 26 core proteins. All these indicate that our method as presented in this paper is quite promising. The method may become a useful tool, or at least play a complementary role to the existing method, for identifying novel anti-HIV drugs.


Subject(s)
Algorithms , Anti-HIV Agents/chemistry , HIV Infections/drug therapy , HIV-1/drug effects , Protein Interaction Mapping , Protein Interaction Maps , 1-Deoxynojirimycin/analogs & derivatives , 1-Deoxynojirimycin/chemistry , 1-Deoxynojirimycin/pharmacology , Anti-HIV Agents/pharmacology , CCR5 Receptor Antagonists , Computer Simulation , Cyclohexanes/chemistry , Cyclohexanes/pharmacology , Databases, Chemical , Didanosine/chemistry , Didanosine/pharmacology , Drug Design , Drug Discovery , HIV Infections/virology , HIV-1/genetics , HIV-1/metabolism , Host-Pathogen Interactions , Humans , Maraviroc , Models, Molecular , Receptors, CCR5/chemistry , Receptors, CCR5/metabolism , Triazoles/chemistry , Triazoles/pharmacology
18.
PLoS One ; 7(8): e43927, 2012.
Article in English | MEDLINE | ID: mdl-22937126

ABSTRACT

Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. It was also shown via site-specific feature analysis that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction.


Subject(s)
Computational Biology/methods , Proteins/metabolism , Algorithms , Protein Conformation
19.
PLoS One ; 7(9): e45854, 2012.
Article in English | MEDLINE | ID: mdl-23029276

ABSTRACT

Proteinases play critical roles in both intra- and extracellular processes by binding and cleaving their protein substrates. The cleavage can either be non-specific as part of degradation during protein catabolism or highly specific as part of proteolytic cascades and signal transduction events. Identification of these targets is extremely challenging. Current computational approaches for predicting cleavage sites are very limited since they mainly represent the amino acid sequences as patterns or frequency matrices. In this work, we developed a novel predictor based on the Random Forest (RF) algorithm using the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). The features of physicochemical/biochemical properties, sequence conservation, residual disorder, amino acid occurrence frequency, secondary structure and solvent accessibility were utilized to represent the peptides concerned. Here, we compared our method with existing tools for predicting possible cleavage sites in candidate substrates. It is shown that our method makes much more reliable predictions in terms of the overall prediction accuracy. In addition, this predictor allows the use of a wide range of proteinases.


Subject(s)
Models, Molecular , Proteolysis , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Decision Trees , Molecular Sequence Data , Peptide Hydrolases/chemistry , Proteasome Endopeptidase Complex/chemistry , Sequence Analysis, Protein
20.
J Proteomics ; 75(5): 1654-65, 2012 Feb 16.
Article in English | MEDLINE | ID: mdl-22178444

ABSTRACT

S-nitrosylation (SNO) is one of the most important and universal post-translational modifications (PTMs), which regulates various cellular functions and signaling events. Identification of the exact S-nitrosylation sites in proteins may facilitate the understanding of the molecular mechanisms and biological function of S-nitrosylation. Unfortunately, traditional experimental approaches used for detecting S-nitrosylation sites are often laborious and time-consuming. However, computational methods could overcome this drawback. In this work, we developed a novel predictor based on the nearest neighbor algorithm (NNA) with the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). The features of physicochemical/biochemical properties, sequence conservation, residual disorder, amino acid occurrence frequency, secondary structure and solvent accessibility were utilized to represent the peptides concerned. Feature analysis showed that the features except residual disorder affected identification of the S-nitrosylation sites. It was also shown via the site-specific feature analysis that the features of sites away from the central cysteine might contribute to the S-nitrosylation site determination in a subtle manner. It is anticipated that our prediction method may become a useful tool for identifying the protein S-nitrosylation sites and that the feature analysis described in this paper may provide useful insights for in-depth investigation into the mechanism of S-nitrosylation.


Subject(s)
Algorithms , Protein Processing, Post-Translational , Proteins/chemistry , Sequence Analysis, Protein/methods , Animals , Humans , Protein Structure, Secondary , Proteins/genetics , Proteins/metabolism