ABSTRACT
SUMOylation is a protein post-translational modification that plays an essential role in cellular functions. Numerous methods based on conventional machine learning algorithms have been proposed for predicting SUMO sites. These methods have shown excellent predictive performance, but there is still room for improvement. In this study, we constructed a novel deep neural network, the Residual Pyramid Network (RsFPN), and developed an ensemble deep learning predictor called iSUMO-RsFPN. First, three feature extraction methods were employed to extract features from the samples. Next, a weak classifier based on RsFPN was trained for each feature type. Finally, the weak classifiers were integrated to construct the final classifier. The predictor was systematically tested on an independent test dataset, where the results demonstrated a significant improvement over existing state-of-the-art predictors. The code of iSUMO-RsFPN is freely available at https://github.com/454170054/iSUMO-RsFPN.
Subject(s)
Lysine , Sumoylation , Neural Networks, Computer , Machine Learning , Algorithms
ABSTRACT
DNA methylation is an important epigenetic modification that occurs in the early stages of tumor formation, so finding the relationship between DNA methylation and cancer is of great significance. This paper proposes a novel model, iCancer-Pred, to identify cancer and further classify its type. DNA methylation datasets for 7 cancer types were collected from The Cancer Genome Atlas (TCGA). The coefficient of variation is first used to reduce the number of features, and then the elastic net is applied to select important features. Finally, a fully connected neural network is constructed with the selected features. In predicting the seven cancer types, iCancer-Pred achieved an overall accuracy of over 97% with 5-fold cross-validation. For convenience, a user-friendly web server is available at http://bioinfo.jcu.edu.cn/cancer or http://121.36.221.79/cancer/. The source code is freely available for download at https://github.com/Huerhu/iCancer-Pred.
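The first filtering step described above can be sketched as follows. This is only a minimal illustration of a coefficient-of-variation filter, not the authors' code: the function names, the threshold value, and the choice to keep features whose coefficient of variation is at or above the threshold are all assumptions, since the abstract does not give these details. The later feature selection and neural network steps are omitted.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        return 0.0  # avoid division by zero; treat as uninformative
    return pstdev(values) / abs(m)


def select_by_cv(samples, threshold):
    """Return the indices of features whose CV is >= threshold.

    samples: list of rows, one row of methylation values per sample.
    threshold: hypothetical cut-off, not taken from the paper.
    """
    n_features = len(samples[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        if coefficient_of_variation(column) >= threshold:
            keep.append(j)
    return keep


# Toy data: feature 0 is constant across samples, feature 1 varies.
toy = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.5]]
print(select_by_cv(toy, threshold=0.1))
```

On the toy data the constant feature is dropped and only the varying one is retained, which is the intent of a variability-based pre-filter.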
Subject(s)
DNA Methylation , Neoplasms , Humans , Epigenomics , Neoplasms/genetics
ABSTRACT
A norm in modern medicine is to prescribe polypharmacy to treat disease. The core concern with the co-administration of drugs is that it may produce adverse drug-drug interaction (DDI), which can cause unexpected bodily injury. Therefore, it is essential to identify potential DDI. Most existing methods in silico only judge whether two drugs interact, ignoring the importance of interaction events to study the mechanism implied in combination drugs. In this work, we propose a deep learning framework named MSEDDI that comprehensively considers multi-scale embedding representations of the drug for predicting drug-drug interaction events. In MSEDDI, we design three-channel networks to process biomedical network-based knowledge graph embedding, SMILES sequence-based notation embedding, and molecular graph-based chemical structure embedding, respectively. Finally, we fuse three heterogeneous features from channel outputs through a self-attention mechanism and feed them to the linear layer predictor. In the experimental section, we evaluate the performance of all methods on two different prediction tasks on two datasets. The results show that MSEDDI outperforms other state-of-the-art baselines. Moreover, we also reveal the stable performance of our model in a broader sample set via case studies.
Subject(s)
Knowledge Bases , Polypharmacy , Humans , Drug Interactions
ABSTRACT
BACKGROUND: Lysine succinylation is a newly discovered protein post-translational modification. Predicting succinylation sites helps in investigating treatments for metabolic diseases. However, because biological experimental approaches are costly and inefficient, it is necessary to develop efficient computational approaches. RESULTS: In this paper, we propose a novel predictor based on ensemble dense blocks and an attention module, called pSuc-EDBAM, which adopts one-hot encoding to derive the feature maps of protein sequences and generates low-level feature maps through a 1-D CNN. The ensemble dense blocks are then used to capture feature information at different levels during feature learning. We also introduce an attention module to evaluate the importance of different features. The experimental results show that Acc reaches 74.25% and MCC reaches 0.2927 on the testing dataset, which suggests that pSuc-EDBAM outperforms the existing predictors. CONCLUSIONS: The results of ten-fold cross-validation on the training dataset and of the independent test on the testing dataset show that pSuc-EDBAM outperforms the existing succinylation site predictors and can predict potential succinylation sites effectively. pSuc-EDBAM is feasible and yields credible predictions, which may also provide a valuable reference for other related research. For the convenience of experimental scientists, a user-friendly web server has been established ( http://bioinfo.wugenqiang.top/pSuc-EDBAM/ ), through which the desired results can be easily obtained.
Subject(s)
Lysine , Succinic Acid , Lysine/metabolism , Succinic Acid/metabolism , Proteins/metabolism , Amino Acid Sequence , Protein Processing, Post-Translational , Attention , Computational Biology/methods
ABSTRACT
BACKGROUND: In new drug discovery, traditional wet-lab experiments take a long time. Predicting drug-target interaction (DTI) in silico can greatly narrow the search space of candidate medications. A well-designed algorithmic model may be more effective at revealing the potential connections between drugs and targets in a bioinformatics network composed of drugs, proteins and other related data. RESULTS: In this work, we developed a heterogeneous graph neural network model, named HGDTI, which includes a learning phase for network node embedding and a training phase for DTI classification. The method first obtains the molecular fingerprint information of drugs and the pseudo amino acid composition information of proteins, then extracts the initial features of nodes through Bi-LSTM, and uses an attention mechanism to aggregate heterogeneous neighbors. In several comparative experiments, the overall performance of HGDTI is significantly better than that of other state-of-the-art DTI prediction models, and negative sampling is employed to further improve the predictive power of the model. In addition, we demonstrated the robustness of HGDTI through heterogeneous network content reduction tests, and the rationality of HGDTI through other comparative experiments. These results indicate that HGDTI can utilize heterogeneous information to capture the embeddings of drugs and targets and assist drug development. CONCLUSIONS: HGDTI, which is based on a heterogeneous graph neural network model, can utilize heterogeneous information to capture the embeddings of drugs and targets and assist drug development. For the convenience of researchers, a user-friendly web server has been established at http://bioinfo.jcu.edu.cn/hgdti .
Subject(s)
Computational Biology , Neural Networks, Computer , Algorithms , Drug Development/methods , Drug Interactions , Proteins/metabolism
ABSTRACT
In drug development, unexpected side effects are the main reason for the failure of candidate drug trials. Discovering potential side effects of drugs in silico can improve the success rate of drug screening. However, most previous works extracted and utilized a representation of drugs from a single perspective. These methods considered only the topological information of drugs in the biological entity network, or combined the association information (e.g. a knowledge graph, KG) between drugs and other biomarkers, or used only the chemical structure or sequence information of drugs. Consequently, to jointly learn drug features from both the macroscopic biological network and the microscopic drug molecules, we propose a hybrid embedding graph neural network model named idse-HE, which integrates a graph embedding module and a node embedding module. idse-HE can fuse the drug chemical structure information, the drug substructure sequence information and the drug network topology information. Our model treats the final representations of drugs and side effects as two implicit factors to reconstruct the original matrix and predict the potential side effects of drugs. In the robustness experiment, idse-HE shows stable performance on all indicators. We reproduce the baselines under the same conditions, and the experimental results indicate that idse-HE is superior to other advanced methods. Finally, we also collect evidence to confirm several real drug-side effect pairs in the predicted results that were previously regarded as negative samples. For more detailed information, researchers can access the user-friendly web server of idse-HE at http://bioinfo.jcu.edu.cn/idse-HE. On this server, users can obtain the original data and source code and are guided through reproducing the model results.
Subject(s)
Drug-Related Side Effects and Adverse Reactions , Neural Networks, Computer , Drug Development , Humans , Knowledge , Software
ABSTRACT
N6,2'-O-dimethyladenosine (m6Am) is a post-transcriptional modification that may be associated with regulatory roles in the control of cellular functions. Therefore, it is crucial to accurately identify transcriptome-wide m6Am sites to understand underlying m6Am-dependent mRNA regulation mechanisms and biological functions. Here, we used three sequence-based feature-encoding schemes, including one-hot, nucleotide chemical property (NCP), and nucleotide density (ND), to represent RNA sequence samples. Additionally, we proposed an ensemble deep learning framework, named DLm6Am, to identify m6Am sites. DLm6Am consists of three similar base classifiers, each of which contains a multi-head attention module, an embedding module with two parallel deep learning sub-modules, a convolutional neural network (CNN) and a Bi-directional long short-term memory (BiLSTM), and a prediction module. To demonstrate the superior performance of our model's architecture, we compared multiple model frameworks with our method by analyzing the training data and independent testing data. Additionally, we compared our model with the existing state-of-the-art computational methods, m6AmPred and MultiRM. The accuracy (ACC) for the DLm6Am model was improved by 6.45% and 8.42% compared to that of m6AmPred and MultiRM on independent testing data, respectively, while the area under receiver operating characteristic curve (AUROC) for the DLm6Am model was increased by 4.28% and 5.75%, respectively. All the results indicate that DLm6Am achieved the best prediction performance in terms of ACC, Matthews correlation coefficient (MCC), AUROC, and the area under precision and recall curves (AUPR). To further assess the generalization performance of our proposed model, we implemented chromosome-level leave-out cross-validation, and found that the obtained AUROC values were greater than 0.83, indicating that our proposed method is robust and can accurately predict m6Am sites.
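The three sequence encodings named above (one-hot, NCP and ND) can be illustrated with a small sketch. It is not taken from DLm6Am: the NCP table follows the commonly used ring-structure / functional-group / hydrogen-bond convention, nucleotide density is taken as the running frequency of a base up to its position, and the function names and the simple per-position concatenation are assumptions made for illustration.

```python
# One-hot code per base (order A, C, G, U).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}

# Nucleotide chemical property, commonly given as
# (ring structure, functional group, hydrogen bond strength).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def nucleotide_density(seq):
    """Running frequency of each base within the prefix ending at it."""
    counts = {}
    density = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        density.append(counts[base] / i)
    return density


def encode(seq):
    """Concatenate one-hot, NCP and ND into one 8-value tuple per position."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    nd = nucleotide_density(seq)
    return [ONE_HOT[b] + NCP[b] + (d,) for b, d in zip(seq, nd)]


for row in encode("ACA"):
    print(row)
```

Bases outside A, C, G and U (for example N) raise a `KeyError` in this sketch; a real pipeline would need an explicit policy for ambiguous positions.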
Subject(s)
Algorithms , Deep Learning , Base Sequence , Nucleotides , RNA, Messenger/genetics
ABSTRACT
N6-methyladenosine (m6A) is the most abundant modification within eukaryotic messenger RNA and plays an essential regulatory role in the control of cellular functions and gene expression. However, it remains an outstanding challenge to detect mRNA m6A transcriptome-wide at base resolution via experimental approaches, which are generally time-consuming and expensive. Developing computational methods is a good strategy for accurate in silico detection of m6A modification sites from the large amount of RNA sequence data. Unfortunately, existing computational models usually predict m6A sites in a single species only, without considering the tissue level, and most of them are built on low-confidence data generated by m6A antibody immunoprecipitation (IP)-based sequencing, which restricts the reliability and generalizability of the proposed models. Here, we review recent advances in computational prediction of m6A sites and construct a new computational approach named im6APred that uses ensemble deep learning to accurately identify m6A sites based on high-confidence data in multiple mammalian tissues. im6APred builds upon a comprehensive evaluation of multiple classification methods, including four traditional classification algorithms, three deep learning methods, and their ensembles. The optimal base-classifier combinations are then chosen by a five-fold cross-validation test to obtain an effective stacked model. im6APred achieves an area under the receiver operating characteristic curve (AUROC) in the range of 0.82-0.91 on independent tests, indicating that the model is able to learn general methylation rules on RNA bases and generalize to transcriptome-wide m6A identification. Moreover, AUROCs in the range of 0.77-0.96 were achieved using cross-species/tissues validation on the benchmark dataset, demonstrating differences in predictive performance at the tissue level and the need for constructing tissue-specific models for m6A site prediction.
Subject(s)
Deep Learning , Animals , Reproducibility of Results , RNA/metabolism , Adenosine/genetics , Adenosine/metabolism , Mammals/metabolism , Computational Biology/methods
ABSTRACT
BACKGROUND: G protein-coupled receptors (GPCRs) mediate a variety of important physiological functions, are closely related to many diseases, and constitute the most important target family of modern drugs. Therefore, GPCR analysis and GPCR ligand screening are hotspots of new drug development. Accurately identifying GPCR-drug interactions is one of the key steps in designing GPCR-targeted drugs. However, it is prohibitively expensive to experimentally ascertain the interaction of GPCR-drug pairs on a large scale. Therefore, it is of great significance to predict the interaction of GPCR-drug pairs directly from the molecular sequences. With the accumulation of known GPCR-drug interaction data, it has become feasible to develop sequence-based machine learning models for predicting query GPCR-drug pairs. RESULTS: In this paper, a new sequence-based method is proposed to identify GPCR-drug interactions. For GPCRs, we use a novel bag-of-words (BoW) model to extract sequence features, which can extract more pattern information from low order to high order while limiting the dimension of the feature space. For drug molecules, we use the discrete Fourier transform (DFT) to extract higher-order pattern information from the original molecular fingerprints. The feature vectors of the two kinds of molecules are concatenated and fed into a simple prediction engine, the distance-weighted K-nearest-neighbor (DWKNN) classifier. This basic method can easily be enhanced through ensemble learning. Tests on recently constructed GPCR-drug interaction datasets show that the proposed methods generalize better than the existing sequence-based machine learning methods, including an unconventional method whose prediction performance was further improved by a post-processing procedure (PPP).
CONCLUSIONS: The proposed methods are effective for GPCR-drug interaction prediction, and may also be potential methods for other target-drug interaction prediction, or protein-protein interaction prediction. In addition, the new proposed feature extraction method for GPCR sequences is the modified version of the traditional BoW model and may be useful to solve problems of protein classification or attribute prediction. The source code of the proposed methods is freely available for academic research at https://github.com/wp3751/GPCR-Drug-Interaction.
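Two of the building blocks mentioned in this abstract, the DFT of a molecular fingerprint and a distance-weighted K-nearest-neighbor vote, can be sketched in a few lines. This is a hedged illustration rather than the published implementation: the use of DFT magnitudes as features, the inverse-distance weighting, the Euclidean metric, and all function names are assumptions, and the bag-of-words GPCR features are not shown.

```python
import cmath
import math


def dft_magnitudes(bits):
    """Magnitude spectrum of a binary fingerprint via a direct DFT."""
    n = len(bits)
    return [
        abs(sum(b * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, b in enumerate(bits)))
        for k in range(n)
    ]


def dwknn_predict(train, query, k=3, eps=1e-9):
    """Distance-weighted KNN: each neighbor votes with weight 1/(d + eps).

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = {}
    for d, label in nearest:
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + eps)
    return max(votes, key=votes.get)


# A toy 4-bit "fingerprint" with period 2.
print(dft_magnitudes([1, 0, 1, 0]))

# Two far-away class-0 points and one close class-1 point.
toy_train = [([0, 0], 0), ([0, 1], 0), ([5, 5], 1)]
print(dwknn_predict(toy_train, [4, 4], k=3))
```

In the toy example an unweighted majority vote over the three neighbors would pick class 0, whereas the distance weighting lets the single close neighbor decide, which is the motivation for weighting votes by distance.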
Subject(s)
Algorithms , Drug Interactions , Receptors, G-Protein-Coupled/metabolism , Amino Acid Sequence , Area Under Curve , Databases, Protein , Machine Learning , Models, Theoretical , ROC Curve , Software
ABSTRACT
MOTIVATION: Dihydrouridine (D) is a common RNA post-transcriptional modification found in eukaryotes, bacteria and a few archaea. The modification can promote the conformational flexibility of individual nucleotide bases, and its levels are increased in cancerous tissues. Therefore, it is necessary to detect D in RNA to further understand its functional roles. Since wet-lab techniques for this purpose are time-consuming and laborious, there is an urgent need for computational models that identify D modification sites in RNA. RESULTS: We constructed a predictor, called iRNAD, for identifying D modification sites in RNA sequences. In this predictor, RNA samples derived from five species were encoded by nucleotide chemical property and nucleotide density. A support vector machine was used to perform the classification. The final model achieved an overall accuracy of 96.18% with an area under the receiver operating characteristic curve of 0.9839 in the jackknife cross-validation test. Furthermore, we performed a series of validations from several aspects and demonstrated the robustness and reliability of the proposed model. AVAILABILITY AND IMPLEMENTATION: A user-friendly web server called iRNAD is freely accessible at http://lin-group.cn/server/iRNAD, which provides convenience and guidance to users for further study of D modification.
Subject(s)
Support Vector Machine , Base Sequence , Computational Biology , Nucleotides , RNA , Reproducibility of Results
ABSTRACT
The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located upstream of a given gene near the transcription start site (TSS). By binding certain proteins called transcription factors, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With the explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational methods for effectively identifying promoters, because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods have been developed in this regard, most of them are limited to identifying whether a query DNA sequence is a promoter or not. However, based on their distinct strength levels for transcriptional activation and expression, promoters should be divided into two categories: strong and weak. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st layer predicts whether a query DNA sequence sample is a promoter or not, while its 2nd layer predicts the strength of promoters. Rigorous cross-validations show that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in distinguishing promoters from non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.
Subject(s)
Base Sequence , Promoter Regions, Genetic , Sequence Analysis, DNA , Software , Transcription Initiation Site , Transcriptional Activation
ABSTRACT
Investigating the network of protein-protein interactions (PPIs) provides valuable insights into the inner workings of cells. Accordingly, it is crucially important to develop an automated method or high-throughput tool that can efficiently predict PPIs. In this study, a new predictor, called "iPPI-PseAAC(CGR)", was developed by incorporating the information of the "chaos game representation" into PseAAC (Pseudo Amino Acid Composition). The advantage of doing so is that key sequence-order or sequence-pattern information can be incorporated more effectively when treating the protein pair samples. The operation engine used in this predictor is the random forests algorithm. Cross-validations on the widely used benchmark datasets show that the success rates achieved by the proposed predictor are remarkably higher than those of its existing counterparts. For the convenience of most experimental scientists, a user-friendly web server for the new predictor has been established at http://www.jci-bioinfo.cn/iPPI-PseAAC(CGR), by which users can easily get their desired results without the need to go through the detailed mathematics.
Subject(s)
Algorithms , Computational Biology/methods , Protein Interaction Maps , Amino Acid Sequence , Amino Acids , Nonlinear Dynamics
ABSTRACT
BACKGROUND: Ubiquitination, a post-translational modification, is a crucial biological process in cell signaling, apoptosis, and localization. Identification of ubiquitination proteins is of fundamental importance for understanding the molecular mechanisms in biological systems and diseases. Although high-throughput experimental studies using mass spectrometry have identified many ubiquitination proteins and ubiquitination sites, the vast majority of ubiquitination proteins remain undiscovered, even in well-studied model organisms. OBJECTIVE: To reduce experimental costs, computational methods have been introduced to predict ubiquitination sites, but their accuracy is unsatisfactory. Being able to predict whether a protein can be ubiquitinated would help in predicting ubiquitination sites. However, all computational methods so far can only predict ubiquitination sites. METHODS: In this study, the first computational method for predicting ubiquitination proteins without relying on ubiquitination site prediction has been developed. The method extracts features from sequence conservation information through a grey system model, as well as from functional domain annotation and subcellular localization. RESULTS: Together with feature analysis and application of the relief feature selection algorithm, 5-fold cross-validation on three datasets achieved a high accuracy of 90.13%, with a Matthews correlation coefficient of 80.34%. The predictions on an independent test dataset achieved an accuracy of 87.71% and a Matthews correlation coefficient of 75.43%, better than the predictions from the best ubiquitination site prediction tool available. CONCLUSION: Our study may guide experimental design and provide useful insights for studying the mechanisms and modulation of ubiquitination pathways. The code is available at: https://github.com/Chunhuixu/UBIPredic_QWRCHX.
ABSTRACT
Lysine crotonylation (Kcr) is an evolution-conserved histone posttranslational modification (PTM), occurring in both human somatic and mouse male germ cell genomes. It is important for male germ cell differentiation. Information of Kcr sites in proteins is very useful for both basic research and drug development. But it is time-consuming and expensive to determine them by experiments alone. Here, we report a novel predictor called iKcr-PseEns that is established by incorporating five tiers of amino acid pairwise couplings into the general pseudo amino acid composition. It has been observed via rigorous cross-validations that the new predictor's sensitivity (Sn), specificity (Sp), accuracy (Acc), and stability (MCC) are 90.53%, 95.27%, 94.49%, and 0.826, respectively. For the convenience of most experimental scientists, a user-friendly web-server for iKcr-PseEns has been established at http://www.jci-bioinfo.cn/iKcr-PseEns, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.
Subject(s)
Histones/metabolism , Protein Processing, Post-Translational , Sequence Analysis, Protein/methods , Software , Crotonates/chemistry , Crotonates/metabolism , Histones/chemistry , Humans , Lysine/chemistry , Lysine/metabolism
ABSTRACT
MOTIVATION: Computational methods for phosphorylation site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning as the cutting-edge machine learning method has the ability to automatically discover complex representations of phosphorylation patterns from the raw sequences, and hence it provides a powerful tool for improvement of phosphorylation site prediction. RESULTS: We present MusiteDeep, the first deep-learning framework for predicting general and kinase-specific phosphorylation sites. MusiteDeep takes raw sequence data as input and uses convolutional neural networks with a novel two-dimensional attention mechanism. It achieves over a 50% relative improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtains competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. AVAILABILITY AND IMPLEMENTATION: MusiteDeep is provided as an open-source tool available at https://github.com/duolinwang/MusiteDeep. CONTACT: xudong@missouri.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Machine Learning , Phosphoproteins/chemistry , Sequence Analysis, Protein/methods , Software , Neural Networks, Computer , Phosphorylation , Protein Kinases/metabolism , Proteins/metabolism
ABSTRACT
MOTIVATION: Post-translational modification, abbreviated as PTM, refers to the change of the amino acid side chains of a protein after its biosynthesis. Owing to its significance for the in-depth understanding of various biological processes and for developing effective drugs, prediction of PTM sites in proteins has become a hot topic in bioinformatics. Although many computational methods have been established to identify various single-label PTM types and their occurrence sites in proteins, no method has ever been developed for multi-label PTM types. As one of the most frequently observed PTMs, the K-PTM, namely the modification occurring at lysine (K), can usually accommodate many different types, such as 'acetylation', 'crotonylation', 'methylation' and 'succinylation'. Now we are facing an interesting challenge: given an uncharacterized protein sequence containing many K residues, which ones can accommodate two or more types of PTM, which ones only one, and which ones none? RESULTS: To address this problem, a multi-label predictor called iPTM-mLys has been developed. It represents the first multi-label PTM predictor ever established. The novel predictor incorporates the sequence-coupled effects into the general PseAAC and fuses an array of basic random forest classifiers into an ensemble system. Rigorous cross-validations via a set of multi-label metrics indicate that the first multi-label PTM predictor is very promising and encouraging. AVAILABILITY AND IMPLEMENTATION: For the convenience of most experimental scientists, a user-friendly web-server for iPTM-mLys has been established at http://www.jci-bioinfo.cn/iPTM-mLys, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. CONTACT: wqiu@gordonlifescience.org, xxiao@gordonlifescience.org, kcchou@gordonlifescience.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Lysine , Protein Processing, Post-Translational , Amino Acids , Animals , Humans , Proteins/metabolism
ABSTRACT
Just like PTM or PTLM (post-translational modification) in proteins, PTCM (post-transcriptional modification) in RNA plays very important roles in biological processes. Occurring at adenine (A) with the genetic code motif (GAC), N(6)-methyladenosine (m(6)A) is one of the most common and abundant PTCMs in RNA found in viruses and most eukaryotes. Given an uncharacterized RNA sequence containing many GAC motifs, which of them can be methylated, and which cannot? It is important for both basic research and drug development to address this problem. Particularly with the avalanche of RNA sequences generated in the postgenomic age, there is a strong demand for computational methods that can identify N(6)-methyladenosine sites in RNA in a timely manner. Here we propose a new predictor called pRNAm-PC, in which RNA sequence samples are expressed by a novel mode of pseudo dinucleotide composition (PseDNC) whose components were derived from a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations. A rigorous jackknife test showed that, in comparison with the existing predictor for the same purpose, pRNAm-PC achieved remarkably higher success rates in both overall accuracy and stability, indicating that the new predictor will become a useful high-throughput tool for identifying methylation sites in RNA, and that the novel approach can also be used to study many other RNA-related problems and conduct genome analysis. A user-friendly Web server for pRNAm-PC has been established at http://www.jci-bioinfo.cn/pRNAm-PC, by which users can easily get their desired results without needing to go through the mathematical details.
Subject(s)
Adenosine/analogs & derivatives , RNA Processing, Post-Transcriptional , RNA/chemistry , Adenosine/analysis , Adenosine/genetics , Algorithms , Base Sequence , Models, Genetic , RNA/genetics , RNA, Fungal/chemistry , RNA, Fungal/genetics , Saccharomyces cerevisiae/chemistry , Saccharomyces cerevisiae/genetics , Software , Support Vector Machine
ABSTRACT
Enzymes play pivotal roles in most biological reactions. The catalytic residues of an enzyme are defined as the amino acids that are directly involved in chemical catalysis; knowledge of these residues is important for understanding enzyme function. Given an enzyme, which residues are catalytic sites, and which are not? This is the first important problem for an in-depth understanding of the catalytic mechanism and for drug development. With the explosion of protein sequences generated during the post-genomic era, it is highly desirable for both basic research and drug design to develop fast and reliable methods for identifying the catalytic sites of enzymes from their sequences. To address this problem, we propose a new predictor, called iCataly-PseAAC. In the prediction system, the peptide sample was formulated with sequence evolution information via the grey system model GM(2,1). The rigorous jackknife test and the independent dataset test showed that iCataly-PseAAC was superior to existing predictors even though it uses only sequence information. As a user-friendly web server, iCataly-PseAAC is freely accessible at http://www.jci-bioinfo.cn/iCataly-PseAAC. A step-by-step guide has been provided on how to use the web server to get the desired results for the convenience of most experimental scientists.
Subject(s)
Catalytic Domain , Computational Biology/methods , Enzymes/chemistry , Software , Algorithms , Amino Acid Sequence , Amino Acids , Enzymes/genetics , Enzymes/metabolism , Evolution, Molecular , ROC Curve , Reproducibility of Results , Web Browser
ABSTRACT
Predominantly occurring on cytosine, DNA methylation is a process by which cells can modify their DNAs to change the expression of gene products. It plays very important roles in life development but also in forming nearly all types of cancer. Therefore, knowledge of DNA methylation sites is significant for both basic research and drug development. Given an uncharacterized DNA sequence containing many cytosine residues, which one can be methylated and which one cannot? With the avalanche of DNA sequences generated during the postgenomic age, it is highly desired to develop computational methods for accurately identifying the methylation sites in DNA. Using the trinucleotide composition, pseudo amino acid components, and a dataset-optimizing technique, we have developed a new predictor called "iDNA-Methyl" that has achieved remarkably higher success rates in identifying the DNA methylation sites than the existing predictors. A user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/iDNA-Methyl, where users can easily get their desired results. We anticipate that the web-server predictor will become a very useful high-throughput tool for basic research and drug development and that the novel approach and technique can also be used to investigate many other DNA-related problems and genome analysis.
Subject(s)
Computational Biology/methods , DNA Methylation/genetics , Nucleotides/metabolism , Software , Amino Acids/metabolism , Base Sequence , Codon/genetics , Databases, Genetic , Humans , Internet , ROC Curve , Reproducibility of Results , Support Vector Machine
ABSTRACT
Meiosis and recombination are the two opposite aspects that coexist in a DNA system. As a driving force for evolution by generating natural genetic variations, meiotic recombination plays a very important role in the formation of eggs and sperm. Interestingly, recombination does not occur randomly across a genome, but with higher probability in some genomic regions called "hotspots" and with lower probability in so-called "coldspots". With the ever-increasing amount of genome sequence data in the postgenomic era, computational methods for effectively identifying hotspots and coldspots are urgently needed, as they can provide timely insights into the mechanism of meiotic recombination and the process of genome evolution. To meet this need, we developed a new predictor called "iRSpot-TNCPseAAC", in which a DNA sample was formulated by combining its trinucleotide composition (TNC) and the pseudo amino acid components (PseAAC) of the protein translated from the DNA sample according to its genetic codes. The former was used to incorporate its local or short-range sequence order information, while the latter was used for its global or long-range sequence order information. Compared with the best existing predictor in this area, iRSpot-TNCPseAAC achieved higher accuracy, Matthews correlation coefficient, and sensitivity, indicating that the new predictor may become a useful tool for identifying recombination hotspots and coldspots, or at least a complementary tool to the existing methods. It has not escaped our notice that the aforementioned novel approach of incorporating DNA sequence order information into a discrete model may also be used for many other genome analysis problems. The web-server for iRSpot-TNCPseAAC is available at http://www.jci-bioinfo.cn/iRSpot-TNCPseAAC. 
Furthermore, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the current web server to obtain their desired result without the need to follow the complicated mathematical equations.
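The trinucleotide composition (TNC) component mentioned in this abstract can be sketched as a 64-dimensional frequency vector. This is an assumption-laden illustration, not the iRSpot-TNCPseAAC code: it uses overlapping windows, skips windows containing characters other than A, C, G and T, and orders the components lexicographically; the PseAAC part of the feature vector is not shown.

```python
from itertools import product

# All 64 trinucleotides in lexicographic order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(dna):
    """Normalized frequencies of overlapping 3-mers in a DNA sequence."""
    dna = dna.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(dna) - 2):
        kmer = dna[i:i + 3]
        if kmer in counts:  # ignore windows with ambiguous bases
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]


vec = trinucleotide_composition("ACGTACG")
print(len(vec), vec[TRINUCLEOTIDES.index("ACG")])
```

Because each component only counts three adjacent bases, this vector captures local sequence order, which matches the abstract's description of TNC as the short-range part of the representation.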