1.
Comput Biol Med ; 172: 108227, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38460308

ABSTRACT

Accurately predicting protein-ATP binding residues is critical for protein function annotation and drug discovery. Computational methods dedicated to the prediction of binding residues based on protein sequence information have exhibited notable advancements in predictive accuracy. Nevertheless, these methods continue to grapple with several formidable challenges, including limited means of extracting more discriminative features and inadequate algorithms for integrating protein and residue information. To address these problems, we propose ATP-Deep, a novel predictor of protein-ATP binding residues. ATP-Deep harnesses the capabilities of unsupervised pre-trained language models and incorporates domain-specific evolutionary context information from homologous sequences. It further refines the embedding at the residue level through integration with corresponding protein-level information and employs a context-based co-attention mechanism to fuse multiple sources of features. The performance evaluation results on the benchmark datasets reveal that ATP-Deep achieves AUC values of 0.954 and 0.951, respectively, surpassing the performance of the state-of-the-art model. These findings underscore the effectiveness of assimilating protein-level information and deploying a context-based co-attention mechanism to bolster the prediction performance of protein-ATP binding residues.


Subjects
Algorithms , Proteins , Protein Binding , Proteins/chemistry , Amino Acid Sequence , Adenosine Triphosphate
2.
Bioinformatics ; 39(12)2023 12 01.
Article in English | MEDLINE | ID: mdl-37995291

ABSTRACT

MOTIVATION: RNA N6-methyladenosine (m6A) in Homo sapiens plays vital roles in a variety of biological functions. Precise identification of m6A modifications is thus essential to the elucidation of their biological functions and underlying molecular-level mechanisms. Currently available high-throughput single-nucleotide-resolution m6A modification data have considerably accelerated the identification of RNA modification sites through the development of data-driven computational methods. Nevertheless, existing methods cover few single-nucleotide-resolution cell lines and offer poor model interpretability, which limits their applicability. RESULTS: In this study, we present CLSM6A, comprising a set of deep learning-based models designed for predicting single-nucleotide-resolution m6A RNA modification sites across eight different cell lines and three tissues. Extensive benchmarking experiments are conducted on well-curated datasets and, accordingly, CLSM6A outperforms current state-of-the-art methods. Furthermore, CLSM6A is capable of interpreting the prediction decision-making process by excavating critical motifs activated by filters and pinpointing positions of high concern in both forward and backward propagations. CLSM6A exhibits better portability on similar cross-cell line/tissue datasets, reveals a strong association between highly activated motifs and high-impact motifs, and demonstrates complementary attributes of different interpretation strategies. AVAILABILITY AND IMPLEMENTATION: The webserver is available at http://csbio.njust.edu.cn/bioinf/clsm6a. The datasets and code are available at https://github.com/zhangying-njust/CLSM6A/.


Subjects
Nucleotides , RNA , Humans , RNA/metabolism , Adenosine/genetics , Adenosine/metabolism , RNA Sequence Analysis/methods
3.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35907779

ABSTRACT

Circular RNA (circRNA) is closely involved in the physiological and pathological processes of many diseases. Discovering the associations between circRNAs and diseases is of great significance. Because verifying circRNA-disease associations through wet-lab experiments is costly, computational approaches for predicting these associations have become a promising research direction. In this paper, we propose a method, MDGF-MCEC, based on a multi-view dual-attention graph convolution network (GCN) with cooperative ensemble learning to predict circRNA-disease associations. First, MDGF-MCEC constructs two disease relation graphs and two circRNA relation graphs based on different similarities. Then, the relation graphs are fed into a multi-view GCN for representation learning. To learn highly discriminative features, a dual-attention mechanism is introduced to adjust the contribution weights of different features at both the channel level and the spatial level. Based on the learned embedding features of diseases and circRNAs, nine different feature combinations between diseases and circRNAs are treated as new multi-view data. Finally, we construct a multi-view cooperative ensemble classifier to predict the associations between circRNAs and diseases. Experiments conducted on the CircR2Disease database demonstrate that the proposed MDGF-MCEC model achieves a high area under the curve of 0.9744 and outperforms the state-of-the-art methods. Promising results are also obtained from experiments on the circ2Disease and circRNADisease databases. Furthermore, the predicted associated circRNAs for hepatocellular carcinoma and gastric cancer are supported by the literature. The code and dataset of this study are available at https://github.com/ABard0/MDGF-MCEC.


Subjects
Circular RNA , Stomach Neoplasms , Humans , Intercellular Signaling Peptides and Proteins , Machine Learning , Stomach Neoplasms/genetics
4.
J Chem Inf Model ; 62(7): 1794-1807, 2022 04 11.
Article in English | MEDLINE | ID: mdl-35353532

ABSTRACT

Characteristic gene selection and tumor classification of gene expression data play major roles in genomic research. Because gene expression data combine a small sample size with high dimensionality, it is common practice to perform dimensionality reduction before applying machine learning-based methods to analyze the expression data. In this context, classical principal component analysis (PCA) and its improved versions have been widely used. Recently, methods based on supervised discriminative sparse PCA have been developed to improve the performance of data dimensionality reduction. However, such methods still have limitations: most of them do not account for robustness to outliers and noise, label information, sparsity, and intrinsic geometrical structure together in a single objective function. To address this drawback, in this study, we propose a novel PCA-based method, known as the robust Laplacian supervised discriminative sparse PCA, termed RLSDSPCA, which enforces the L2,1 norm on the error function and incorporates the graph Laplacian into supervised discriminative sparse PCA. To evaluate the efficacy of the proposed RLSDSPCA, we applied it to characteristic gene selection and tumor classification using gene expression data. The results demonstrate that the proposed RLSDSPCA method, when used in combination with other related methods, can effectively identify new pathogenic genes associated with diseases. In addition, RLSDSPCA also achieved the best performance among the state-of-the-art methods on tumor classification in terms of major performance metrics. The code and data sets used in the study are freely available at http://csbio.njust.edu.cn/bioinf/rlsdspca/.
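
As a rough illustration of the L2,1 norm named in this abstract, the sketch below computes it as the sum of the Euclidean norms of a matrix's rows. The function name l21_norm and the toy matrix are assumptions made for illustration, and some papers define the norm over columns instead; this is not the RLSDSPCA implementation.

```python
import math
from typing import Sequence


def l21_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Return the sum of the Euclidean (L2) norms of the rows of `matrix`."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


# Hypothetical residual matrix: each row is the reconstruction error of one sample.
errors = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
total_error = l21_norm(errors)  # row norms are 5, 0 and 13
```

Because each row enters through its unsquared norm, a single outlying sample contributes linearly rather than quadratically, which is the usual motivation for preferring this norm to a squared Frobenius norm when robustness matters.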


Subjects
Algorithms , Neoplasms , Genomics , Humans , Neoplasms/genetics , Principal Component Analysis
5.
Bioinformatics ; 38(2): 556-558, 2022 01 03.
Article in English | MEDLINE | ID: mdl-34546290

ABSTRACT

MOTIVATION: Accurately identifying protein-ATP binding poses is highly valuable for both basic structural biology and drug discovery. Although many docking methods have been designed, most of them require a user-defined binding site and struggle to achieve a high-quality protein-ATP docking result. It is critical to develop a protein-ATP-specific blind docking method that needs no user-defined binding sites. RESULTS: Here, we present ATPdock, a template-based method for docking ATP into a protein. For each query protein, if no pocket site is given, ATPdock first identifies its most likely pocket using ATPbind, an ATP-binding site predictor; second, the template pocket that is most similar to the given or identified pocket is searched from the database of pocket-ligand structures using APoc, a pocket structural alignment tool; third, the rough docking pose of ATP (rdATP) is generated using LS-align, a ligand structural alignment tool, to align the initial ATP pose to the template ligand corresponding to the template pocket; finally, a Metropolis Monte Carlo simulation is used to fine-tune the rdATP under the guidance of the AutoDock Vina energy function. Benchmark tests show that ATPdock significantly outperforms other state-of-the-art methods in docking accuracy. AVAILABILITY AND IMPLEMENTATION: https://jun-csbio.github.io/atpdock/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
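
The final refinement step relies on the Metropolis criterion, which can be sketched generically as follows. This is a minimal sketch under stated assumptions: toy_energy is a hypothetical quadratic stand-in for a docking score such as the AutoDock Vina energy function, the pose is reduced to a plain list of numbers, and the step size, temperature and step count are arbitrary. It does not reproduce ATPdock.

```python
import math
import random
from typing import Callable, List, Tuple


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def refine(pose: List[float], energy: Callable[[List[float]], float],
           steps: int = 2000, step_size: float = 0.5,
           temperature: float = 1.0, seed: int = 0) -> Tuple[List[float], float]:
    """Randomly perturb `pose` and return the lowest-energy pose visited."""
    rng = random.Random(seed)
    current, e_current = list(pose), energy(pose)
    best, e_best = list(current), e_current
    for _ in range(steps):
        candidate = [x + rng.uniform(-step_size, step_size) for x in current]
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


def toy_energy(pose: List[float]) -> float:
    """Hypothetical energy with its minimum at all coordinates equal to 1.0."""
    return sum((x - 1.0) ** 2 for x in pose)


rough_pose = [5.0, -3.0, 2.0]  # stands in for a rough docking pose
refined_pose, refined_energy = refine(rough_pose, toy_energy)
```

Occasionally accepting uphill moves is what lets the search leave shallow local minima; tracking the best pose separately means the returned energy is never worse than that of the starting pose.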


Subjects
Adenosine Triphosphate , Proteins , Ligands , Proteins/chemistry , Binding Sites , Protein Binding , Adenosine Triphosphate/metabolism , Molecular Docking Simulation
6.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34953462

ABSTRACT

More than 6000 human diseases have been recorded to be caused by non-synonymous single nucleotide polymorphisms (nsSNPs). Rapid and accurate prediction of pathogenic nsSNPs can improve our understanding of the principle and design of new drugs, which remains an unresolved challenge. In the present work, a new computational approach, termed MSRes-MutP, is proposed based on ResNet blocks with multi-scale kernel size to predict disease-associated nsSNPs. By feeding the serial concatenation of the extracted four types of features, the performance of MSRes-MutP does not obviously improve. To address this, a second model FFMSRes-MutP is developed, which utilizes deep feature fusion strategy and multi-scale 2D-ResNet and 1D-ResNet blocks to extract relevant two-dimensional features and physicochemical properties. FFMSRes-MutP with the concatenated features achieves a better performance than that with individual features. The performance of FFMSRes-MutP is benchmarked on five different datasets. It achieves the Matthew's correlation coefficient (MCC) of 0.593 and 0.618 on the PredictSNP and MMP datasets, which are 0.101 and 0.210 higher than that of the existing best method PredictSNP1. When tested on the HumDiv and HumVar datasets, it achieves MCC of 0.9605 and 0.9507, and area under curve (AUC) of 0.9796 and 0.9748, which are 0.1747 and 0.2669, 0.0853 and 0.1335, respectively, higher than the existing best methods PolyPhen-2 and FATHMM (weighted). In addition, on blind test using a third-party dataset, FFMSRes-MutP performs as the second-best predictor (with MCC and AUC of 0.5215 and 0.7633, respectively), when compared with the other four predictors. Extensive benchmarking experiments demonstrate that FFMSRes-MutP achieves effective feature fusion and can be explored as a useful approach for predicting disease-associated nsSNPs. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/ for academic use.
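
Since the results above are reported mainly as Matthews correlation coefficient (MCC) values, a short sketch of how MCC is computed from a binary confusion matrix may help in reading them. The function name and the example counts are hypothetical, and returning 0.0 when the denominator vanishes is a common convention rather than something stated in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention assumed here: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, not taken from any of the datasets named above.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred over accuracy on imbalanced data.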


Subjects
Deep Learning , Disease/genetics , Single Nucleotide Polymorphism , Algorithms , Area Under Curve , Cellular Microenvironment , Computational Biology/methods , Humans , Mutation , Pharmaceutical Preparations
7.
Comput Struct Biotechnol J ; 19: 6400-6416, 2021.
Article in English | MEDLINE | ID: mdl-34938415

ABSTRACT

Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes including cell signaling, transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered as drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon the protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging the idea learned from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and other three types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546 mutations dataset, MutTMPredictor achieves the accuracy (ACC) of 0.9661 and the Matthew's Correlation Coefficient (MCC) of 0.8950. While on the 67,584 dataset, MutTMPredictor achieves an MCC of 0.7523 and area under curve (AUC) of 0.8746, which are 0.1625 and 0.0801 respectively higher than those of the existing best predictor (fathmm). Besides, MutTMPredictor also outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.

8.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34459479

ABSTRACT

DNA N6-methyladenine is an important type of DNA modification that plays important roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to respectively capture the long-range information and self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.


Subjects
Adenosine/analogs & derivatives , Computational Biology/methods , DNA Methylation , DNA/genetics , Epigenomics/methods , DNA/chemistry , Deep Learning
9.
Anal Biochem ; 626: 114241, 2021 08 01.
Article in English | MEDLINE | ID: mdl-33971164

ABSTRACT

Knowledge of protein-ATP interactions can help with protein functional annotation and drug discovery. Accurately identifying protein-ATP binding residues is an important but challenging step toward this knowledge, especially when only protein sequence information is given. In this study, we propose a novel method, named DeepATPseq, to predict protein-ATP binding residues without using any information about protein three-dimensional structure or sequence-derived structural information. In DeepATPseq, the HHBlits-generated position-specific frequency matrix (PSFM) profile is first employed to extract the feature information of each residue. Then, for each residue, the PSFM-based feature is fed into two prediction models, which are generated separately by a deep convolutional neural network (DCNN) and a support vector machine (SVM). The final ATP-binding probability of the corresponding residue is calculated as the weighted sum of the output values of the DCNN-based and SVM-based models. Experimental results on the independent validation data set demonstrate that DeepATPseq could achieve an accuracy of 77.71%, covering 57.42% of all ATP-binding residues, while achieving a Matthews correlation coefficient value (0.655) that is significantly higher than that of existing sequence-based methods and comparable to that of the state-of-the-art structure-based predictors. Detailed data analysis shows that the major advantage of DeepATPseq lies in the combined use of DCNN and SVM, which helps extract more discriminative information from the PSFM profiles. The online server and standalone package of DeepATPseq are freely available at https://jun-csbio.github.io/DeepATPseq/ for academic use.
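
The weighted-sum fusion of the two model outputs can be sketched as below. The abstract gives neither the weights nor the decision threshold, so the weight w (with the two weights assumed to sum to 1), the threshold and the example probabilities are all hypothetical placeholders.

```python
from typing import List, Sequence


def fuse(p_dcnn: float, p_svm: float, w: float = 0.5) -> float:
    """Weighted sum of two per-residue probabilities; `w` is the DCNN weight."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_dcnn + (1.0 - w) * p_svm


def predict_binding(p_dcnn: Sequence[float], p_svm: Sequence[float],
                    w: float = 0.5, threshold: float = 0.5) -> List[bool]:
    """Label a residue as ATP-binding when its fused score reaches `threshold`."""
    return [fuse(a, b, w) >= threshold for a, b in zip(p_dcnn, p_svm)]


# Hypothetical per-residue outputs of the two models for a two-residue protein.
labels = predict_binding([0.9, 0.2], [0.7, 0.1])
```

In practice both the weight and the threshold would normally be tuned on validation data rather than fixed at 0.5.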


Subjects
Adenosine Triphosphate/metabolism , Algorithms , Computational Biology/methods , Computer Neural Networks , Proteins/metabolism , Humans , Protein Binding , Proteins/chemistry
10.
Anal Biochem ; 612: 113955, 2021 01 01.
Article in English | MEDLINE | ID: mdl-32949607

ABSTRACT

Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells, wherein a phosphate group binds to amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases such as cancer, diabetes, and neurodegenerative diseases are caused by abnormal phosphorylation. Based on its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are expensive and time consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation site prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall ACC, MCC, and AUC. In this paper, we report a novel deep learning-based predictor to overcome these performance hurdles, DeepPPSite, which was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique expediently learns the protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively. Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.


Subjects
Computational Biology/methods , Post-Translational Protein Processing , Protein Sequence Analysis/methods , Amino Acid Sequence , Animals , Protein Databases , Deep Learning , Statistical Models , Phosphorylation , ROC Curve , Serine/chemistry , Serine/metabolism , Threonine/chemistry , Threonine/metabolism , Tyrosine/chemistry , Tyrosine/metabolism
11.
J Cancer Res Ther ; 15(7): 1450-1463, 2019.
Article in English | MEDLINE | ID: mdl-31939422

ABSTRACT

Gastrectomy is considered the gold standard treatment for gastric cancer patients. Currently, there are two minimally invasive surgical methods to choose from, robotic gastrectomy (RG) and laparoscopic gastrectomy (LG). Nevertheless, it is still unclear which of the two is superior. This meta-analysis aimed to investigate the effectiveness and safety of RG and LG for gastric cancer. A systematic literature search of the PubMed, Embase, and Cochrane Library databases was performed up to September 2018 for studies that compared RG and LG in gastric cancer patients. Operative and postoperative outcomes were assessed. The quality of the evidence was rated using the Grading of Recommendations, Assessment, Development and Evaluations approach. Twenty-four English-language studies were analyzed. The meta-analysis revealed that the RG group had a significantly longer operation time, lower intraoperative blood loss, and higher perioperative costs compared to the LG group. However, there were no differences in complications, conversion rate, reoperation rate, mortality, number of lymph nodes harvested, days to first flatus, postoperative hospitalization time, and survival rate between the two groups. RG was shown to be associated with decreased intraoperative blood loss and increased perioperative cost and operation time compared to LG. Several higher-quality original studies and prospective clinical trials are required to confirm the advantages of RG.


Subjects
Gastrectomy , Laparoscopy , Robotic Surgical Procedures , Stomach Neoplasms/surgery , Age Factors , Surgical Blood Loss , Body Mass Index , Flatulence , Gastrectomy/adverse effects , Gastrectomy/methods , Health Care Costs , Humans , Laparoscopy/adverse effects , Laparoscopy/methods , Length of Stay , Operative Time , Postoperative Complications/etiology , Robotic Surgical Procedures/adverse effects , Robotic Surgical Procedures/methods , Stomach Neoplasms/diagnosis , Stomach Neoplasms/mortality , Treatment Outcome
12.
Comb Chem High Throughput Screen ; 21(9): 631-645, 2018.
Article in English | MEDLINE | ID: mdl-30569852

ABSTRACT

AIM AND OBJECTIVE: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosing this deadly disease at an early stage is a new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension with a small sample size. Therefore, it is indispensable to develop efficient and robust feature selection methods that identify a small set of genes and achieve better classification performance. MATERIALS AND METHODS: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches to select highly informative genes. The hybrid model with a radial basis function neural network (RBFNN) classifier has been evaluated on 11 benchmark gene expression datasets by employing a 10-fold cross-validation test. RESULTS: The experimental results are compared with seven conventional feature selection methods and other methods in the literature; the extensive comparison shows that our approach has clear advantages in classification accuracy and in the genes selected. CONCLUSION: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimally sized predictive gene subset.


Subjects
Algorithms , Gene Expression , Neoplasms/classification , Tumor Biomarkers/genetics , Genetic Databases , Gene Expression Profiling , Humans , Neoplasms/genetics , Computer Neural Networks , Oligonucleotide Array Sequence Analysis
13.
J Chem Inf Model ; 58(2): 501-510, 2018 02 26.
Article in English | MEDLINE | ID: mdl-29361215

ABSTRACT

Protein-ATP interactions are ubiquitous in a wide variety of biological processes. Correctly locating ATP binding sites from protein information is an important but challenging task for protein function annotation and drug discovery. However, there is no method that can optimally identify ATP binding sites for different proteins. In this study, we report a new composite predictor, ATPbind, for ATP binding sites by integrating the outputs of two template-based predictors (i.e., S-SITE and TM-SITE) and three discriminative sequence-driven features of proteins: position specific scoring matrix, predicted secondary structure, and predicted solvent accessibility. In ATPbind, we assembled multiple support vector machines (SVMs) based on a random undersampling technique to cope with the serious imbalance phenomenon between the numbers of ATP binding sites and of non-ATP binding sites. We also constructed a new gold-standard benchmark data set consisting of 429 ATP binding proteins from the PDB database to evaluate and compare the proposed ATPbind with other existing predictors. Starting from a query sequence and predicted I-TASSER models, ATPbind can achieve an average accuracy of 72%, covering 62% of all ATP binding sites while achieving a Matthews correlation coefficient value that is significantly higher than that of other state-of-the-art predictors.
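
The random undersampling idea used to counter class imbalance can be sketched as follows: each member of the ensemble is trained on all minority-class samples plus an equally sized random draw from the majority class. This is only a sketch under assumptions; the helper names are hypothetical, classifier training is left out (an SVM from any library could be fitted on each subset), and averaging the member scores is an assumed combination rule that the abstract does not specify.

```python
import random
from typing import List, Sequence


def undersampled_subsets(labels: Sequence[int], n_subsets: int,
                         seed: int = 0) -> List[List[int]]:
    """Return index lists, each holding every minority sample and an
    equally sized random sample of the majority class."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = ((positives, negatives)
                          if len(positives) <= len(negatives)
                          else (negatives, positives))
    return [sorted(minority + rng.sample(majority, len(minority)))
            for _ in range(n_subsets)]


def ensemble_score(member_scores: Sequence[float]) -> float:
    """Combine per-member scores for one residue by simple averaging (assumed)."""
    return sum(member_scores) / len(member_scores)


# Hypothetical labels: 3 binding residues among 23.
toy_labels = [1] * 3 + [0] * 20
subsets = undersampled_subsets(toy_labels, n_subsets=5)
```

Each subset is balanced, so no single classifier is dominated by non-binding residues, while the ensemble as a whole still sees much of the majority class.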


Subjects
Adenosine Triphosphate/metabolism , Proteins/metabolism , Support Vector Machine , Binding Sites , Protein Databases , Protein Secondary Structure , Proteins/chemistry , Solvents/chemistry
14.
IEEE/ACM Trans Comput Biol Bioinform ; 15(4): 1315-1324, 2018.
Article in English | MEDLINE | ID: mdl-28600258

ABSTRACT

Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsically high dimensionality of the features and the small number of data samples. We proposed a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher, respectively, than those of the best control method. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrated a new sensitive avenue for recognizing cancer classifications from large-scale gene expression data.


Subjects
Algorithms , Computational Biology/methods , Neoplasms/classification , Transcriptome/genetics , Cluster Analysis , Genetic Databases , Humans , Neoplasms/genetics , Neoplasms/metabolism , Supervised Machine Learning
15.
IEEE Trans Nanobioscience ; 15(7): 674-682, 2016 10.
Article in English | MEDLINE | ID: mdl-27552763

ABSTRACT

As one of the most ubiquitous post-transcriptional modifications of RNA, N6-methyladenosine (m6A) plays an essential role in many vital biological processes. The identification of m6A sites in RNAs is significantly important for both basic biomedical research and practical drug development. In this study, we designed a computational-based method, called TargetM6A, to rapidly and accurately target m6A sites solely from the primary RNA sequences. Two new features, i.e., position-specific nucleotide/dinucleotide propensities (PSNP/PSDP), are introduced and combined with the traditional nucleotide composition (NC) feature to formulate RNA sequences. The extracted features are further optimized to obtain a much more compact and discriminative feature subset by applying an incremental feature selection (IFS) procedure. Based on the optimized feature subset, we trained TargetM6A on the training dataset with a support vector machine (SVM) as the prediction engine. We compared the proposed TargetM6A method with existing methods for predicting m6A sites by performing stringent jackknife tests and independent validation tests on benchmark datasets. The experimental results show that the proposed TargetM6A method outperformed the existing methods for predicting m6A sites and remarkably improved the prediction performances, with MCC = 0.526 and AUC = 0.818. We also provided a user-friendly web server for TargetM6A, which is publicly accessible for academic use at http://csbio.njust.edu.cn/bioinf/TargetM6A.
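
To make the sequence features concrete, the sketch below computes the nucleotide composition (NC) of an RNA sequence and one plausible form of a position-specific nucleotide propensity (PSNP) matrix. The abstract does not define PSNP, so taking it as the per-position frequency in positive samples minus that in negative samples is an assumption, and the toy sequences are invented for illustration.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGU"


def nucleotide_composition(seq: str) -> List[float]:
    """Fraction of A, C, G and U in `seq` (the NC feature)."""
    return [seq.count(n) / len(seq) for n in NUCLEOTIDES]


def position_frequencies(seqs: Sequence[str]) -> List[List[float]]:
    """4 x L matrix: frequency of each nucleotide at each position.
    All sequences are assumed to have the same length L."""
    length = len(seqs[0])
    return [[sum(s[pos] == n for s in seqs) / len(seqs) for pos in range(length)]
            for n in NUCLEOTIDES]


def psnp_matrix(positives: Sequence[str], negatives: Sequence[str]) -> List[List[float]]:
    """Assumed PSNP form: positive-set frequency minus negative-set frequency."""
    f_pos = position_frequencies(positives)
    f_neg = position_frequencies(negatives)
    return [[p - q for p, q in zip(row_p, row_n)]
            for row_p, row_n in zip(f_pos, f_neg)]


# Invented two-nucleotide windows, far shorter than real samples.
nc = nucleotide_composition("AACG")
psnp = psnp_matrix(["AC", "AG"], ["UC", "AC"])
```

A sequence would then be encoded by looking up, at each position, the matrix entry for the nucleotide found there, giving one number per position.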


Subjects
Adenosine/analogs & derivatives , Computational Biology/methods , RNA/chemistry , RNA Sequence Analysis/methods , Support Vector Machine , Adenosine/analysis , Adenosine/chemistry , RNA/analysis , Saccharomyces cerevisiae/genetics
16.
Anal Biochem ; 508: 104-13, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27293216

ABSTRACT

N(6)-methyladenosine (m(6)A) is one of the most common and abundant post-transcriptional RNA modifications found in viruses and most eukaryotes. m(6)A plays an essential role in many vital biological processes to regulate gene expression. Because of its widespread distribution across the genomes, the identification of m(6)A sites from RNA sequences is of significant importance for better understanding the regulatory mechanism of m(6)A. Although progress has been achieved in m(6)A site prediction, challenges remain. This article aims to further improve the performance of m(6)A site prediction by introducing a new heuristic nucleotide physical-chemical property selection (HPCS) algorithm. The proposed HPCS algorithm can effectively extract an optimized subset of nucleotide physical-chemical properties under the prescribed feature representation for encoding an RNA sequence into a feature vector. We demonstrate the efficacy of the proposed HPCS algorithm under different feature representations, including pseudo dinucleotide composition (PseDNC), auto-covariance (AC), and cross-covariance (CC). Based on the proposed HPCS algorithm, we implemented an m(6)A site predictor, called M6A-HPCS, which is freely available at http://csbio.njust.edu.cn/bioinf/M6A-HPCS. Experimental results over rigorous jackknife tests on benchmark datasets demonstrated that the proposed M6A-HPCS achieves higher success rates and outperforms existing state-of-the-art sequence-based m(6)A site predictors.
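
Of the feature representations listed above, auto-covariance (AC) is the simplest to illustrate. The sketch below maps each nucleotide to a numeric property value and measures how that value co-varies between positions a fixed lag apart. The property table is a hypothetical placeholder, not a published physical-chemical scale, and the published method may operate on dinucleotides rather than single nucleotides.

```python
from typing import Dict, List, Sequence

# Hypothetical per-nucleotide property values, for illustration only.
TOY_PROPERTY: Dict[str, float] = {"A": 0.1, "C": 0.4, "G": 0.7, "U": 1.0}


def auto_covariance(values: Sequence[float], lag: int) -> float:
    """Mean product of mean-centred values that lie `lag` positions apart."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    return sum((values[i] - mean) * (values[i + lag] - mean)
               for i in range(n - lag)) / (n - lag)


def ac_features(seq: str, max_lag: int,
                prop: Dict[str, float] = TOY_PROPERTY) -> List[float]:
    """Encode `seq` as its auto-covariance at lags 1..max_lag for one property."""
    values = [prop[n] for n in seq]
    return [auto_covariance(values, lag) for lag in range(1, max_lag + 1)]


features = ac_features("GGACUAC", max_lag=3)
```

With several properties, one such vector is produced per property and the vectors are concatenated; cross-covariance follows the same pattern but pairs two different properties.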


Subjects
Adenosine/analogs & derivatives , Algorithms , Nucleotides/chemistry , Adenosine/chemistry , Binding Sites , Heuristics
17.
Anal Biochem ; 497: 60-7, 2016 Mar 15.
Article in English | MEDLINE | ID: mdl-26748145

ABSTRACT

Just like PTM or PTLM (post-translational modification) in proteins, PTCM (post-transcriptional modification) in RNA plays very important roles in biological processes. Occurring at adenine (A) with the genetic code motif (GAC), N(6)-methyladenosine (m(6)A) is one of the most common and abundant PTCMs in RNA found in viruses and most eukaryotes. Given an uncharacterized RNA sequence containing many GAC motifs, which of them can be methylated, and which cannot? It is important for both basic research and drug development to address this problem. Particularly with the avalanche of RNA sequences generated in the postgenomic age, there is high demand for computational methods that can identify N(6)-methyladenosine sites in RNA in a timely manner. Here we propose a new predictor called pRNAm-PC, in which RNA sequence samples are expressed by a novel mode of pseudo dinucleotide composition (PseDNC) whose components were derived from a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations. It was observed via a rigorous jackknife test that, in comparison with the existing predictor for the same purpose, pRNAm-PC achieved remarkably higher success rates in both overall accuracy and stability, indicating that the new predictor will become a useful high-throughput tool for identifying methylation sites in RNA, and that the novel approach can also be used to study many other RNA-related problems and conduct genome analysis. A user-friendly Web server for pRNAm-PC has been established at http://www.jci-bioinfo.cn/pRNAm-PC, by which users can easily get their desired results without needing to go through the mathematical details.


Subjects
Adenosine/analogs & derivatives , RNA Processing, Post-Transcriptional , RNA/chemistry , Adenosine/analysis , Adenosine/genetics , Algorithms , Base Sequence , Models, Genetic , RNA/genetics , RNA, Fungal/chemistry , RNA, Fungal/genetics , Saccharomyces cerevisiae/chemistry , Saccharomyces cerevisiae/genetics , Software , Support Vector Machine
18.
Amino Acids ; 44(5): 1365-79, 2013 May.
Article in English | MEDLINE | ID: mdl-23456487

ABSTRACT

Protein attribute prediction from primary sequences is an important task, and extracting discriminative features is one of its most crucial aspects. Because a single-view feature cannot reflect all the information of a protein, fusing multi-view features is considered a promising route to improving prediction accuracy. In this paper, we propose a novel framework for protein multi-view feature fusion. First, features from different views are combined in parallel to form complex feature vectors. Then, we extend classic principal component analysis to generalized principal component analysis for further feature extraction from the parallel-combined complex features, which lie in a complex space. Finally, the extracted features are used for prediction. Experimental results on different benchmark datasets and machine learning algorithms demonstrate that the parallel strategy outperforms the traditional serial approach and is particularly helpful for extracting the core information buried among multi-view feature sets. A web server for protein structural class prediction based on the proposed method (COMSPA) is freely available for academic use at: http://www.csbio.sjtu.edu.cn/bioinf/COMSPA/ .
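To make the contrast between the two fusion strategies concrete, the sketch below shows serial fusion as plain concatenation and parallel fusion as one complex number per dimension, with one view as the real part and the other as the imaginary part. Zero-padding the shorter view is an assumption made here so the two views align; the function names are hypothetical, and the generalized PCA step that follows in the paper is not shown.

```python
def serial_fusion(view_a, view_b):
    """Serial strategy: concatenate the two real-valued feature vectors."""
    return list(view_a) + list(view_b)


def parallel_fusion(view_a, view_b):
    """Parallel strategy: build complex features view_a + i * view_b.

    The shorter view is zero-padded so both have the same length
    (an assumption of this sketch).
    """
    size = max(len(view_a), len(view_b))
    padded_a = list(view_a) + [0.0] * (size - len(view_a))
    padded_b = list(view_b) + [0.0] * (size - len(view_b))
    return [complex(a, b) for a, b in zip(padded_a, padded_b)]


# Example: a 3-dimensional view fused with a 2-dimensional view.
fused = parallel_fusion([1.0, 2.0, 3.0], [4.0, 5.0])
```

The serial vector has as many dimensions as both views combined, while the parallel vector has only as many as the longer view, which is why any later feature extraction has to operate in a complex space.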


Subjects
Proteins/chemistry , Algorithms , Artificial Intelligence , Bayes Theorem , Computer Simulation , Cystine/chemistry , Models, Molecular , Principal Component Analysis , Protein Structure, Secondary , Software
19.
J Comput Chem ; 34(11): 974-85, 2013 Apr 30.
Article in English | MEDLINE | ID: mdl-23288787

ABSTRACT

Understanding the interactions between proteins and ligands is critical for protein function annotation and drug discovery. We report a new sequence-based, template-free predictor (TargetATPsite) that identifies adenosine-5'-triphosphate (ATP) binding sites with machine-learning approaches. TargetATPsite works in two steps: prediction of binding residues, followed by prediction of binding pockets. To predict the binding residues, a novel image sparse representation technique is proposed to encode residue evolution information, which serves as the input features. The prediction model is an ensemble classifier built from support vector machines (SVMs) trained on multiple random under-samplings, which is effective for dealing with the imbalance between positive and negative training samples. Unlike existing ATP-specific sequence-based predictors, TargetATPsite includes a second step that further identifies the binding pockets from the predicted binding residues through a spatial clustering algorithm. Experimental results on three benchmark datasets demonstrate the efficacy of TargetATPsite.
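The ensemble-over-random-under-samplings idea can be sketched as below: every member keeps all positive samples, draws an equally sized random subset of the negatives, and the members vote on each query. A nearest-centroid rule replaces the SVM base learner purely to keep the example dependency-free, majority voting is an assumed combination rule, and positives are assumed to be the minority class; none of these details come from the abstract.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def sq_dist(x, center):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def train_ensemble(positives, negatives, n_members, seed=0):
    """Train one placeholder model per balanced random under-sampling."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))  # balanced subset size
    members = []
    for _ in range(n_members):
        negative_subset = rng.sample(negatives, k)  # sampled without replacement
        members.append((centroid(positives), centroid(negative_subset)))
    return members


def predict(members, x):
    """Majority vote: True if most members place x nearer the positive centroid."""
    votes = sum(
        sq_dist(x, pos_center) < sq_dist(x, neg_center)
        for pos_center, neg_center in members
    )
    return votes * 2 > len(members)
```

Each member sees a balanced training set, so no single model is dominated by the majority class, while the ensemble as a whole still draws on many different negative samples.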


Subjects
Adenosine Triphosphate/chemistry , Molecular Dynamics Simulation , Proteins/chemistry , Software , Support Vector Machine , Binding Sites , Databases, Protein , Drug Design , Humans , Ligands , Protein Binding , Thermodynamics
20.
BMC Bioinformatics ; 13: 118, 2012 May 31.
Article in English | MEDLINE | ID: mdl-22651691

ABSTRACT

BACKGROUND: Adenosine-5'-triphosphate (ATP) is a multifunctional nucleotide that plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between proteins and ATP is important for understanding the functionality of the proteins and the mechanisms of protein-ATP complexes. RESULTS: In this paper, we propose a novel framework for predicting the functional residues through which proteins bind ATP molecules. The new prediction protocol combines sequence evolutionary information, bi-profile sampling of multi-view sequential features, and sequence-derived structural features. The hypothesis behind this strategy is that a single-view feature can represent only part of the knowledge about the target, and that multiple sources of descriptors can be complementary. CONCLUSIONS: Prediction performance evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets, consisting of 168 and 227 non-homologous ATP-binding proteins respectively, demonstrates the efficacy of the proposed protocol. Our experimental results also reveal that the structural characteristics of residues in real protein-ATP binding sites are significantly different from those of non-binding residues: for example, the binding residues do not show high solvent accessibility propensities, and binding tends to occur at the junctions between different secondary structure segments. Furthermore, tests with multiple ratios of positive to negative samples show that performance is affected by imbalanced training datasets. Increasing the dataset scale is also shown to be useful for improving prediction performance.


Subjects
Adenosine Triphosphate/chemistry , Binding Sites , Computational Biology/methods , Databases, Protein , Proteins/chemistry , Support Vector Machine