Search | VHL Violence and Health

1.

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction.

Zhu, Yi-Heng; Liu, Zi; Liu, Yan; Ji, Zhiwei; Yu, Dong-Jun.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.

Subjects

Data Analysis , Language , Binding Sites , Amino Acid Sequence , Databases, Factual

2.

MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network.

Han, Ke; Shen, Long-Chen; Zhu, Yi-Heng; Xu, Jian; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 23(1), 2022 01 17.

Article in English | MEDLINE | ID: mdl-34664074

ABSTRACT

Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.

Subjects

Deep Learning , Binding Sites/genetics , Neural Networks, Computer , Protein Binding , Transcription Factors/metabolism

3.

Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation.

Liu, Yan; Zhu, Yi-Heng; Song, Xiaoning; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(5), 2021 09 02.

Article in English | MEDLINE | ID: mdl-33537753

ABSTRACT

As an essential task in protein structure and function prediction, protein fold recognition has attracted increasing attention. The majority of the existing machine learning-based protein fold recognition approaches strongly rely on handcrafted features, which depict the characteristics of different protein folds; however, effective feature extraction methods still represent the bottleneck for further performance improvement of protein fold recognition. As a powerful feature extractor, deep convolutional neural network (DCNN) can automatically extract discriminative features for fold recognition without human intervention, which has demonstrated an impressive performance on protein fold recognition. Despite the encouraging progress, DCNN often acts as a black box, and as such, it is challenging for users to understand what really happens in DCNN and why it works well for protein fold recognition. In this study, we explore the intrinsic mechanism of DCNN and explain why it works for protein fold recognition using a visual explanation technique. More specifically, we first trained a VGGNet-based DCNN model, termed VGGNet-FE, which can extract fold-specific features from the predicted protein residue-residue contact map for protein fold recognition. Subsequently, based on the trained VGGNet-FE, we implemented a new contact-assisted predictor, termed VGGfold, for protein fold recognition; we then visualized what features were extracted by each of the convolutional layers in VGGNet-FE using a deconvolution technique. Furthermore, we visualized the high-level semantic information, termed fold-discriminative region, of a predicted contact map from the localization map obtained from the last convolutional layer of VGGNet-FE. It is visually confirmed that VGGNet-FE could effectively extract distinct fold-discriminative regions for different types of protein folds, thereby accounting for the improved performance of VGGfold for protein fold recognition. 
In summary, this study is of great significance for both understanding the working principle of DCNNs in protein fold recognition and exploring the relationship between the predicted protein contact map and protein tertiary structure. This proposed visualization method is flexible and applicable to address other DCNN-based bioinformatics and computational biology questions. The online web server of VGGfold is freely available at http://csbio.njust.edu.cn/bioinf/vggfold/.

Subjects

Computational Biology/methods , Machine Learning , Neural Networks, Computer , Protein Folding , Proteins/chemistry , Data Visualization , Humans , Protein Interaction Maps , Protein Structure, Tertiary , Proteins/metabolism , Semantics

4.

Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features.

Zhu, Yi-Heng; Hu, Jun; Ge, Fang; Li, Fuyi; Song, Jiangning; Zhang, Yang; Yu, Dong-Jun.

Brief Bioinform ; 22(3), 2021 05 20.

Article in English | MEDLINE | ID: mdl-32436937

ABSTRACT

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12,289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of the other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.
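
The Matthews correlation coefficient (MCC) reported in this abstract is computed from the four confusion-matrix counts of a binary classifier. The Python sketch below shows that formula, together with a relative-increase helper, on the assumption that the quoted percentages are increases relative to a baseline MCC. The function names are hypothetical and the code is not taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_increase_percent(new, baseline):
    """Percentage change of `new` relative to `baseline` (assumed nonzero)."""
    return (new - baseline) / abs(baseline) * 100.0
```

Under this reading, an MCC that rises from 0.1 to 0.3 would count as a 200% increase, which is how a figure near 199.7% can arise when the baseline MCC is small.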

Subjects

Computational Biology/methods , Crystallization/methods , Proteins/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Databases, Protein , Models, Chemical

5.

Improving protein fold recognition using triplet network and ensemble deep learning.

Liu, Yan; Han, Ke; Zhu, Yi-Heng; Zhang, Ying; Shen, Long-Chen; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6), 2021 11 05.

Article in English | MEDLINE | ID: mdl-34226918

ABSTRACT

Protein fold recognition is a critical step toward protein structure and function prediction, aiming at providing the most likely fold type of the query protein. In recent years, the development of deep learning (DL) techniques has led to massive advances in this important field, and accordingly, the sensitivity of protein fold recognition has been dramatically improved. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect, inefficient and conditional on the hypothesis that the bottleneck layer's representation is a good representation of proteins with new fold types. To address the above problem, in this work, we develop a new computational framework by combining a triplet network and ensemble DL. We first train a DL-based model, termed FoldNet, which employs triplet loss to train the deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, bringing proteins with the same fold type closer to each other than those with different fold types in the new protein embedding space. Subsequently, using the trained FoldNet, we implement a new residue-residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines the protein fold embedding with two other discriminative fold-specific features extracted by two DL-based methods, SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and their combination can further improve fold recognition at the fold level. The implemented web server of FoldTR and benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.
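
The triplet loss mentioned in this abstract can be illustrated with one common formulation: the loss is zero once the anchor is closer to the same-fold (positive) embedding than to the different-fold (negative) embedding by at least a margin. The Python sketch below uses plain Euclidean distance and a margin of 1.0. Both choices, and the function names, are assumptions for illustration; the abstract does not give FoldNet's exact distance, margin, or triplet-mining scheme.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings: the positive is near the anchor, the negative is far.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 5.0])   # constraint satisfied
hard = triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0])   # constraint violated
```

Minimizing this quantity over many triplets pushes same-fold embeddings together and different-fold embeddings apart, which is the behavior the abstract describes for the embedding space.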

Subjects

Computational Biology/methods , Deep Learning , Models, Molecular , Protein Conformation , Protein Folding , Proteins/chemistry , Algorithms , Databases, Protein , Neural Networks, Computer , Reproducibility of Results , Sensitivity and Specificity

6.

GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction.

Wang, Peng-Hao; Zhu, Yi-Heng; Yang, Xibei; Yu, Dong-Jun.

Anal Biochem ; 663: 115020, 2023 02 15.

Article in English | MEDLINE | ID: mdl-36521558

ABSTRACT

X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction through integrating a graph attention network with the predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with the predicted contact map, which effectively associates the residue-interaction knowledge with crystallization patterns. Meanwhile, the four designed sequence-based features are complementary and further enhance crystallization propensity prediction.

Subjects

Computational Biology , Proteins , Crystallization/methods , Proteins/chemistry , Crystallography, X-Ray , Computational Biology/methods

7.

Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Yu, Dong-Jun; Zhang, Yang.

PLoS Comput Biol ; 18(12): e1010793, 2022 12.

Article in English | MEDLINE | ID: mdl-36548439

ABSTRACT

Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.

Subjects

Computational Biology , Proteins , Gene Ontology , Computational Biology/methods , Proteins/genetics , Proteins/metabolism , Neural Networks, Computer , Language

8.

TargetDBP+: Enhancing the Performance of Identifying DNA-Binding Proteins via Weighted Convolutional Features.

Hu, Jun; Rao, Liang; Zhu, Yi-Heng; Zhang, Gui-Jun; Yu, Dong-Jun.

J Chem Inf Model ; 61(1): 505-515, 2021 01 25.

Article in English | MEDLINE | ID: mdl-33410688

ABSTRACT

Protein-DNA interactions exist ubiquitously and play important roles in the life cycles of living cells. The accurate identification of DNA-binding proteins (DBPs) is one of the key steps to understand the mechanisms of protein-DNA interactions. Although many DBP identification methods have been proposed, the current performance is still unsatisfactory. In this study, a new method, called TargetDBP+, is developed to further enhance the performance of identifying DBPs. In TargetDBP+, five convolutional features are first extracted from five feature sources, i.e., amino acid one-hot matrix (AAOHM), position-specific scoring matrix (PSSM), predicted secondary structure probability matrix (PSSPM), predicted solvent accessibility probability matrix (PSAPM), and predicted probabilities of DNA-binding sites (PPDBSs); second, the five features are weightedly and serially combined using the weights of all of the elements learned by the differential evolution algorithm; and finally, the DBP identification model of TargetDBP+ is trained using the support vector machine (SVM) algorithm. To evaluate the developed TargetDBP+ and compare it with other existing methods, a new gold-standard benchmark data set, called UniSwiss, is constructed, which consists of 4881 DBPs and 4881 non-DBPs extracted from the UniprotKB/Swiss-Prot database. Experimental results demonstrate that TargetDBP+ can obtain an accuracy of 85.83% and precision of 88.45% covering 82.41% of all DBP data on the independent validation subset of UniSwiss, with the MCC value (0.718) being significantly higher than those of other state-of-the-art control methods. The web server of TargetDBP+ is accessible at http://csbio.njust.edu.cn/bioinf/targetdbpplus/; the UniSwiss data set and stand-alone program of TargetDBP+ are accessible at https://github.com/jun-csbio/TargetDBPplus.
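
The "weighted and serial" combination step described in this abstract can be read as scaling every element of each feature vector by its own weight and then concatenating the scaled vectors into one long vector. The Python sketch below follows that reading. The function name and the example weights are hypothetical, and the differential evolution search that learns the weights in the paper is not shown.

```python
def weighted_serial_combine(features, weights):
    """Scale each feature vector element-wise by its weights, then concatenate.

    `features` and `weights` are parallel lists of equal-length vectors.
    """
    if len(features) != len(weights):
        raise ValueError("need one weight vector per feature vector")
    combined = []
    for feature, weight in zip(features, weights):
        if len(feature) != len(weight):
            raise ValueError("weight vector must match feature vector length")
        combined.extend(x * w for x, w in zip(feature, weight))
    return combined


# Two toy feature sources with hand-picked (not learned) weights.
super_feature = weighted_serial_combine(
    [[1.0, 2.0], [3.0]],
    [[0.5, 0.5], [2.0]],
)
```

The resulting vector is what a downstream classifier such as an SVM would be trained on in a pipeline of this shape.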

Subjects

DNA-Binding Proteins , Support Vector Machine , Algorithms , Binding Sites , DNA-Binding Proteins/metabolism , Databases, Protein , Position-Specific Scoring Matrices

9.

SSCpred: Single-Sequence-Based Protein Contact Prediction Using Deep Fully Convolutional Network.

Chen, Ming-Cai; Li, Yang; Zhu, Yi-Heng; Ge, Fang; Yu, Dong-Jun.

J Chem Inf Model ; 60(6): 3295-3303, 2020 06 22.

Article in English | MEDLINE | ID: mdl-32338512

ABSTRACT

There has been a significant improvement in protein residue contact prediction in recent years. Nevertheless, state-of-the-art methods still show deficiencies in the contact prediction of proteins with low-homology information. These top methods depend largely on statistical features that are derived from homologous sequences, but previous studies, along with our analyses, show that they are insufficient for inferring an accurate contact map for nonhomology protein targets. To compensate, we proposed a brand new single-sequence-based contact predictor (SSCpred) that performs prediction through the deep fully convolutional network (Deep FCN) with only the target sequence itself, i.e., without additional homology information. The proposed pipeline makes good use of the target sequence by utilizing the pair-wise encoding technique and Deep FCN. Experimental results demonstrated that SSCpred can produce accurate predictions based on the efficient pipeline. Compared with several of the most recent methods, SSCpred achieves competitive performance on nonhomology targets. Overall, we explored the possibilities of single-sequence-based contact prediction and designed a novel pipeline without using a complex and redundant feature set. The proposed SSCpred can compensate for current methods' disadvantages and achieves better performance on the nonhomology targets. The web server of SSCpred is freely available at http://csbio.njust.edu.cn/bioinf/sscpred/.

Subjects

Computational Biology , Proteins , Algorithms , Proteins/genetics

10.

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines.

Zhu, Yi-Heng; Hu, Jun; Song, Xiao-Ning; Yu, Dong-Jun.

J Chem Inf Model ; 59(6): 3057-3071, 2019 06 24.

Article in English | MEDLINE | ID: mdl-30943723

ABSTRACT

Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. 
The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
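
The core idea of hyperplane-distance-based under-sampling, ranking majority-class samples by their distance to an SVM's separating hyperplane, can be sketched for a linear hyperplane w·x + b = 0. In the Python sketch below, keeping the k negatives nearest to the hyperplane is an assumption made for illustration: the abstract does not state the actual selection rule, the iteration scheme, or the kernel used in HD-US, and the function names are hypothetical.

```python
import math


def hyperplane_distance(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def undersample_by_distance(negatives, w, b, k):
    """Keep the k majority-class samples closest to the hyperplane (assumed rule)."""
    ranked = sorted(negatives, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:k]


# Toy example: hyperplane x1 = 1 in two dimensions.
w, b = [1.0, 0.0], -1.0
negatives = [[5.0, 0.0], [1.5, 0.0], [0.0, 0.0]]
kept = undersample_by_distance(negatives, w, b, k=2)
```

Each reduced negative subset, paired with all positive samples, would then train one SVM of the ensemble, in line with the two-stage design the abstract outlines.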

Subjects

DNA-Binding Proteins/metabolism , DNA/metabolism , Models, Molecular , Support Vector Machine , DNA/chemistry , DNA-Binding Proteins/chemistry , Nucleic Acid Conformation , Protein Conformation

11.

BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-methyladenosine Sites.

Xia, Yunpeng; Zhang, Ying; Liu, Dian; Zhu, Yi-Heng; Wang, Zhikang; Song, Jiangning; Yu, Dong-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Jun 24.

Article in English | MEDLINE | ID: mdl-38913512

ABSTRACT

RNA N6-methyladenosine is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, with most of them ignoring the correlations of different encoding strategies and failing to explore the adaptability of various attention mechanisms for methylation identification. To solve the above issues, we proposed an innovative framework for predicting RNA m6A modification site, termed BLAM6A-Merge. Specifically, it utilized a multimodal feature fusion strategy to combine the classification results of four features and Blastn tool. Apart from this, different attention mechanisms were employed for extracting higher-level features on specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites. The data and code can be accessed at https://github.com/DoraemonXia/BLAM6A-Merge.

12.

Integrating unsupervised language model with multi-view multiple sequence alignments for high-accuracy inter-chain contact prediction.

Liu, Zi; Zhu, Yi-Heng; Shen, Long-Chen; Xiao, Xuan; Qiu, Wang-Ren; Yu, Dong-Jun.

Comput Biol Med ; 166: 107529, 2023 Sep 20.

Article in English | MEDLINE | ID: mdl-37748220

ABSTRACT

Accurate identification of inter-chain contacts in the protein complex is critical to determine the corresponding 3D structures and understand the biological functions. We proposed a new deep learning method, ICCPred, to deduce the inter-chain contacts from the amino acid sequences of the protein complex. This pipeline was built on the designed deep residual network architecture, integrating the pre-trained language model with three multiple sequence alignments (MSAs) from different biological views. Experimental results on 709 non-redundant benchmarking protein complexes showed that the proposed ICCPred significantly increased inter-chain contact prediction accuracy compared to the state-of-the-art approaches. Detailed data analyses showed that the significant advantage of ICCPred lies in the utilization of pre-trained transformer language models which can effectively extract the complementary co-evolution diversity from three MSAs. Meanwhile, the designed deep residual network enhances the correlation between the co-evolution diversity and the patterns of inter-chain contacts. These results demonstrated a new avenue for high-accuracy deep-learning inter-chain contact prediction that is applicable to large-scale protein-protein interaction annotations from sequence alone.

13.

TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble.

Ge, Fang; Hu, Jun; Zhu, Yi-Heng; Arif, Muhammad; Yu, Dong-Jun.

Comb Chem High Throughput Screen ; 25(1): 38-52, 2022.

Article in English | MEDLINE | ID: mdl-33280588

ABSTRACT

AIM AND OBJECTIVE: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. MATERIALS AND METHODS: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. RESULTS: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.

Subjects

Algorithms , Mutation, Missense , Humans , Proteins/chemistry , Software

14.

TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Liu, Yan; Omenn, Gilbert S; Freddolino, Peter L; Yu, Dong-Jun; Zhang, Yang.

Genomics Proteomics Bioinformatics ; 20(5): 1013-1027, 2022 10.

Article in English | MEDLINE | ID: mdl-35568117

ABSTRACT

Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.

Subjects

Computational Biology , Proteins , Animals , Mice , Rats , Humans , Proteins/metabolism , Molecular Sequence Annotation , Amino Acid Sequence , Sequence Alignment , Computational Biology/methods

15.

MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins.

Ge, Fang; Zhu, Yi-Heng; Xu, Jian; Muhammad, Arif; Song, Jiangning; Yu, Dong-Jun.

Comput Struct Biotechnol J ; 19: 6400-6416, 2021.

Article in English | MEDLINE | ID: mdl-34938415

ABSTRACT

Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes, including cell signaling and the transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered as drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon the protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging the idea learned from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of the WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and three other types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546 mutations dataset, MutTMPredictor achieves an accuracy (ACC) of 0.9661 and a Matthews correlation coefficient (MCC) of 0.8950. On the 67,584 dataset, MutTMPredictor achieves an MCC of 0.7523 and an area under curve (AUC) of 0.8746, which are 0.1625 and 0.0801 higher, respectively, than those of the existing best predictor (fathmm). In addition, MutTMPredictor outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.

16.

TargetDBP: Accurate DNA-Binding Protein Prediction Via Sequence-Based Multi-View Feature Learning.

Hu, Jun; Zhou, Xiao-Gen; Zhu, Yi-Heng; Yu, Dong-Jun; Zhang, Gui-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1419-1429, 2020.

Article in English | MEDLINE | ID: mdl-30668479

ABSTRACT

Accurately identifying DNA-binding proteins (DBPs) from protein sequence information is an important but challenging task for protein function annotation. In this paper, we establish a novel computational method, named TargetDBP, for accurately targeting DBPs from primary sequences. In TargetDBP, four single-view features, i.e., AAC (Amino Acid Composition), PsePSSM (Pseudo Position-Specific Scoring Matrix), PsePRSA (Pseudo Predicted Relative Solvent Accessibility), and PsePPDBS (Pseudo Predicted Probabilities of DNA-Binding Sites), are first extracted as the base features. Second, a differential evolution algorithm is employed to learn the weights of the four base features. Using the learned weights, we combine these base features into the original super feature. An excellent subset of the super feature is then selected using a suitable feature selection algorithm, SVM-RFE+CBR (Support Vector Machine Recursive Feature Elimination with Correlation Bias Reduction). Finally, the prediction model is learned using a support vector machine on the selected feature subset. We also construct a new gold-standard, non-redundant benchmark dataset from the PDB database to evaluate and compare the proposed TargetDBP with other existing predictors. On this new dataset, TargetDBP achieves higher performance than other state-of-the-art predictors. The TargetDBP web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/targetdbp/ for academic use.
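The following sketch illustrates two steps named in the abstract: extracting the AAC view and combining several views by weight. It rests on assumptions. AAC is taken to be the usual 20-dimensional residue frequency vector, and the weighted combination is assumed to scale each view by its weight and then concatenate the results. The abstract does not specify the combination scheme, and the weights here are fixed placeholders rather than values learned by differential evolution. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the frequency of each standard residue in the sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def combine_views(views, weights):
    """Scale each feature view by its weight and concatenate them.

    Assumption: this is one plausible reading of a weighted combination,
    not necessarily the scheme used in TargetDBP.
    """
    if len(views) != len(weights):
        raise ValueError("one weight is required per view")
    combined = []
    for vector, weight in zip(views, weights):
        combined.extend(weight * x for x in vector)
    return combined
```

A call such as `combine_views([amino_acid_composition("MKVLAA"), other_view], [0.7, 0.3])` would yield a single vector suitable as input to a feature selector or classifier, where `other_view` stands in for any of the remaining three views.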

Subjects

Computational Biology/methods , DNA-Binding Proteins , Machine Learning , Sequence Analysis, Protein/methods , Algorithms , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Databases, Protein , Position-Specific Scoring Matrices , Support Vector Machine

17.

Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites.

Zhu, Yi-Heng; Hu, Jun; Qi, Yong; Song, Xiao-Ning; Yu, Dong-Jun.

Comb Chem High Throughput Screen ; 22(7): 455-469, 2019.

Article in English | MEDLINE | ID: mdl-31553288

ABSTRACT

AIM AND OBJECTIVE: The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitates the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, severe class imbalance, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. MATERIALS AND METHODS: In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM in dealing with class imbalance was validated by benchmarking it against several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. RESULTS: Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
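The granular training subsets described in the methods can be sketched as below. This is an illustration under stated assumptions, not the BGSVM algorithm itself. Majority samples are drawn uniformly at random, whereas the abstract says only that they are "reasonably selected" and does not give the rule. The boosting step and the SVM training are omitted. The function name and parameters are hypothetical.

```python
import random


def granular_subsets(minority, majority, n_subsets, seed=0):
    """Build training subsets that each keep every minority sample.

    Each subset adds a random draw of majority samples equal in number
    to the minority class, so every subset is balanced.
    Assumption: random selection stands in for the unspecified
    selection rule in the paper.
    """
    rng = random.Random(seed)
    draw_size = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        picked = rng.sample(majority, draw_size)
        subsets.append(list(minority) + picked)
    return subsets
```

One base classifier would then be trained per subset and the predictions aggregated. Keeping all minority samples in every subset means no binding residue is discarded, while different majority draws expose the ensemble to more of the nonbinding class.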

Subjects

Nucleotides/chemistry , Proteins/chemistry , Support Vector Machine , Binding Sites , High-Throughput Screening Assays

18.

[A new method for predicting anterior chamber depth in intraocular lens implantation].

Song, Wen-xiao; Lu, Guang-wen; Liu, Zhi-bin; Yang, Jing-juan; Zhu, Yi-heng.

Nan Fang Yi Ke Da Xue Xue Bao ; 31(11): 1895-9, 2011 Nov.

Article in Chinese | MEDLINE | ID: mdl-22126774

ABSTRACT

OBJECTIVE: To propose a method for predicting the postoperative anterior chamber depth (ACD) value based on the evaluation standard of the contrast sensitivity function (CSF). METHODS: In a personalized eye model, the natural lens was replaced with an artificial lens, and the tracing method based on ZEMAX software was used to simulate the pseudophakic optical system. The best modulation transfer function (MTF) curve was obtained through optical optimization to calculate the CSF and determine the best contrast sensitivity (CS) curve. The best ACD value was acquired according to the CSF curve. Data from 10 cataract cases without retinal diseases were obtained, and the postoperative ACD value was predicted using Holladay, Hoffer Q, SRK/T and the proposed method. RESULTS: Consistency analysis of the results showed that all the ACD values predicted by the 4 methods fell within the 95% consistency boundaries. The largest absolute differences between the 3 methods (Holladay, Hoffer Q, and SRK/T) and the proposed method were 0.73 mm, 0.65 mm, and 0.68 mm, and the calculated mean values were 5.846 mm, 5.804 mm, and 5.825 mm. Clinically, the two methods were deemed to have good consistency. CONCLUSION: The proposed method establishes a connection between the contrast sensitivity curve and ACD to allow better refraction adjustment for patients.

Subjects

Anterior Chamber/anatomy & histology , Lens Implantation, Intraocular , Refractive Errors/diagnosis , Refractive Errors/rehabilitation , Anthropometry , Humans , Lenses, Intraocular , Postoperative Period , Refractive Errors/complications
