Search | BVS Violência e Saúde

1.

A comprehensive computational benchmark for evaluating deep learning-based protein function prediction approaches.

Wang, Wenkang; Shuai, Yunyan; Yang, Qiurong; Zhang, Fuhao; Zeng, Min; Li, Min.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38388682

ABSTRACT

Proteins play an important role in life activities and are the basic units for performing functions. Accurately annotating functions to proteins is crucial for understanding the intricate mechanisms of life and developing effective treatments for complex diseases. Traditional biological experiments struggle to keep pace with the growing number of known proteins. With the development of high-throughput sequencing technology, a wide variety of biological data provides the possibility to accurately predict protein functions by computational methods. Consequently, many computational methods have been proposed. Due to the diversity of application scenarios, it is necessary to conduct a comprehensive evaluation of these computational methods to determine the suitability of each algorithm for specific cases. In this study, we present a comprehensive benchmark, BeProf, to process data and evaluate representative computational methods. We first collect the latest datasets and analyze the data characteristics. Then, we investigate and summarize 17 state-of-the-art computational methods. Finally, we propose a novel comprehensive evaluation metric, design eight application scenarios and evaluate the performance of existing methods on these scenarios. Based on the evaluation, we provide practical recommendations for different scenarios, enabling users to select the most suitable method for their specific needs. All of these servers can be obtained from https://csuligroup.com/BEPROF and https://github.com/CSUBioGroup/BEPROF.

Subjects

Deep Learning , Benchmarking , Proteins , Algorithms , High-Throughput Nucleotide Sequencing

2.

Partial order relation-based gene ontology embedding improves protein function prediction.

Li, Wenjing; Wang, Bin; Dai, Jin; Kou, Yan; Chen, Xiaojun; Pan, Yi; Hu, Shuangwei; Xu, Zhenjiang Zech.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38446740

ABSTRACT

Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO-directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.

Subjects

Benchmarking , Computational Biology , Gene Ontology , Learning , Molecular Sequence Annotation
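
The abstract of record 2 refers to the partial order relationships among terms in the GO directed acyclic graph. As a rough illustration of what that partial order means, and not of how PO2Vec itself is implemented, the sketch below treats "is a descendant of" as the order on a toy DAG. The term names and the helper functions `ancestors`, `precedes` and `comparable` are hypothetical and are not taken from the paper or from the real Gene Ontology.

```python
# Toy GO-like DAG: each term maps to the set of its direct parents.
# Term names are invented for illustration only.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "metal_ion_binding": {"ion_binding"},
    "metalloenzyme_activity": {"catalysis", "metal_ion_binding"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def precedes(a, b, parents=PARENTS):
    """True if a <= b in the partial order: a equals b or a descends from b."""
    return a == b or b in ancestors(a, parents)


def comparable(a, b, parents=PARENTS):
    """True if the two terms are ordered one way or the other."""
    return precedes(a, b, parents) or precedes(b, a, parents)


if __name__ == "__main__":
    print(sorted(ancestors("metalloenzyme_activity")))
    print(precedes("ion_binding", "binding"))
    print(comparable("binding", "catalysis"))
```

Two terms such as `binding` and `catalysis` share an ancestor but are not comparable, which is the kind of structure that a purely ancestor co-occurrence view may not capture.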

3.

DeepSS2GO: protein function prediction from secondary structure.

Song, Fu V; Su, Jiaqi; Huang, Sixing; Zhang, Neng; Li, Kaiyue; Ni, Ming; Liao, Maofu.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein in three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.

Subjects

Algorithms , Computational Biology , Computer Neural Networks , Protein Secondary Structure , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Protein Databases , Gene Ontology , Protein Sequence Analysis/methods , Software

4.

Rheostats, toggles, and neutrals, Oh my! A new framework for understanding how amino acid changes modulate protein function.

Swint-Kruse, Liskin; Fenton, Aron W.

J Biol Chem ; 300(3): 105736, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38336297

ABSTRACT

Advances in personalized medicine and protein engineering require accurately predicting outcomes of amino acid substitutions. Many algorithms correctly predict that evolutionarily-conserved positions show "toggle" substitution phenotypes, which is defined when a few substitutions at that position retain function. In contrast, predictions often fail for substitutions at the less-studied "rheostat" positions, which are defined when different amino acid substitutions at a position sample at least half of the possible functional range. This review describes efforts to understand the impact and significance of rheostat positions: (1) They have been observed in globular soluble, integral membrane, and intrinsically disordered proteins; within single proteins, their prevalence can be up to 40%. (2) Substitutions at rheostat positions can have biological consequences and ~10% of substitutions gain function. (3) Although both rheostat and "neutral" (defined when all substitutions exhibit wild-type function) positions are nonconserved, the two classes have different evolutionary signatures. (4) Some rheostat positions have pleiotropic effects on function, simultaneously modulating multiple parameters (e.g., altering both affinity and allosteric coupling). (5) In structural studies, substitutions at rheostat positions appear to cause only local perturbations; the overall conformations appear unchanged. (6) Measured functional changes show promising correlations with predicted changes in protein dynamics; the emergent properties of predicted, dynamically coupled amino acid networks might explain some of the complex functional outcomes observed when substituting rheostat positions. Overall, rheostat positions provide unique opportunities for using single substitutions to tune protein function. Future studies of these positions will yield important insights into the protein sequence/function relationship.

Subjects

Amino Acid Substitution , Amino Acids , Proteins , Amino Acid Sequence , Amino Acids/genetics , Amino Acids/metabolism , Conserved Sequence , Molecular Evolution , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/genetics , Intrinsically Disordered Proteins/metabolism , Membrane Proteins/chemistry , Membrane Proteins/genetics , Membrane Proteins/metabolism , Protein Engineering , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Structure-Activity Relationship , Humans

5.

Substitution Models of Protein Evolution with Selection on Enzymatic Activity.

Ferreiro, David; Khalil, Ruqaiya; Sousa, Sergio F; Arenas, Miguel.

Mol Biol Evol ; 41(2)2024 Feb 01.

Article in English | MEDLINE | ID: mdl-38314876

ABSTRACT

Substitution models of evolution are necessary for diverse evolutionary analyses including phylogenetic tree and ancestral sequence reconstructions. At the protein level, empirical substitution models are traditionally used due to their simplicity, but they ignore the variability of substitution patterns among protein sites. Next, in order to improve the realism of the modeling of protein evolution, a series of structurally constrained substitution models were presented, but still they usually ignore constraints on the protein activity. Here, we present a substitution model of protein evolution with selection on both protein structure and enzymatic activity, and that can be applied to phylogenetics. In particular, the model considers the binding affinity of the enzyme-substrate complex as well as structural constraints that include the flexibility of structural flaps, hydrogen bonds, amino acids backbone radius of gyration, and solvent-accessible surface area that are quantified through molecular dynamics simulations. We applied the model to the HIV-1 protease and evaluated it by phylogenetic likelihood in comparison with the best-fitting empirical substitution model and a structurally constrained substitution model that ignores the enzymatic activity. We found that accounting for selection on the protein activity improves the fitting of the modeled functional regions with the real observations, especially in data with high molecular identity, which recommends considering constraints on the protein activity in the development of substitution models of evolution.

Subjects

Amino Acids , Molecular Evolution , Phylogeny , Probability , Genetic Models , Amino Acid Substitution

6.

HiFun: homology independent protein function prediction by a novel protein-language self-attention model.

Wu, Jun; Qing, Haipeng; Ouyang, Jian; Zhou, Jiajia; Gao, Zihao; Mason, Christopher E; Liu, Zhichao; Shi, Tieliu.

Brief Bioinform ; 24(5)2023 Sep 20.

Article in English | MEDLINE | ID: mdl-37649370

ABSTRACT

Protein function prediction based on amino acid sequence alone is an extremely challenging but important task, especially in metagenomics/metatranscriptomics field, in which novel proteins have been uncovered exponentially from new microorganisms. Many of them are extremely low homology to known proteins and cannot be annotated with homology-based or information integrative methods. To overcome this problem, we proposed a Homology Independent protein Function annotation method (HiFun) based on a unified deep-learning model by reassembling the sequence as protein language. The robustness of HiFun was evaluated using the benchmark datasets and metrics in the CAFA3 challenge. To navigate the utility of HiFun, we annotated 2 212 663 unknown proteins and discovered novel motifs in the UHGP-50 catalog. We proved that HiFun can extract latent function related structure features which empowers it ability to achieve function annotation for non-homology proteins. HiFun can substantially improve newly proteins annotation and expand our understanding of microorganisms' adaptation in various ecological niches. Moreover, we provided a free and accessible webservice at http://www.unimd.org/HiFun, requiring only protein sequences as input, offering researchers an efficient and practical platform for predicting protein functions.

Subjects

Benchmarking , Language , Amino Acid Sequence , Metagenomics , Molecular Sequence Annotation

7.

IPPF-FE: an integrated peptide and protein function prediction framework based on fused features and ensemble models.

Yu, Han; Luo, Xiaozhou.

Brief Bioinform ; 24(1)2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36403184

ABSTRACT

The prediction of peptide and protein function is important for research and industrial applications, and many machine learning methods have been developed for this purpose. The existing models have encountered many challenges, including the lack of effective and comprehensive features and the limited applicability of each model. Here, we introduce an Integrated Peptide and Protein function prediction Framework based on Fused features and Ensemble models (IPPF-FE), which can accurately capture the relationship between features and labels. The results indicated that IPPF-FE outperformed existing state-of-the-art (SOTA) models on more than 8 different categories of peptide and protein tasks. In addition, t-distributed Stochastic Neighbour Embedding demonstrated the advantages of IPPF-FE. We anticipate that our method will become a versatile tool for peptide and protein prediction tasks and shed light on the future development of related models. The model is open source and available in the GitHub repository https://github.com/Luo-SynBioLab/IPPF-FE.

Subjects

International Planned Parenthood Federation , Proteins , Peptides , Machine Learning

8.

MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction.

Wang, Zhongyu; Deng, Zhaohong; Zhang, Wei; Lou, Qiongdan; Choi, Kup-Sze; Wei, Zhisheng; Wang, Lei; Wu, Jing.

Brief Bioinform ; 24(4)2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37258453

ABSTRACT

Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physiochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of multi-view deep neural model. Experimental results show that the MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.

Subjects

Computer Neural Networks , Proteins , Amino Acid Sequence , Software , Protein Sequence Analysis

9.

Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion.

Yuan, Qianmu; Xie, Junjie; Xie, Jiancong; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 24(3)2023 May 19.

Article in English | MEDLINE | ID: mdl-36964722

ABSTRACT

Protein function prediction is an essential task in bioinformatics which benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to fast and accurately predict protein functions from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are often unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further advanced by exploiting the homology information and accounting for the overlapping communities of proteins with related functions through the label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5, 27.3 and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well on non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source codes and trained models of SPROF-GO are available at https://github.com/biomed-AI/SPROF-GO. The SPROF-GO web server is freely available at http://bio-web1.nscc-gz.cn/app/sprof-go.

Subjects

Proteins , Software , Proteins/metabolism , Algorithms , Computational Biology/methods , Gene Ontology

10.

Large-scale predicting protein functions through heterogeneous feature fusion.

Zheng, Rongtao; Huang, Zhijian; Deng, Lei.

Brief Bioinform ; 24(4)2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37401369

ABSTRACT

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.

Subjects

Computational Biology , Proteins , Humans , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Computer Neural Networks , Factual Databases , Protein Databases

11.

HNetGO: protein function prediction via heterogeneous network transformer.

Zhang, Xiaoshuai; Guo, Huannan; Zhang, Fan; Wang, Xuan; Wu, Kaitao; Qiu, Shizheng; Liu, Bo; Wang, Yadong; Hu, Yang; Li, Junyi.

Brief Bioinform ; 24(6)2023 Sep 22.

Article in English | MEDLINE | ID: mdl-37861172

ABSTRACT

Protein function annotation is one of the most important research topics for revealing the essence of life at molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, the heavy reliance on complex feature engineering and model integration methods limits the development of existing methods. Besides, models based on deep learning only use labeled data in a certain dataset to extract sequence features, thus ignoring a large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which innovatively uses heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information and combines the pretraining model to extract the semantic features of the protein sequence. In addition, we design an attention-based graph neural network model, which can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on cellular component and molecular function branches.

Subjects

Computer Neural Networks , Protein Interaction Maps , Humans , Amino Acid Sequence , Gene Ontology , Molecular Sequence Annotation

12.

CLIP: accurate prediction of disordered linear interacting peptides from protein sequences using co-evolutionary information.

Peng, Zhenling; Li, Zixia; Meng, Qiaozhen; Zhao, Bi; Kurgan, Lukasz.

Brief Bioinform ; 24(1)2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36458437

ABSTRACT

One of key features of intrinsically disordered regions (IDRs) is facilitation of protein-protein and protein-nucleic acids interactions. These disordered binding regions include molecular recognition features (MoRFs), short linear motifs (SLiMs) and longer binding domains. Vast majority of current predictors of disordered binding regions target MoRFs, with a handful of methods that predict SLiMs and disordered protein-binding domains. A new and broader class of disordered binding regions, linear interacting peptides (LIPs), was introduced recently and applied in the MobiDB resource. LIPs are segments in protein sequences that undergo disorder-to-order transition upon binding to a protein or a nucleic acid, and they cover MoRFs, SLiMs and disordered protein-binding domains. Although current predictors of MoRFs and disordered protein-binding regions could be used to identify some LIPs, there are no dedicated sequence-based predictors of LIPs. To this end, we introduce CLIP, a new predictor of LIPs that utilizes robust logistic regression model to combine three complementary types of inputs: co-evolutionary information derived from multiple sequence alignments, physicochemical profiles and disorder predictions. Ablation analysis suggests that the co-evolutionary information is particularly useful for this prediction and that combining the three inputs provides substantial improvements when compared to using these inputs individually. Comparative empirical assessments using low-similarity test datasets reveal that CLIP secures area under receiver operating characteristic curve (AUC) of 0.8 and substantially improves over the results produced by the closest current tools that predict MoRFs and disordered protein-binding regions. The webserver of CLIP is freely available at http://biomine.cs.vcu.edu/servers/CLIP/ and the standalone code can be downloaded from http://yanglab.qd.sdu.edu.cn/download/CLIP/.

Subjects

Intrinsically Disordered Proteins , Intrinsically Disordered Proteins/chemistry , Computational Biology/methods , Amino Acid Sequence , Peptides/metabolism , Protein Domains , Protein Databases , Protein Binding

13.

Grain protein function prediction based on self-attention mechanism and bidirectional LSTM.

Liu, Jing; Tang, Xinghua; Guan, Xiao.

Brief Bioinform ; 24(1)2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36567619

ABSTRACT

With the development of genome sequencing technology, using computing technology to predict grain protein function has become one of the important tasks of bioinformatics. The protein data of four grains, soybean, maize, indica and japonica are selected in this experimental dataset. In this paper, a novel neural network algorithm Chemical-SA-BiLSTM is proposed for grain protein function prediction. The Chemical-SA-BiLSTM algorithm fuses the chemical properties of proteins on the basis of amino acid sequences, and combines the self-attention mechanism with the bidirectional Long Short-Term Memory network. The experimental results show that the Chemical-SA-BiLSTM algorithm is superior to other classical neural network algorithms, and can more accurately predict the protein function, which proves the effectiveness of the Chemical-SA-BiLSTM algorithm in the prediction of grain protein function. The source code of our method is available at https://github.com/HwaTong/Chemical-SA-BiLSTM.

Subjects

Grain Proteins , Computer Neural Networks , Algorithms , Proteins/chemistry , Software

14.

Poincaré maps for visualization of large protein families.

Susmelj, Anna Klimovskaia; Ren, Yani; Vander Meersche, Yann; Gelly, Jean-Christophe; Galochkina, Tatiana.

Brief Bioinform ; 24(3)2023 May 19.

Article in English | MEDLINE | ID: mdl-36946414

ABSTRACT

In the era of constantly increasing amounts of the available protein data, a relevant and interpretable visualization becomes crucial, especially for tasks requiring human expertise. Poincaré disk projection has previously demonstrated its important efficiency for visualization of biological data such as single-cell RNAseq data. Here, we develop a new method PoincaréMSA for visual representation of complex relationships between protein sequences based on Poincaré maps embedding. We demonstrate its efficiency and potential for visualization of protein family topology as well as evolutionary and functional annotation of uncharacterized sequences. PoincaréMSA is implemented in open source Python code with available interactive Google Colab notebooks as described at https://www.dsimb.inserm.fr/POINCARE_MSA.

Subjects

Proteins , Software , Humans , Amino Acid Sequence , Biological Evolution
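
Record 14 builds on projection into the Poincaré disk. For readers unfamiliar with that geometry, the following minimal sketch computes the standard hyperbolic distance between two points inside the unit disk. It only illustrates the textbook distance formula; the function name `poincare_distance` is hypothetical, and nothing here is claimed to reproduce how PoincaréMSA computes or optimizes its embedding.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_norm_u = sum(c * c for c in u)
    sq_norm_v = sum(c * c for c in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit disk")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    # The same Euclidean step costs more hyperbolic distance near the boundary.
    print(poincare_distance((0.0, 0.0), (0.1, 0.0)))
    print(poincare_distance((0.8, 0.0), (0.9, 0.0)))
```

Distances grow rapidly toward the edge of the disk, which is why tree-like hierarchies tend to fit into such a projection with little distortion.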

15.

UniProt and Mass Spectrometry-Based Proteomics-A 2-Way Working Relationship.

Bowler-Barnett, E H; Fan, J; Luo, J; Magrane, M; Martin, M J; Orchard, S.

Mol Cell Proteomics ; 22(8): 100591, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37301379

ABSTRACT

The human proteome comprises of all of the proteins produced by the sequences translated from the human genome with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.

Subjects

Proteome , Proteomics , Humans , Proteome/genetics , Protein Databases , Amino Acid Sequence , Peptides

16.

Protein function prediction through multi-view multi-label latent tensor reconstruction.

Armah-Sekum, Robert Ebo; Szedmak, Sandor; Rousu, Juho.

BMC Bioinformatics ; 25(1): 174, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38698340

ABSTRACT

BACKGROUND: In last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS: We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION: The code and data used for training GO-LTR is available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction .

Subjects

Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Protein Databases , Algorithms

17.

KEGG orthology prediction of bacterial proteins using natural language processing.

Chen, Jing; Wu, Haoyu; Wang, Ning.

BMC Bioinformatics ; 25(1): 146, 2024 Apr 11.

Article in English | MEDLINE | ID: mdl-38600441

ABSTRACT

BACKGROUND: The advent of high-throughput technologies has led to an exponential increase in uncharacterized bacterial protein sequences, surpassing the capacity of manual curation. A large number of bacterial protein sequences remain unannotated by Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology, making it necessary to use auto annotation tools. These tools are now indispensable in the biological research landscape, bridging the gap between the vastness of unannotated sequences and meaningful biological insights. RESULTS: In this work, we propose a novel pipeline for KEGG orthology annotation of bacterial protein sequences that uses natural language processing and deep learning. To assess the effectiveness of our pipeline, we conducted evaluations using the genomes of two randomly selected species from the KEGG database. In our evaluation, we obtain competitive results on precision, recall, and F1 score, with values of 0.948, 0.947, and 0.947, respectively. CONCLUSIONS: Our experimental results suggest that our pipeline demonstrates performance comparable to traditional methods and excels in identifying distant relatives with low sequence identity. This demonstrates the potential of our pipeline to significantly improve the accuracy and comprehensiveness of KEGG orthology annotation, thereby advancing our understanding of functional relationships within biological systems.

Subjects

Bacterial Proteins, Natural Language Processing, Genome, Molecular Sequence Annotation, Amino Acid Sequence

18.

Identification and classification of papain-like cysteine proteinases.

Ozhelvaci, Fatih; Steczkiewicz, Kamil.

J Biol Chem ; 299(6): 104801, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37164157

ABSTRACT

Papain-like cysteine peptidases form a large and highly diverse superfamily of proteins involved in many important biological functions, such as protein turnover, deubiquitination, tissue remodeling, blood clotting, virulence, defense, and cell wall remodeling. The high sequence and structure diversity observed within these proteins hinders their comprehensive classification as well as the identification of new representatives. Moreover, in general protein databases, many families already classified as papain-like lack details regarding their mechanism of action or biological function. Here, we use transitive remote homology searches and 3D modeling to newly classify 21 families into the papain-like cysteine peptidase superfamily. We attempt to predict their biological function and provide structural characterization of 89 protein clusters, defined based on sequence similarity, altogether spanning 106 papain-like families. Moreover, we systematically discuss the observed diversity in sequences, structures, and catalytic sites. Finally, we expand the list of human papain-related proteins by seven representatives, including dopamine receptor-interacting protein 1 as a potential deubiquitinase, and the centriole duplication regulator CEP76 as retaining a catalytically active peptidase-like domain. The presented results not only provide structure-based rationales for already existing peptidase databases but may also inspire further experimental research focused on peptidase-related biological processes.

Subjects

Cysteine Proteases, Papain, Humans, Catalytic Domain, Centrioles/metabolism, Cysteine Proteases/chemistry, Cysteine Proteases/classification, Cysteine Proteases/metabolism, Deubiquitinating Enzymes/metabolism, Molecular Models, Papain/chemistry, Papain/classification, Protein Databases

19.

iNucRes-ASSH: Identifying nucleic acid-binding residues in proteins by using self-attention-based structure-sequence hybrid neural network.

Zhang, Jun; Chen, Qingcai; Liu, Bin.

Proteins ; 92(3): 395-410, 2024 Mar.

Article in English | MEDLINE | ID: mdl-37915276

ABSTRACT

Interaction between proteins and nucleic acids is crucial to many cellular activities. Accurately detecting nucleic acid-binding residues (NABRs) in proteins can help researchers better understand the interaction mechanism between proteins and nucleic acids. Structure-based methods can generally make more accurate predictions than sequence-based methods. However, the existing structure-based methods are sensitive to protein conformational changes, which limits their generalizability. More effective and robust approaches should be further explored. In this study, we propose iNucRes-ASSH to identify nucleic acid-binding residues with a self-attention-based structure-sequence hybrid neural network. It improves the generalizability and robustness of NABR prediction at two levels: residue representation and prediction model. Experimental results show that iNucRes-ASSH can predict nucleic acid-binding residues even when experimentally validated structures are unavailable, and that it outperforms five competing methods on a recent benchmark dataset and a widely used test dataset.

Subjects

Algorithms, Nucleic Acids, Proteins/chemistry, Computer Neural Networks

20.

Transient excited states of the metamorphic protein Mad2 and their implications for function.

Jain, Shefali; Sekhar, Ashok.

Proteins ; 2024 Jan 14.

Article in English | MEDLINE | ID: mdl-38221646

ABSTRACT

The spindle checkpoint complex is a key surveillance mechanism in cell division that prevents premature separation of sister chromatids. Mad2 is an integral component of this spindle checkpoint complex that recognizes cognate substrates such as Mad1 and Cdc20 in its closed (C-Mad2) conformation by fastening a "seatbelt" around short peptide regions that bind to the substrate recognition site. Mad2 is also a metamorphic protein that adopts not only the fold found in C-Mad2, but also a structurally distinct open conformation (O-Mad2) which is incapable of binding substrates. Here, we show using chemical exchange saturation transfer (CEST) and relaxation dispersion (CPMG) NMR experiments that Mad2 transiently populates three other higher free energy states with millisecond lifetimes, two in equilibrium with C-Mad2 (E1 and E2) and one with O-Mad2 (E3). E1 is a mimic of substrate-bound C-Mad2 in which the N-terminus of one C-Mad2 molecule inserts into the seatbelt region of a second molecule of C-Mad2, providing a potential pathway for autoinhibition of C-Mad2. E2 is the "unbuckled" conformation of C-Mad2 that facilitates the triage of molecules along competing fold-switching and substrate binding pathways. The E3 conformation that coexists with O-Mad2 shows fluctuations at a hydrophobic lock that is required for stabilizing the O-Mad2 fold and we hypothesize that E3 represents an early intermediate on-pathway towards conversion to C-Mad2. Collectively, the NMR data highlight the rugged free energy landscape of Mad2 with multiple low-lying intermediates that interlink substrate-binding and fold-switching, and also emphasize the role of molecular dynamics in its function.
