Search | VHL Regional Portal

1.

Identification of domains in Plasmodium falciparum proteins of unknown function using DALI search on AlphaFold predictions.

Behrens, Hannah Michaela; Spielmann, Tobias.

Sci Rep ; 14(1): 10527, 2024 05 08.

Article in English | MEDLINE | ID: mdl-38719885

ABSTRACT

Plasmodium falciparum, the causative agent of malaria, poses a significant global health challenge, yet much of its biology remains elusive. A third of the genes in the P. falciparum genome lack annotations regarding their function, impeding our understanding of the parasite's biology. In this study, we employ structure predictions and the DALI search algorithm to analyse proteins encoded by uncharacterized genes in the reference strain 3D7 of P. falciparum. By comparing AlphaFold predictions to experimentally determined protein structures in the Protein Data Bank, we found similarities to known domains in 353 proteins of unknown function, shedding light on their potential functions. The lowest-scoring 5% of similarities were additionally validated using the size-independent TM-align algorithm, confirming the detected similarities in 88% of the cases. Notably, in over 70 P. falciparum proteins the presence of domains resembling heptatricopeptide repeats, which are typically involved in RNA binding and processing, was detected. This suggests this family, which is important in transcription in mitochondria and apicoplasts, is much larger in Plasmodium parasites than previously thought. The results of this domain search provide a resource to the malaria research community that is expected to inform and enable experimental studies.

Subject(s)

Plasmodium falciparum , Protozoan Proteins , Plasmodium falciparum/genetics , Plasmodium falciparum/metabolism , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Protozoan Proteins/chemistry , Algorithms , Protein Domains , Databases, Protein , Models, Molecular

2.

Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.

Peng, Hui; Wang, He; Kong, Weijia; Li, Jinyan; Goh, Wilson Wen Bin.

Nat Commun ; 15(1): 3922, 2024 May 09.

Article in English | MEDLINE | ID: mdl-38724498

ABSTRACT

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate that optimality has conserved properties. Via machine learning, we confirm that optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce ensemble inference to integrate results from individual top-performing workflows, expanding differential proteome coverage and resolving inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.

Subject(s)

Proteomics , Proteomics/methods , Workflow , Machine Learning , Proteome/metabolism , Humans , Algorithms , Databases, Protein
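
The abstract of entry 2 reports F1, Matthews correlation coefficient (MCC) and G-mean. As a minimal sketch of how these are commonly computed from confusion-matrix counts, the following may help; the function name is hypothetical, and G-mean is assumed here to be the geometric mean of sensitivity and specificity, which may differ from the authors' exact definition.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four cells of the confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}
```

For a spike-in benchmark, tp would count truly changed proteins that a workflow flags as differentially expressed, and fp the unchanged background proteins it flags by mistake.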

3.

TransPTM: a transformer-based model for non-histone acetylation site prediction.

Meng, Lingkuan; Chen, Xingjian; Cheng, Ke; Chen, Nanjun; Zheng, Zetian; Wang, Fuzhou; Sun, Hongyan; Wong, Ka-Chun.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. There are 886 positive samples and 4707 negative samples in total for each sequence length. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM has competitive performance for non-histone acetylation site prediction compared with three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.

Subject(s)

Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism

4.

Fragment ion intensity prediction improves the identification rate of non-tryptic peptides in timsTOF.

Adams, Charlotte; Gabriel, Wassim; Laukens, Kris; Picciani, Mario; Wilhelm, Mathias; Bittremieux, Wout; Boonen, Kurt.

Nat Commun ; 15(1): 3956, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38730277

ABSTRACT

Immunopeptidomics is crucial for immunotherapy and vaccine development. Because the generation of immunopeptides from their parent proteins does not adhere to clear-cut rules, rather than being able to use known digestion patterns, every possible protein subsequence within human leukocyte antigen (HLA) class-specific length restrictions needs to be considered during sequence database searching. This leads to an inflation of the search space and results in lower spectrum annotation rates. Peptide-spectrum match (PSM) rescoring is a powerful enhancement of standard searching that boosts the spectrum annotation performance. We analyze 302,105 unique synthesized non-tryptic peptides from the ProteomeTools project on a timsTOF-Pro to generate a ground-truth dataset containing 93,227 MS/MS spectra of 74,847 unique peptides, which is used to fine-tune the deep learning-based fragment ion intensity prediction model Prosit. We demonstrate up to 3-fold improvement in the identification of immunopeptides, as well as increased detection of immunopeptides from low input samples.

Subject(s)

Deep Learning , Peptides , Tandem Mass Spectrometry , Humans , Peptides/chemistry , Peptides/immunology , Tandem Mass Spectrometry/methods , Databases, Protein , Proteomics/methods , HLA Antigens/immunology , HLA Antigens/genetics , Software , Ions

5.

DeepSub: Utilizing Deep Learning for Predicting the Number of Subunits in Homo-Oligomeric Protein Complexes.

Deng, Rui; Wu, Ke; Lin, Jiawei; Wang, Dehang; Huang, Yuanyuan; Li, Yang; Shi, Zhenkun; Zhang, Zihan; Wang, Zhiwen; Mao, Zhitao; Liao, Xiaoping; Ma, Hongwu.

Int J Mol Sci ; 25(9)2024 Apr 28.

Article in English | MEDLINE | ID: mdl-38732022

ABSTRACT

The molecular weight (MW) of an enzyme is a critical parameter in enzyme-constrained models (ecModels). It is determined by two factors: the presence of subunits and the abundance of each subunit. Although the number of subunits (NS) can potentially be obtained from UniProt, this information is not readily available for most proteins. In this study, we addressed this gap by extracting and curating subunit information from the UniProt database to establish a robust benchmark dataset. Subsequently, we propose a novel model named DeepSub, which leverages the protein language model and Bi-directional Gated Recurrent Unit (GRU), to predict NS in homo-oligomers solely based on protein sequences. DeepSub demonstrates remarkable accuracy, achieving an accuracy rate as high as 0.967, surpassing the performance of QUEEN. To validate the effectiveness of DeepSub, we performed predictions for protein homo-oligomers that have been reported in the literature but are not documented in the UniProt database. Examples include homoserine dehydrogenase from Corynebacterium glutamicum, Matrilin-4 from Mus musculus and Homo sapiens, and the Multimerins protein family from M. musculus and H. sapiens. The predicted results align closely with the reported findings in the literature, underscoring the reliability and utility of DeepSub.

Subject(s)

Databases, Protein , Deep Learning , Protein Subunits , Protein Subunits/chemistry , Protein Subunits/metabolism , Animals , Humans , Protein Multimerization , Mice , Computational Biology/methods

6.

DIRMC: a database of immunotherapy-related molecular characteristics.

Liu, Yue; Zhou, Yuhuan; Hu, Xiumei; Le-Ge, Wuri; Wang, Haoyan; Jiang, Tao; Li, Junyi; Hu, Yang; Wang, Yadong.

Database (Oxford) ; 20242024 May 06.

Article in English | MEDLINE | ID: mdl-38713861

ABSTRACT

Cancer immunotherapy has brought about a revolutionary breakthrough in the field of cancer treatment. Immunotherapy has changed the treatment landscape for a variety of solid and hematologic malignancies. To assist researchers in efficiently uncovering valuable information related to cancer immunotherapy, we present a manually curated comprehensive database called DIRMC, which focuses on molecular features involved in cancer immunotherapy. All the content was collected manually from published literature, authoritative clinical trial data submitted by clinicians, some databases for drug target prediction such as DrugBank, and some experimentally confirmed high-throughput data sets for the characterization of immune-related molecular interactions in cancer, such as a curated database of T-cell receptor sequences with known antigen specificity (VDJdb), a pathology-associated TCR database (McPAS-TCR) and others. By constructing a fully connected functional network, ranging from cancer-related gene mutations to target genes to translated target proteins to protein regions or sites that may specifically affect protein function, we aim to comprehensively characterize molecular features related to cancer immunotherapy. We have developed scoring criteria to assess the reliability of each MHC-peptide-T-cell receptor (TCR) interaction item to provide a reference for users. The database provides a user-friendly interface to browse and retrieve data by genes, target proteins, diseases and more. DIRMC also provides a download and submission page for researchers to access data of interest for further investigation or submit new interactions related to cancer immunotherapy targets. Furthermore, DIRMC provides a graphical interface to help users predict the binding affinity between their own peptide of interest and MHC or TCR.
This database will provide researchers with a one-stop resource to understand cancer immunotherapy-related targets as well as data on MHC-peptide-TCR interactions. It aims to offer reliable molecular characteristics support for both the analysis of the current status of cancer immunotherapy and the development of new immunotherapy. DIRMC is available at http://www.dirmc.tech/.

Subject(s)

Immunotherapy , Neoplasms , Immunotherapy/methods , Humans , Neoplasms/immunology , Neoplasms/genetics , Neoplasms/therapy , Receptors, Antigen, T-Cell/immunology , Receptors, Antigen, T-Cell/genetics , Databases, Protein , User-Computer Interface

7.

Scoring alignments by embedding vector similarity.

Ashrafzadeh, Sepehr; Golding, G Brian; Ilie, Silvana; Ilie, Lucian.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.

Subject(s)

Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
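
The abstract of entry 7 defines the E-score between two residues as the cosine similarity of their embedding vectors. A minimal sketch of that definition follows; the function names and the toy two-dimensional vectors are hypothetical, and real embeddings would come from a protein language model such as ProtT5 rather than being typed by hand.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def score_matrix(embeddings_a, embeddings_b):
    """Pairwise E-scores: rows follow sequence A, columns follow sequence B.

    Such a matrix could stand in for BLOSUM lookups inside a standard
    dynamic-programming aligner (the aligner itself is not shown here).
    """
    return [[e_score(u, v) for v in embeddings_b] for u in embeddings_a]
```

Unlike a fixed substitution matrix, the score for the same pair of amino acids can differ between positions, because each residue's embedding depends on its sequence context.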

8.

ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models.

Zhao, Haiqing; Petrey, Donald; Murray, Diana; Honig, Barry.

Proc Natl Acad Sci U S A ; 121(21): e2400260121, 2024 May 21.

Article in English | MEDLINE | ID: mdl-38743624

ABSTRACT

We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.

Subject(s)

Proteome , Proteome/metabolism , Humans , Protein Interaction Mapping/methods , Models, Molecular , Escherichia coli/metabolism , Escherichia coli/genetics , Databases, Protein , Protein Binding , Escherichia coli Proteins/metabolism , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/genetics , Proteins/chemistry , Proteins/metabolism , Sequence Alignment
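
Entry 8 describes a Z-score obtained by comparing a metric for interface residues with the same metric for randomly chosen residues. The sketch below shows that general idea under simplifying assumptions: it uses one precomputed score per residue (for example a conservation value) rather than the paired coevolution metrics of the actual method, and every name and parameter is hypothetical.

```python
import random
import statistics


def interface_z_score(scores, interface, n_random=1000, seed=0):
    """Z-score of the mean interface score against random residue sets.

    scores    : dict mapping residue id -> per-residue score (assumed input)
    interface : residue ids that are in contact in the structural model
    """
    rng = random.Random(seed)
    residues = list(scores)
    k = len(interface)

    observed = statistics.fmean(scores[r] for r in interface)

    # Background: mean score of k residues drawn at random, repeated.
    background = [
        statistics.fmean(scores[r] for r in rng.sample(residues, k))
        for _ in range(n_random)
    ]
    mu = statistics.fmean(background)
    sigma = statistics.stdev(background)
    if sigma == 0.0:
        return 0.0  # no variation in the background, so no signal
    return (observed - mu) / sigma
```

A large positive value indicates that the residues placed at the interface by the model score higher than a random selection of the same size would.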

9.

Analysis of AlphaMissense data in different protein groups and structural context.

Tordai, Hedvig; Torres, Odalys; Csepi, Máté; Padányi, Rita; Lukács, Gergely L; Hegedus, Tamás.

Sci Data ; 11(1): 495, 2024 May 14.

Article in English | MEDLINE | ID: mdl-38744964

ABSTRACT

Single amino acid substitutions can profoundly affect protein folding, dynamics, and function. The ability to discern between benign and pathogenic substitutions is pivotal for therapeutic interventions and research directions. Given the limitations in experimental examination of these variants, AlphaMissense has emerged as a promising predictor of the pathogenicity of missense variants. Since heterogeneous performance on different types of proteins can be expected, we assessed the efficacy of AlphaMissense across several protein groups (e.g. soluble, transmembrane, and mitochondrial proteins) and regions (e.g. intramembrane, membrane interacting, and high confidence AlphaFold segments) using ClinVar data for validation. Our comprehensive evaluation showed that AlphaMissense delivers outstanding performance, with MCC scores predominantly between 0.6 and 0.74. We observed low performance on disordered datasets and ClinVar data related to the CFTR ABC protein. However, a superior performance was shown when benchmarked against the high quality CFTR2 database. Our results with CFTR emphasize AlphaMissense's potential in pinpointing functional hot spots, with its performance likely surpassing benchmarks calculated from ClinVar and ProteinGym datasets.

Subject(s)

Cystic Fibrosis Transmembrane Conductance Regulator , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Cystic Fibrosis Transmembrane Conductance Regulator/chemistry , Proteins/chemistry , Proteins/genetics , Protein Folding , Humans , Databases, Protein , Amino Acid Substitution , Mutation, Missense

10.

Protein features fusion using attributed network embedding for predicting protein-protein interaction.

Cao, Mei-Yuan; Zainudin, Suhaila; Daud, Kauthar Mohd.

BMC Genomics ; 25(1): 466, 2024 May 13.

Article in English | MEDLINE | ID: mdl-38741045

ABSTRACT

BACKGROUND: Protein-protein interactions (PPIs) hold significant importance in biology, with precise PPI prediction as a pivotal factor in comprehending cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. METHODS: We introduce a new node representation method based on initial information fusion, called FFANE, which amalgamates PPI networks and protein sequence data to enhance the precision of PPI prediction. A Gaussian kernel similarity matrix is initially established by leveraging protein structural resemblances. Concurrently, protein sequence similarities are gauged using the Levenshtein distance, enabling the capture of diverse protein attributes. Subsequently, to construct an initial information matrix, these two feature matrices are merged by employing weighted fusion to achieve an organic amalgamation of structural and sequence details. To gain a more profound understanding of the amalgamated features, a Stacked Autoencoder (SAE) is employed for encoding learning, thereby yielding more representative features. Ultimately, classification models are trained to predict PPIs by using the well-learned fusion feature. RESULTS: In 5-fold cross-validation experiments with an SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% on the Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. CONCLUSION: Experimental findings across various authentic datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.

Subject(s)

Computational Biology , Protein Interaction Mapping , Protein Interaction Mapping/methods , Computational Biology/methods , Algorithms , Helicobacter pylori/metabolism , Helicobacter pylori/genetics , Support Vector Machine , Proteins/metabolism , Proteins/chemistry , Humans , Protein Interaction Maps , Databases, Protein
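
The METHODS part of entry 10 combines a Gaussian kernel similarity, a Levenshtein-based sequence similarity and a weighted fusion of the two matrices. The following is a rough sketch of those three building blocks only; the normalisation of the edit distance, the kernel width `gamma` and the weight `alpha` are assumptions chosen for illustration, not values from the paper, and the autoencoder and classifier stages are not shown.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def sequence_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) similarity between two feature vectors."""
    squared = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * squared)


def weighted_fusion(m1, m2, alpha=0.5):
    """Element-wise alpha * m1 + (1 - alpha) * m2 for equally sized matrices."""
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(m1, m2)]
```

In a pipeline of this kind, each row of the fused matrix would serve as the initial feature vector of one protein before further encoding.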

11.

Harnessing the 3D-Beacons Network: A Comprehensive Guide to Accessing and Displaying Protein Structure Data.

Magaña, Paulyna; Nair, Sreenath; Varadi, Mihaly; Velankar, Sameer.

Curr Protoc ; 4(5): e1047, 2024 May.

Article in English | MEDLINE | ID: mdl-38720559

ABSTRACT

Recent advancements in protein structure determination and especially in protein structure prediction techniques have led to the availability of vast amounts of macromolecular structures. However, the accessibility and integration of these structures into scientific workflows are hindered by the lack of standardization among publicly available data resources. To address this issue, we introduced the 3D-Beacons Network, a unified platform that aims to establish a standardized framework for accessing and displaying protein structure data. In this article, we highlight the importance of standardized approaches for accessing protein structure data and showcase the capabilities of 3D-Beacons. We describe four protocols for finding and accessing macromolecular structures from various specialist data resources via 3D-Beacons. First, we describe three scenarios for programmatically accessing and retrieving data using the 3D-Beacons API. Next, we show how to perform sequence-based searches to find structures from model providers. Then, we demonstrate how to search for structures and fetch them directly into a workflow using JalView. Finally, we outline the process of facilitating access to data from providers interested in contributing their structures to the 3D-Beacons Network. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Programmatic access to the 3D-Beacons API. Basic Protocol 2: Sequence-based search using the 3D-Beacons API. Basic Protocol 3: Accessing macromolecules from 3D-Beacons with JalView. Basic Protocol 4: Enhancing data accessibility through 3D-Beacons.

Subject(s)

Protein Conformation , Proteins , Proteins/chemistry , Databases, Protein , Software

12.

REMEMProt: a resource of membrane-enriched proteome profiles, their disease associations, and biomarker status.

Aravind, Anjana; Nandakumar, Revathy; Ahmed, Mukhtar; Nisar, Mahammad; Palollathil, Akhina; Kanichery, Anagha; Sreelan, Sourav; Sinan, Kp Munavvar; Balaya, Rex Devasahayam Arokia; Vijayakumar, Manavalan; Prasad, Thottethodi Subrahmanya Keshava; Raju, Rajesh.

Life Sci Alliance ; 7(7)2024 Jul.

Article in English | MEDLINE | ID: mdl-38719747

ABSTRACT

The differential expression of plasma membrane proteins is integrally analyzed for their diagnosis, prognosis, and therapeutic applications in diverse clinical manifestations. Necessarily, distinct membrane protein enrichment methods and mass spectrometry platforms are employed for their global and relative quantitation. In the first effort of its kind, we compiled membrane-associated proteomes in human and mouse systems into a database named Resource of Experimental Membrane-Enriched Mass spectrometry-derived Proteome (REMEMProt). It currently hosts 14,626 proteins (9,507 proteins in Homo sapiens; 5,119 proteins in Mus musculus) with information on their membrane-protein enrichment methods, experimental/physiological context of detection in cells or tissues, transmembrane domain analysis, and their current attribution as biomarkers. Based on these annotations and the transmembrane domain analysis in proteins or their binary/complex protein-protein interactors, REMEMProt facilitates the assessment of the plasma membrane localization potential of proteins through batch query. A cross-study enrichment analysis platform is enabled in REMEMProt for comparative analysis of proteomes using novel/modified membrane enrichment methods and evaluation of methods for targeted enrichment of membrane proteins. REMEMProt data are made freely accessible to explore and download at https://rememprot.ciods.in/.

Subject(s)

Biomarkers , Databases, Protein , Membrane Proteins , Proteome , Proteomics , Humans , Proteome/metabolism , Membrane Proteins/metabolism , Biomarkers/metabolism , Animals , Mice , Proteomics/methods , Cell Membrane/metabolism , Mass Spectrometry/methods

13.

DeepSS2GO: protein function prediction from secondary structure.

Song, Fu V; Su, Jiaqi; Huang, Sixing; Zhang, Neng; Li, Kaiyue; Ni, Ming; Liao, Maofu.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein's three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.

Subject(s)

Algorithms , Computational Biology , Neural Networks, Computer , Protein Structure, Secondary , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Databases, Protein , Gene Ontology , Sequence Analysis, Protein/methods , Software

14.

Evaluating large language models for annotating proteins.

Vitale, Rosario; Bugnon, Leandro A; Fenoy, Emilio Luis; Milone, Diego H; Stegmayer, Georgina.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.

Subject(s)

Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning

15.

GSScore: a novel Graphormer-based shell-like scoring method for protein-ligand docking.

Guo, Linyuan; Wang, Jianxin.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38706316

ABSTRACT

Protein-ligand interactions (PLIs) are essential for cellular activities and drug discovery. Because experimental methods are complex and costly, there is a great demand for computational approaches to recognize PLI patterns, such as protein-ligand docking. In recent years, more and more models based on machine learning have been developed to directly predict the root mean square deviation (RMSD) of a ligand docking pose with reference to its native binding pose. However, new scoring methods are urgently needed for more accurate RMSD prediction. We present a new deep learning-based scoring method for RMSD prediction of protein-ligand docking poses based on a Graphormer method and Shell-like graph architecture, named GSScore. To recognize near-native conformations from a set of poses, GSScore takes atoms as nodes and then establishes the docking interface of protein-ligand into multiple bipartite graphs within different shell ranges. Benefiting from the Graphormer and Shell-like graph architecture, GSScore can effectively capture the subtle differences between energetically favorable near-native conformations and unfavorable non-native poses without extra information. GSScore was extensively evaluated on diverse test sets including a subset of PDBBind version 2019, CASF2016 as well as DUD-E, and obtained significant improvements over existing methods in terms of RMSE, R (Pearson correlation coefficient), Spearman correlation coefficient and Docking power.

Subject(s)

Molecular Docking Simulation , Proteins , Ligands , Proteins/chemistry , Proteins/metabolism , Protein Binding , Software , Algorithms , Computational Biology/methods , Protein Conformation , Databases, Protein , Deep Learning
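
Entry 15 concerns predicting the RMSD of a docked ligand pose relative to its native pose. For orientation, the quantity being predicted can be sketched as below, assuming both poses list the same atoms in the same order and share the receptor's coordinate frame (so no superposition is applied). The function names are hypothetical, and the 2.0 Å default cutoff is a commonly used convention rather than a value taken from the abstract.

```python
import math


def pose_rmsd(pose, reference):
    """Root mean square deviation between two lists of (x, y, z) atoms."""
    if not pose or len(pose) != len(reference):
        raise ValueError("poses must be non-empty and have matching atoms")
    squared_sum = 0.0
    for atom, ref_atom in zip(pose, reference):
        squared_sum += sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
    return math.sqrt(squared_sum / len(pose))


def is_near_native(pose, reference, cutoff=2.0):
    """Assumed convention: a pose within `cutoff` angstroms counts as near-native."""
    return pose_rmsd(pose, reference) <= cutoff
```

A scoring method of the kind described would be trained to estimate this value from the protein-ligand interface alone, since the native pose is unknown at prediction time.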

16.

Protein function prediction through multi-view multi-label latent tensor reconstruction.

Armah-Sekum, Robert Ebo; Szedmak, Sandor; Rousu, Juho.

BMC Bioinformatics ; 25(1): 174, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38698340

ABSTRACT

BACKGROUND: In the last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS: We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION: The code and data used for training GO-LTR are available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction.

Subject(s)

Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Databases, Protein , Algorithms

17.

Freeprotmap: waiting-free prediction method for protein distance map.

Huang, Jiajian; Li, Jinpeng; Chen, Qinchang; Wang, Xia; Chen, Guangyong; Tang, Jin.

BMC Bioinformatics ; 25(1): 176, 2024 May 04.

Article in English | MEDLINE | ID: mdl-38704533

ABSTRACT

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful designs. Firstly, it is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and lightweight design. Besides, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for many protein research efforts. It is worth mentioning that FreeProtMap could be used to scan all proteins discovered each year to find structurally similar proteins in a short time, because structure similarity calculation based on distance maps is much less time-consuming than algorithms based on 3D structures.

Subject(s)

Proteins , Proteins/chemistry , Computational Biology/methods , Databases, Protein , Protein Conformation , Algorithms , Sequence Analysis, Protein/methods , Neural Networks, Computer

18.

AlphaFun: Structural-Alignment-Based Proteome Annotation Reveals why the Functionally Unknown Proteins (uPE1) Are So Understudied.

Pan, Hengxin; Wu, Zhenqi; Liu, Wanting; Zhang, Gong.

J Proteome Res ; 23(5): 1593-1602, 2024 May 03.

Article in English | MEDLINE | ID: mdl-38626392

ABSTRACT

With the rapid expansion of sequencing of genomes, the functional annotation of proteins becomes a bottleneck in understanding proteomes. The Chromosome-centric Human Proteome Project (C-HPP) aims to identify all proteins encoded by the human genome and find functional annotations for them. However, until now there are still 1137 identified human proteins without functional annotation, called uPE1 proteins. Sequence alignment was insufficient to predict their functions, and the crystal structures of most proteins were unavailable. In this study, we demonstrated a new functional annotation strategy, AlphaFun, based on structural alignment using deep-learning-predicted protein structures. Using this strategy, we functionally annotated 99% of the human proteome, including the uPE1 proteins and missing proteins, which have not been identified yet. The accuracy of the functional annotations was validated using the known-function proteins. The uPE1 proteins shared similar functions to the known-function PE1 proteins and tend to be expressed only in very limited tissues. They are evolutionarily young genes and thus should conduct functions only in specific tissues and conditions, limiting their occurrence in commonly studied biological models. Such functional annotations provide hints for functional investigations on the uPE1 proteins. This proteome-wide-scale functional annotation strategy is also applicable to any other species.

Subject(s)

Molecular Sequence Annotation , Proteome , Humans , Proteome/genetics , Proteome/metabolism , Proteome/analysis , Proteome/chemistry , Deep Learning , Sequence Alignment , Genome, Human , Proteomics/methods , Databases, Protein

19.

Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model.

Ke, Jinsong; Zhao, Jianmei; Li, Hongfei; Yuan, Lei; Dong, Guanghui; Wang, Guohua.

Comput Biol Med ; 174: 108330, 2024 May.

Article in English | MEDLINE | ID: mdl-38588617

ABSTRACT

N-terminal acetylation is one of the most common and important post-translational modifications (PTM) of eukaryotic proteins. PTM plays a crucial role in various cellular processes and disease pathogenesis. Thus, the accurate identification of N-terminal acetylation modifications is important to gain insight into cellular processes and other possible functional mechanisms. Although some algorithmic models have been proposed, most have been developed based on traditional machine learning algorithms and small training datasets. Their practical applications are limited. Nevertheless, deep learning algorithmic models are better at handling high-throughput and complex data. In this study, DeepCBA, a model based on the hybrid framework of convolutional neural network (CNN), bidirectional long short-term memory network (BiLSTM), and attention mechanism deep learning, was constructed to detect the N-terminal acetylation sites. DeepCBA was built as follows: First, a benchmark dataset was generated by selecting low-redundant protein sequences from the UniProt database and further reducing the redundancy of the protein sequences using the CD-HIT tool. Subsequently, based on the skip-gram model in the word2vec algorithm, tripeptide word vector features were generated on the benchmark dataset. Finally, the CNN, BiLSTM, and attention mechanism were combined, and the tripeptide word vector features were fed into the stacked model for multiple rounds of training. The model performed excellently on an independent test dataset, with accuracy and area under the curve of 80.51% and 87.36%, respectively. Altogether, DeepCBA achieved superior performance compared with the baseline model, and significantly outperformed most existing predictors. Additionally, our model can be used to identify disease loci and drug targets.

Subject(s)

Deep Learning , Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Proteins/chemistry , Proteins/metabolism , Databases, Protein , Humans , Algorithms

20.

Improvements in viral gene annotation using large language models and soft alignments.

Harrigan, William L; Ferrell, Barbra D; Wommack, K Eric; Polson, Shawn W; Schreiber, Zachary D; Belcaid, Mahdi.

BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.

Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.

Subject(s)

Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence
