1 - 20 of 17,767
1.
Life Sci Alliance ; 7(7)2024 Jul.
Article En | MEDLINE | ID: mdl-38719747

The differential expression of plasma membrane proteins is integrally analyzed for their diagnosis, prognosis, and therapeutic applications in diverse clinical manifestations. Necessarily, distinct membrane protein enrichment methods and mass spectrometry platforms are employed for their global and relative quantitation. In a first effort of its kind, we compiled membrane-associated proteomes in human and mouse systems into a database named Resource of Experimental Membrane-Enriched Mass spectrometry-derived Proteome (REMEMProt). It currently hosts 14,626 proteins (9,507 proteins in Homo sapiens; 5,119 proteins in Mus musculus) with information on their membrane-protein enrichment methods, experimental/physiological context of detection in cells or tissues, transmembrane domain analysis, and their current attribution as biomarkers. Based on these annotations and the transmembrane domain analysis in proteins or their binary/complex protein-protein interactors, REMEMProt facilitates the assessment of the plasma membrane localization potential of proteins through batch query. A cross-study enrichment analysis platform is enabled in REMEMProt for comparative analysis of proteomes using novel/modified membrane enrichment methods and evaluation of methods for targeted enrichment of membrane proteins. REMEMProt data are made freely accessible to explore and download at https://rememprot.ciods.in/.


Biomarkers , Databases, Protein , Membrane Proteins , Proteome , Proteomics , Humans , Proteome/metabolism , Membrane Proteins/metabolism , Biomarkers/metabolism , Animals , Mice , Proteomics/methods , Cell Membrane/metabolism , Mass Spectrometry/methods
2.
Proc Natl Acad Sci U S A ; 121(21): e2400260121, 2024 May 21.
Article En | MEDLINE | ID: mdl-38743624

We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.
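The idea of scoring an interface against randomly chosen residues can be illustrated with a minimal sketch. It is not the ZEPPI implementation: the function name `interface_z_score`, the single per-residue metric, and the sampling scheme are assumptions made only for illustration.

```python
import random
import statistics


def interface_z_score(scores, interface_idx, n_samples=1000, seed=0):
    """Z-score of the mean per-residue metric over interface residues,
    relative to random residue sets of the same size (hypothetical helper)."""
    rng = random.Random(seed)
    k = len(interface_idx)
    # Observed value: mean metric over the residues the model puts in contact.
    observed = statistics.fmean(scores[i] for i in interface_idx)
    # Background: mean metric over k residues drawn at random, many times.
    background = [statistics.fmean(rng.sample(scores, k)) for _ in range(n_samples)]
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        return 0.0  # no spread in the background, so no signal can be measured
    return (observed - mu) / sigma
```

A strongly positive value means the interface residues carry a higher metric (for example, conservation) than random residue sets of the same size would by chance.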


Proteome , Proteome/metabolism , Humans , Protein Interaction Mapping/methods , Models, Molecular , Escherichia coli/metabolism , Escherichia coli/genetics , Databases, Protein , Protein Binding , Escherichia coli Proteins/metabolism , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/genetics , Proteins/chemistry , Proteins/metabolism , Sequence Alignment
3.
Sci Rep ; 14(1): 10527, 2024 05 08.
Article En | MEDLINE | ID: mdl-38719885

Plasmodium falciparum, the causative agent of malaria, poses a significant global health challenge, yet much of its biology remains elusive. A third of the genes in the P. falciparum genome lack annotations regarding their function, impeding our understanding of the parasite's biology. In this study, we employ structure predictions and the DALI search algorithm to analyse proteins encoded by uncharacterized genes in the reference strain 3D7 of P. falciparum. By comparing AlphaFold predictions to experimentally determined protein structures in the Protein Data Bank, we found similarities to known domains in 353 proteins of unknown function, shedding light on their potential functions. The lowest-scoring 5% of similarities were additionally validated using the size-independent TM-align algorithm, confirming the detected similarities in 88% of the cases. Notably, in over 70 P. falciparum proteins the presence of domains resembling heptatricopeptide repeats, which are typically involved in RNA binding and processing, was detected. This suggests this family, which is important in transcription in mitochondria and apicoplasts, is much larger in Plasmodium parasites than previously thought. The results of this domain search provide a resource to the malaria research community that is expected to inform and enable experimental studies.


Plasmodium falciparum , Protozoan Proteins , Plasmodium falciparum/genetics , Plasmodium falciparum/metabolism , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Protozoan Proteins/chemistry , Algorithms , Protein Domains , Databases, Protein , Models, Molecular
4.
Nat Commun ; 15(1): 3956, 2024 May 10.
Article En | MEDLINE | ID: mdl-38730277

Immunopeptidomics is crucial for immunotherapy and vaccine development. Because the generation of immunopeptides from their parent proteins does not adhere to clear-cut rules, rather than being able to use known digestion patterns, every possible protein subsequence within human leukocyte antigen (HLA) class-specific length restrictions needs to be considered during sequence database searching. This leads to an inflation of the search space and results in lower spectrum annotation rates. Peptide-spectrum match (PSM) rescoring is a powerful enhancement of standard searching that boosts the spectrum annotation performance. We analyze 302,105 unique synthesized non-tryptic peptides from the ProteomeTools project on a timsTOF-Pro to generate a ground-truth dataset containing 93,227 MS/MS spectra of 74,847 unique peptides, that is used to fine-tune the deep learning-based fragment ion intensity prediction model Prosit. We demonstrate up to 3-fold improvement in the identification of immunopeptides, as well as increased detection of immunopeptides from low input samples.


Deep Learning , Peptides , Tandem Mass Spectrometry , Humans , Peptides/chemistry , Peptides/immunology , Tandem Mass Spectrometry/methods , Databases, Protein , Proteomics/methods , HLA Antigens/immunology , HLA Antigens/genetics , Software , Ions
5.
Nat Commun ; 15(1): 3922, 2024 May 09.
Article En | MEDLINE | ID: mdl-38724498

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.


Proteomics , Proteomics/methods , Workflow , Machine Learning , Proteome/metabolism , Humans , Algorithms , Databases, Protein
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Article En | MEDLINE | ID: mdl-38725156

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. Each sequence length has 886 positive samples and 4707 negative samples. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM has competitive performance for non-histone acetylation site prediction compared with three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.


Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism
7.
Bioinformatics ; 40(5)2024 May 02.
Article En | MEDLINE | ID: mdl-38718225

MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.


Algorithms , Neural Networks, Computer , Protein Domains , Proteins , Proteins/chemistry , Databases, Protein , Computational Biology/methods , Software , Humans
8.
Sci Data ; 11(1): 495, 2024 May 14.
Article En | MEDLINE | ID: mdl-38744964

Single amino acid substitutions can profoundly affect protein folding, dynamics, and function. The ability to discern between benign and pathogenic substitutions is pivotal for therapeutic interventions and research directions. Given the limitations in experimental examination of these variants, AlphaMissense has emerged as a promising predictor of the pathogenicity of missense variants. Since heterogeneous performance on different types of proteins can be expected, we assessed the efficacy of AlphaMissense across several protein groups (e.g. soluble, transmembrane, and mitochondrial proteins) and regions (e.g. intramembrane, membrane interacting, and high confidence AlphaFold segments) using ClinVar data for validation. Our comprehensive evaluation showed that AlphaMissense delivers outstanding performance, with MCC scores predominantly between 0.6 and 0.74. We observed low performance on disordered datasets and ClinVar data related to the CFTR ABC protein. However, a superior performance was shown when benchmarked against the high quality CFTR2 database. Our results with CFTR emphasize AlphaMissense's potential in pinpointing functional hot spots, with its performance likely surpassing benchmarks calculated from ClinVar and ProteinGym datasets.
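For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) cited in this abstract can be computed from a binary confusion matrix as in the sketch below. The function name `mcc` and the convention of returning 0.0 for a degenerate matrix are illustrative choices, not taken from the article.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct). It remains informative when benign and pathogenic variants are unevenly represented.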


Databases, Protein , Proteins , Humans , Amino Acid Substitution , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Cystic Fibrosis Transmembrane Conductance Regulator/chemistry , Mutation, Missense , Protein Folding , Proteins/chemistry , Proteins/genetics
9.
Nat Commun ; 15(1): 4217, 2024 May 17.
Article En | MEDLINE | ID: mdl-38760359

Helix mimicry provides probes to perturb protein-protein interactions (PPIs). Helical conformations can be stabilized by joining side chains of non-terminal residues (stapling) or via capping fragments. Nature exclusively uses capping, but synthetic helical mimics are heavily biased towards stapling. This study comprises: (i) creation of a searchable database of unique helical N-caps (ASX motifs, a protein structural motif with two intramolecular hydrogen-bonds between aspartic acid/asparagine and following residues); (ii) testing trends observed in this database using linear peptides comprising only canonical L-amino acids; and, (iii) novel synthetic N-caps for helical interface mimicry. Here we show many natural ASX motifs comprise hydrophobic triangles, validate their effect in linear peptides, and further develop a biomimetic of them, Bicyclic ASX Motif Mimics (BAMMs). BAMMs are powerful helix inducing motifs. They are synthetically accessible, and potentially useful to a broad section of the community studying disruption of PPIs using secondary structure mimics.


Amino Acid Motifs , Computational Biology , Computational Biology/methods , Hydrogen Bonding , Peptides/chemistry , Peptides/metabolism , Hydrophobic and Hydrophilic Interactions , Protein Structure, Secondary , Models, Molecular , Amino Acid Sequence , Databases, Protein , Proteins/chemistry , Proteins/metabolism , Aspartic Acid/chemistry
10.
BMC Bioinformatics ; 25(1): 174, 2024 May 02.
Article En | MEDLINE | ID: mdl-38698340

BACKGROUND: In the last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS: We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION: The code and data used for training GO-LTR are available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction.


Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Databases, Protein , Algorithms
11.
Brief Bioinform ; 25(3)2024 Mar 27.
Article En | MEDLINE | ID: mdl-38701416

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein in three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Algorithms , Computational Biology , Neural Networks, Computer , Protein Structure, Secondary , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Databases, Protein , Gene Ontology , Sequence Analysis, Protein/methods , Software
12.
Brief Bioinform ; 25(3)2024 Mar 27.
Article En | MEDLINE | ID: mdl-38695119

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
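The definition of the E-score as a cosine similarity between residue embeddings can be sketched in a few lines. The function names `e_score` and `e_score_matrix` are hypothetical, and the embeddings are plain lists of floats standing in for the output of a protein language model. This is not the authors' code.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embedding vectors must be non-zero")
    return dot / (norm_u * norm_v)


def e_score_matrix(emb_a, emb_b):
    """Pairwise scores between all residues of two sequences (rows: emb_a)."""
    return [[e_score(u, v) for v in emb_b] for u in emb_a]
```

A matrix of this kind could take the place of a fixed substitution matrix lookup in a dynamic-programming aligner. Because each embedding depends on the surrounding sequence, the same pair of amino acids can receive different scores at different positions.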


Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
13.
Brief Bioinform ; 25(3)2024 Mar 27.
Article En | MEDLINE | ID: mdl-38706315

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.


Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
14.
Brief Bioinform ; 25(3)2024 Mar 27.
Article En | MEDLINE | ID: mdl-38706316

Protein-ligand interactions (PLIs) are essential for cellular activities and drug discovery. However, due to the complexity and high cost of experimental methods, there is a great demand for computational approaches to recognize PLI patterns, such as protein-ligand docking. In recent years, more and more models based on machine learning have been developed to directly predict the root mean square deviation (RMSD) of a ligand docking pose with reference to its native binding pose. However, new scoring methods are urgently needed for more accurate RMSD prediction. We present a new deep learning-based scoring method for RMSD prediction of protein-ligand docking poses based on a Graphormer method and Shell-like graph architecture, named GSScore. To recognize near-native conformations from a set of poses, GSScore takes atoms as nodes and then establishes the docking interface of protein-ligand into multiple bipartite graphs within different shell ranges. Benefiting from the Graphormer and Shell-like graph architecture, GSScore can effectively capture the subtle differences between energetically favorable near-native conformations and unfavorable non-native poses without extra information. GSScore was extensively evaluated on diverse test sets including a subset of PDBBind version 2019, CASF2016 as well as DUD-E, and obtained significant improvements over existing methods in terms of RMSE, R (Pearson correlation coefficient), Spearman correlation coefficient and Docking power.
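The quantity that such scoring methods try to predict, the RMSD of a docking pose relative to the native pose, can be computed directly when both sets of coordinates are known. The sketch below assumes that atoms are listed in the same order in both poses and that no superposition is applied, since docking poses share the receptor's coordinate frame. The function name `rmsd` is illustrative.

```python
import math


def rmsd(pose, reference):
    """Root mean square deviation between two equally ordered lists of
    (x, y, z) atom coordinates, without any superposition step."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_p, atom_r in zip(pose, reference)
        for a, b in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pose))
```

A learned scoring function is useful precisely because the native pose is unknown at prediction time. A direct calculation like this one is what supplies the training labels.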


Molecular Docking Simulation , Proteins , Ligands , Proteins/chemistry , Proteins/metabolism , Protein Binding , Software , Algorithms , Computational Biology/methods , Protein Conformation , Databases, Protein , Deep Learning
15.
Database (Oxford) ; 20242024 May 06.
Article En | MEDLINE | ID: mdl-38713861

Cancer immunotherapy has brought about a revolutionary breakthrough in the field of cancer treatment. Immunotherapy has changed the treatment landscape for a variety of solid and hematologic malignancies. To assist researchers in efficiently uncovering valuable information related to cancer immunotherapy, we present a manually curated comprehensive database called DIRMC, which focuses on molecular features involved in cancer immunotherapy. All the content was collected manually from published literature, authoritative clinical trial data submitted by clinicians, some databases for drug target prediction such as DrugBank, and some experimentally confirmed high-throughput data sets for the characterization of immune-related molecular interactions in cancer, such as a curated database of T-cell receptor sequences with known antigen specificity (VDJdb) and a pathology-associated TCR database (McPAS-TCR), among others. By constructing a fully connected functional network, ranging from cancer-related gene mutations to target genes to translated target proteins to protein regions or sites that may specifically affect protein function, we aim to comprehensively characterize molecular features related to cancer immunotherapy. We have developed the scoring criteria to assess the reliability of each MHC-peptide-T-cell receptor (TCR) interaction item to provide a reference for users. The database provides a user-friendly interface to browse and retrieve data by genes, target proteins, diseases and more. DIRMC also provides a download and submission page for researchers to access data of interest for further investigation or submit new interactions related to cancer immunotherapy targets. Furthermore, DIRMC provides a graphical interface to help users predict the binding affinity between their own peptide of interest and MHC or TCR.
This database will provide researchers with a one-stop resource to understand cancer immunotherapy-related targets as well as data on MHC-peptide-TCR interactions. It aims to offer reliable molecular characteristics support for both the analysis of the current status of cancer immunotherapy and the development of new immunotherapy. DIRMC is available at http://www.dirmc.tech/. Database URL: http://www.dirmc.tech/.


Immunotherapy , Neoplasms , Immunotherapy/methods , Humans , Neoplasms/immunology , Neoplasms/genetics , Neoplasms/therapy , Receptors, Antigen, T-Cell/immunology , Receptors, Antigen, T-Cell/genetics , Databases, Protein , User-Computer Interface
16.
BMC Bioinformatics ; 25(1): 176, 2024 May 04.
Article En | MEDLINE | ID: mdl-38704533

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful designs. Firstly, it is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and lightweight design. Besides, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for lots of protein research works. 
It is worth mentioning that FreeProtMap could be used to scan all proteins discovered each year and find structurally similar proteins in a short time, because structure similarity calculation based on distance maps is much less time-consuming than algorithms based on 3D structures.


Proteins , Proteins/chemistry , Computational Biology/methods , Databases, Protein , Protein Conformation , Algorithms , Sequence Analysis, Protein/methods , Neural Networks, Computer
17.
Protein Sci ; 33(6): e4988, 2024 Jun.
Article En | MEDLINE | ID: mdl-38757367

Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. 
Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
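The co-annotation counting that underlies a domain-to-GO mapping can be sketched as follows. This covers only the counting step, not the statistical resampling described in the abstract. The function name `co_annotation_counts` and the identifiers in the example are hypothetical, not drawn from Domain2GO.

```python
from collections import Counter
from itertools import product


def co_annotation_counts(protein_domains, protein_go):
    """Count, for every (domain, GO term) pair, how many proteins carry both.

    protein_domains: dict mapping protein ID -> iterable of domain IDs
    protein_go:      dict mapping protein ID -> iterable of GO term IDs
    """
    counts = Counter()
    for protein, domains in protein_domains.items():
        terms = protein_go.get(protein, ())
        # Each protein contributes at most once to each (domain, term) pair.
        counts.update(product(set(domains), set(terms)))
    return counts


# Hypothetical toy input for illustration only.
example_domains = {"P1": {"PF_A"}, "P2": {"PF_A", "PF_B"}}
example_go = {"P1": {"GO:X"}, "P2": {"GO:X", "GO:Y"}}
example_counts = co_annotation_counts(example_domains, example_go)
```

Pairs with high counts relative to what the individual domain and term frequencies would predict are candidates for a domain-function association. Deciding which counts are reliable is where a significance test or resampling step would come in.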


Molecular Sequence Annotation , Protein Domains , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Databases, Protein , Computational Biology/methods , Gene Ontology , Humans , Software
18.
Curr Protoc ; 4(5): e1047, 2024 May.
Article En | MEDLINE | ID: mdl-38720559

Recent advancements in protein structure determination and especially in protein structure prediction techniques have led to the availability of vast amounts of macromolecular structures. However, the accessibility and integration of these structures into scientific workflows are hindered by the lack of standardization among publicly available data resources. To address this issue, we introduced the 3D-Beacons Network, a unified platform that aims to establish a standardized framework for accessing and displaying protein structure data. In this article, we highlight the importance of standardized approaches for accessing protein structure data and showcase the capabilities of 3D-Beacons. We describe four protocols for finding and accessing macromolecular structures from various specialist data resources via 3D-Beacons. First, we describe three scenarios for programmatically accessing and retrieving data using the 3D-Beacons API. Next, we show how to perform sequence-based searches to find structures from model providers. Then, we demonstrate how to search for structures and fetch them directly into a workflow using JalView. Finally, we outline the process of facilitating access to data from providers interested in contributing their structures to the 3D-Beacons Network. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: Programmatic access to the 3D-Beacons API
Basic Protocol 2: Sequence-based search using the 3D-Beacons API
Basic Protocol 3: Accessing macromolecules from 3D-Beacons with JalView
Basic Protocol 4: Enhancing data accessibility through 3D-Beacons.
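
Programmatic access of the kind described above might look like the following sketch. It is not taken from the protocols themselves. The base URL, the `uniprot/summary/{accession}.json` path, and the JSON field names (`structures`, `summary`, `provider`, `model_identifier`, `model_url`) are assumptions about the 3D-Beacons API and should be checked against its current documentation. The sample payload is hypothetical and only demonstrates the parsing step without a network call.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the 3D-Beacons hub API; verify before use.
BASE_URL = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api"


def summary_url(accession):
    """Build the (assumed) summary endpoint URL for a UniProt accession."""
    return f"{BASE_URL}/uniprot/summary/{accession}.json"


def list_models(payload):
    """Extract (provider, model identifier, model URL) tuples from a
    summary response, assuming the field names noted above."""
    models = []
    for entry in payload.get("structures", []):
        summary = entry.get("summary", {})
        models.append(
            (
                summary.get("provider"),
                summary.get("model_identifier"),
                summary.get("model_url"),
            )
        )
    return models


def fetch_models(accession, timeout=30):
    """Download and parse the summary for one accession (needs network)."""
    with urlopen(summary_url(accession), timeout=timeout) as response:
        return list_models(json.load(response))


# Hypothetical payload illustrating the assumed response shape.
sample_payload = {
    "structures": [
        {
            "summary": {
                "provider": "ExampleProvider",
                "model_identifier": "MODEL-0001",
                "model_url": "https://example.org/MODEL-0001.cif",
            }
        }
    ]
}
print(list_models(sample_payload))
```

Keeping the parsing in `list_models` separate from the download in `fetch_models` lets the parsing be exercised offline and adjusted if the response schema differs from what is assumed here.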


Protein Conformation , Proteins , Proteins/chemistry , Databases, Protein , Software
19.
Int J Mol Sci ; 25(9)2024 Apr 28.
Article En | MEDLINE | ID: mdl-38732022

The molecular weight (MW) of an enzyme is a critical parameter in enzyme-constrained models (ecModels). It is determined by two factors: which subunits are present and how many copies of each subunit the enzyme contains. Although the number of subunits (NS) can potentially be obtained from UniProt, this information is not readily available for most proteins. In this study, we addressed this gap by extracting and curating subunit information from the UniProt database to establish a robust benchmark dataset. We then propose a model named DeepSub, which combines a protein language model with a bidirectional gated recurrent unit (GRU) to predict NS in homo-oligomers solely from protein sequences. DeepSub achieves an accuracy as high as 0.967, surpassing QUEEN. To validate DeepSub, we performed predictions for protein homo-oligomers that have been reported in the literature but are not documented in the UniProt database. Examples include homoserine dehydrogenase from Corynebacterium glutamicum, Matrilin-4 from Mus musculus and Homo sapiens, and the multimerin protein family from M. musculus and H. sapiens. The predictions align closely with the findings reported in the literature, underscoring the reliability and utility of DeepSub.
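
The dependence of enzyme MW on subunit composition can be made concrete with a small worked sketch: the complex weight is the sum over subunit types of copy number times subunit mass. The function name and the masses below are hypothetical illustration values, not data from the study.

```python
def complex_molecular_weight(subunits):
    """Sum copies * mass over (copies, mass_kda) pairs, one per subunit type."""
    return sum(copies * mass_kda for copies, mass_kda in subunits)


# Homo-oligomer: a hypothetical 50.0 kDa monomer predicted to form a tetramer (NS = 4).
homotetramer_kda = complex_molecular_weight([(4, 50.0)])

# Hetero-oligomer: two copies of a 30.0 kDa subunit plus one 45.5 kDa subunit.
hetero_kda = complex_molecular_weight([(2, 30.0), (1, 45.5)])

print(homotetramer_kda, hetero_kda)
```

For a homo-oligomer the sum reduces to NS times the monomer mass, which is why a sequence-based NS prediction is enough to estimate the MW of such enzymes.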


Databases, Protein , Deep Learning , Protein Subunits , Protein Subunits/chemistry , Protein Subunits/metabolism , Animals , Humans , Protein Multimerization , Mice , Computational Biology/methods
20.
BMC Genomics ; 25(1): 466, 2024 May 13.
Article En | MEDLINE | ID: mdl-38741045

BACKGROUND: Protein-protein interactions (PPIs) are central to biology, and precise PPI prediction is pivotal for understanding cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. METHODS: We introduce a new node representation method based on initial information fusion, called FFANE, which combines PPI networks and protein sequence data to improve the precision of PPI prediction. A Gaussian kernel similarity matrix is first built from protein structural similarities. Concurrently, protein sequence similarities are measured using the Levenshtein distance, capturing diverse protein attributes. These two feature matrices are then merged by weighted fusion into an initial information matrix that integrates structural and sequence details. A Stacked Autoencoder (SAE) then encodes the fused features to yield more representative features. Finally, classification models are trained to predict PPIs from the learned fusion features. RESULTS: In 5-fold cross-validation experiments with an SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% on the Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. CONCLUSION: Experimental findings across various real datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.
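
The three building blocks named in the METHODS (Gaussian kernel similarity, Levenshtein-based sequence similarity, and weighted fusion) can be sketched as follows. This is not the FFANE code. The abstract does not specify the kernel bandwidth, the fusion weight, the input vectors for the kernel, or how edit distance is converted to a similarity, so `gamma`, `weight`, and the length normalization below are assumptions.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def sequence_similarity(a, b):
    """Map edit distance to [0, 1] by dividing by the longer length
    (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) between two feature vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-gamma * squared_distance)


def fuse(structural, sequence, weight=0.5):
    """Element-wise weighted fusion of two similarity matrices of equal shape."""
    return [
        [weight * s + (1.0 - weight) * q for s, q in zip(row_s, row_q)]
        for row_s, row_q in zip(structural, sequence)
    ]


print(levenshtein("kitten", "sitting"))
print(fuse([[1.0, 0.0]], [[0.0, 1.0]], weight=0.5))
```

In the pipeline the abstract describes, a matrix fused this way would then be passed to the stacked autoencoder and the downstream classifier, both of which are outside the scope of this sketch.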


Computational Biology , Protein Interaction Mapping , Protein Interaction Mapping/methods , Computational Biology/methods , Algorithms , Helicobacter pylori/metabolism , Helicobacter pylori/genetics , Support Vector Machine , Proteins/metabolism , Proteins/chemistry , Humans , Protein Interaction Maps , Databases, Protein
...