Search | VHL Regional Portal (Portal Regional da BVS)

1.

GATSol, an enhanced predictor of protein solubility through the synergy of 3D structure graph and large language modeling.

Li, Bin; Ming, Dengming.

BMC Bioinformatics ; 25(1): 204, 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38824535

ABSTRACT

BACKGROUND: Protein solubility is a critically important physicochemical property closely related to protein expression. For example, it is one of the main factors to be considered in the design and production of antibody drugs and a prerequisite for realizing various protein functions. Although several solubility prediction models have emerged in recent years, many of these models are limited to capturing information embedded in one-dimensional amino acid sequences, resulting in unsatisfactory predictive performance. RESULTS: In this study, we introduce a novel Graph Attention network-based protein Solubility model, GATSol, which represents the 3D structure of proteins as a protein graph. In addition to the node features of amino acids extracted by the state-of-the-art protein large language model, GATSol utilizes amino acid distance maps generated using the latest AlphaFold technology. Rigorous testing on independent eSOL and the Saccharomyces cerevisiae test datasets has shown that GATSol outperforms most recently introduced models, especially with respect to the coefficient of determination R², which reaches 0.517 and 0.424, respectively. It outperforms the current state-of-the-art GraphSol by 18.4% on the S. cerevisiae_test set. CONCLUSIONS: GATSol captures three-dimensional features of proteins by building protein graphs, which significantly improves the accuracy of protein solubility prediction. Recent advances in protein structure modeling allow our method to incorporate spatial structure features extracted from predicted structures into the model by relying only on the input of protein sequences, which simplifies the entire graph neural network prediction process, making it more user-friendly and efficient. As a result, GATSol may help prioritize highly soluble proteins, ultimately reducing the cost and effort of experimental work. The source code and data of the GATSol model are freely available at https://github.com/binbinbinv/GATSol.
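The idea of turning a residue distance map into a protein graph can be sketched in a few lines. This is a minimal illustration, not the GATSol implementation: the function names, the toy coordinates, and the 10 angstrom cutoff are all assumptions made here for the example, and the abstract does not state which threshold or atom type the authors use.

```python
import math

def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_edges(dmap, cutoff=10.0):
    """Undirected edges (i, j) with i < j for residue pairs closer than `cutoff`.

    The 10 angstrom default is an assumed illustrative value, not taken from the paper.
    """
    n = len(dmap)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dmap[i][j] < cutoff]

# Toy coordinates in angstroms, made up for illustration.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
dmap = distance_map(coords)
edges = contact_edges(dmap)
print(edges)
```

In a graph neural network setting, each node would additionally carry a feature vector (for example a language-model embedding of the residue), and the edge list above would define which residues exchange messages.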

Subjects

Proteins ; Solubility ; Proteins/chemistry ; Proteins/metabolism ; Protein Conformation ; Databases, Protein ; Computational Biology/methods ; Software ; Saccharomyces cerevisiae/metabolism ; Saccharomyces cerevisiae/chemistry ; Algorithms ; Models, Molecular ; Amino Acid Sequence

2.

VISH-Pred: an ensemble of fine-tuned ESM models for protein toxicity prediction.

Mall, Raghvendra; Singh, Ankita; Patel, Chirag N; Guirimand, Gregory; Castiglione, Filippo.

Brief Bioinform ; 25(4), 2024 May 23.

Article in English | MEDLINE | ID: mdl-38842509

ABSTRACT

Peptide- and protein-based therapeutics are becoming a promising treatment regimen for myriad diseases. Toxicity of proteins is the primary hurdle for protein-based therapies. Thus, there is an urgent need for accurate in silico methods for determining toxic proteins to filter the pool of potential candidates. At the same time, it is imperative to precisely identify non-toxic proteins to expand the possibilities for protein-based biologics. To address this challenge, we proposed an ensemble framework, called VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The primary steps in the VISH-Pred framework are to efficiently estimate protein toxicities taking just the protein sequence as input, employing an undersampling technique to handle the severe class imbalance in the data and learning representations from fine-tuned ESM2 protein language models, which are then fed to machine learning techniques such as LightGBM and XGBoost. The VISH-Pred framework is able to correctly identify both peptides/proteins with potential toxicity and non-toxic proteins, achieving a Matthews correlation coefficient of 0.737, 0.716 and 0.322 and F1-score of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, outperforming other methods by over 10% on these quality metrics. Moreover, VISH-Pred achieved the best accuracy and area under the receiver operating characteristic curve scores on these independent test sets, highlighting the robustness and generalization capability of the framework. By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future endeavors aimed at discerning the toxicity of peptides and enabling efficient protein-based therapeutics.
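The two headline metrics in this abstract, the Matthews correlation coefficient (MCC) and the F1-score, are both computed from the confusion matrix of a binary classifier. The sketch below shows the standard formulas; the function names and the example counts are made up for illustration and are unrelated to the VISH-Pred test sets.

```python
import math

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced test set: 10 toxic and 90 non-toxic proteins.
tp, tn, fp, fn = 8, 85, 5, 2
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred when the classes are heavily imbalanced, as they are in toxicity data.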

Subjects

Proteins ; Proteins/metabolism ; Proteins/chemistry ; Machine Learning ; Databases, Protein ; Computational Biology/methods ; Humans ; Peptides/toxicity ; Peptides/chemistry ; Computer Simulation ; Algorithms ; Software

3.

Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering.

Barone, Federico; Russo, Elena Tea; Villegas Garcia, Edith Natalia; Punta, Marco; Cozzini, Stefano; Ansuini, Alessio; Cazzaniga, Alberto.

Sci Data ; 11(1): 568, 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38824125

ABSTRACT

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.

Subjects

Proteome ; Humans ; Gastrointestinal Tract/metabolism ; Cluster Analysis ; Molecular Sequence Annotation ; Metagenomics ; Databases, Protein

4.

Folding the human proteome using BioNeMo: A fused dataset of structural models for machine learning purposes.

Hetmann, Michael; Parigger, Lena; Sirelkhatim, Hassan; Stern, Abraham; Krassnigg, Andreas; Gruber, Karl; Steinkellner, Georg; Ruau, David; Gruber, Christian C.

Sci Data ; 11(1): 591, 2024 Jun 06.

Article in English | MEDLINE | ID: mdl-38844754

ABSTRACT

Human proteins are crucial players in both health and disease. Understanding their molecular landscape is a central topic in biological research. Here, we present an extensive dataset of predicted protein structures for 42,042 distinct human proteins, including splicing variants, derived from the UniProt reference proteome UP000005640. To ensure high quality and comparability, the dataset was generated by combining state-of-the-art modeling-tools AlphaFold 2, OpenFold, and ESMFold, provided within NVIDIA's BioNeMo platform, as well as homology modeling using Innophore's CavitomiX platform. Our dataset is offered in both unedited and edited formats for diverse research requirements. The unedited version contains structures as generated by the different prediction methods, whereas the edited version contains refinements, including a dataset of structures without low prediction-confidence regions and structures in complex with predicted ligands based on homologs in the PDB. We are confident that this dataset represents the most comprehensive collection of human protein structures available today, facilitating diverse applications such as structure-based drug design and the prediction of protein function and interactions.

Subjects

Machine Learning ; Proteome ; Humans ; Protein Folding ; Databases, Protein ; Protein Conformation ; Models, Molecular

5.

MechanoProDB: a web-based database for exploring the mechanical properties of proteins.

Mesbah, Ismahene; Habermann, Bianca; Rico, Felix.

Database (Oxford) ; 2024, 2024 Jun 05.

Article in English | MEDLINE | ID: mdl-38837788

ABSTRACT

The mechanical stability of proteins is crucial for biological processes. To understand the mechanical functions of proteins, it is important to know the protein structure and mechanical properties. Protein mechanics is usually investigated through force spectroscopy experiments and simulations that probe the forces required to unfold the protein of interest. While there is a wealth of data in the literature on force spectroscopy experiments and steered molecular dynamics simulations of forced protein unfolding, this information is scattered and difficult for non-experts to access. Here, we introduce MechanoProDB, a novel web-based database resource for collecting and mining data obtained from experimental and computational works. MechanoProDB provides a curated repository for a wide range of proteins, including muscle proteins, adhesion molecules and membrane proteins. The database incorporates relevant parameters that provide insights into the mechanical stability of proteins and their conformational stability, such as the unfolding forces, energy landscape parameters and contour lengths of unfolding steps. Additionally, it provides intuitive annotations of the unfolding pathways of each protein, allowing users to explore the individual steps during mechanical unfolding. The user-friendly interface of MechanoProDB allows researchers to efficiently navigate, search and download data pertaining to specific protein folds or experimental conditions. Users can visualize protein structures using interactive tools integrated within the database, such as Mol*, and plot available data through integrated plotting tools. To ensure data quality and reliability, we have manually verified and curated the data currently available on MechanoProDB. Furthermore, the database also features an interface that enables users to contribute new data and annotations, promoting community-driven comprehensiveness. MechanoProDB aims to streamline and accelerate research in the field of mechanobiology and biophysics by offering a unique platform for data sharing and analysis. MechanoProDB is freely available at https://mechanoprodb.ibdm.univ-amu.fr.

Subjects

Databases, Protein ; Internet ; Proteins ; Proteins/chemistry ; Proteins/metabolism ; User-Computer Interface ; Protein Unfolding

6.

Identification of domains in Plasmodium falciparum proteins of unknown function using DALI search on AlphaFold predictions.

Behrens, Hannah Michaela; Spielmann, Tobias.

Sci Rep ; 14(1): 10527, 2024 May 08.

Article in English | MEDLINE | ID: mdl-38719885

ABSTRACT

Plasmodium falciparum, the causative agent of malaria, poses a significant global health challenge, yet much of its biology remains elusive. A third of the genes in the P. falciparum genome lack annotations regarding their function, impeding our understanding of the parasite's biology. In this study, we employ structure predictions and the DALI search algorithm to analyse proteins encoded by uncharacterized genes in the reference strain 3D7 of P. falciparum. By comparing AlphaFold predictions to experimentally determined protein structures in the Protein Data Bank, we found similarities to known domains in 353 proteins of unknown function, shedding light on their potential functions. The lowest-scoring 5% of similarities were additionally validated using the size-independent TM-align algorithm, confirming the detected similarities in 88% of the cases. Notably, in over 70 P. falciparum proteins the presence of domains resembling heptatricopeptide repeats, which are typically involved in RNA binding and processing, was detected. This suggests this family, which is important in transcription in mitochondria and apicoplasts, is much larger in Plasmodium parasites than previously thought. The results of this domain search provide a resource to the malaria research community that is expected to inform and enable experimental studies.

Subjects

Plasmodium falciparum ; Protozoan Proteins ; Plasmodium falciparum/genetics ; Plasmodium falciparum/metabolism ; Protozoan Proteins/genetics ; Protozoan Proteins/metabolism ; Protozoan Proteins/chemistry ; Algorithms ; Protein Domains ; Databases, Protein ; Models, Molecular

7.

Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.

Peng, Hui; Wang, He; Kong, Weijia; Li, Jinyan; Goh, Wilson Wen Bin.

Nat Commun ; 15(1): 3922, 2024 May 09.

Article in English | MEDLINE | ID: mdl-38724498

ABSTRACT

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolving inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.

Subjects

Proteomics ; Proteomics/methods ; Workflow ; Machine Learning ; Proteome/metabolism ; Humans ; Algorithms ; Databases, Protein

8.

TransPTM: a transformer-based model for non-histone acetylation site prediction.

Meng, Lingkuan; Chen, Xingjian; Cheng, Ke; Chen, Nanjun; Zheng, Zetian; Wang, Fuzhou; Sun, Hongyan; Wong, Ka-Chun.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length, ranging from 11 to 61 amino acids. There are 886 positive samples and 4707 negative samples in total for each sequence length. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM achieves competitive performance for non-histone acetylation site prediction compared with three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.

Subjects

Neural Networks, Computer ; Protein Processing, Post-Translational ; Acetylation ; Computational Biology/methods ; Databases, Protein ; Software ; Algorithms ; Humans ; Proteins/chemistry ; Proteins/metabolism

9.

REMEMProt: a resource of membrane-enriched proteome profiles, their disease associations, and biomarker status.

Aravind, Anjana; Nandakumar, Revathy; Ahmed, Mukhtar; Nisar, Mahammad; Palollathil, Akhina; Kanichery, Anagha; Sreelan, Sourav; Sinan, Kp Munavvar; Balaya, Rex Devasahayam Arokia; Vijayakumar, Manavalan; Prasad, Thottethodi Subrahmanya Keshava; Raju, Rajesh.

Life Sci Alliance ; 7(7), 2024 Jul.

Article in English | MEDLINE | ID: mdl-38719747

ABSTRACT

The differential expression of plasma membrane proteins is widely analyzed for diagnostic, prognostic, and therapeutic applications in diverse clinical manifestations. Accordingly, distinct membrane protein enrichment methods and mass spectrometry platforms are employed for their global and relative quantitation. In the first effort of its kind, we compiled membrane-associated proteomes in human and mouse systems into a database named Resource of Experimental Membrane-Enriched Mass spectrometry-derived Proteome (REMEMProt). It currently hosts 14,626 proteins (9,507 proteins in Homo sapiens; 5,119 proteins in Mus musculus) with information on their membrane-protein enrichment methods, experimental/physiological context of detection in cells or tissues, transmembrane domain analysis, and their current attribution as biomarkers. Based on these annotations and the transmembrane domain analysis in proteins or their binary/complex protein-protein interactors, REMEMProt facilitates the assessment of the plasma membrane localization potential of proteins through batch query. A cross-study enrichment analysis platform is enabled in REMEMProt for comparative analysis of proteomes using novel/modified membrane enrichment methods and evaluation of methods for targeted enrichment of membrane proteins. REMEMProt data are made freely accessible to explore and download at https://rememprot.ciods.in/.

Subjects

Biomarkers ; Databases, Protein ; Membrane Proteins ; Proteome ; Proteomics ; Humans ; Proteome/metabolism ; Membrane Proteins/metabolism ; Biomarkers/metabolism ; Animals ; Mice ; Proteomics/methods ; Cell Membrane/metabolism ; Mass Spectrometry/methods

10.

Protein function prediction through multi-view multi-label latent tensor reconstruction.

Armah-Sekum, Robert Ebo; Szedmak, Sandor; Rousu, Juho.

BMC Bioinformatics ; 25(1): 174, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38698340

ABSTRACT

BACKGROUND: In the last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS: We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, and rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION: The code and data used for training GO-LTR are available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction.

Subjects

Computational Biology ; Proteins ; Proteins/chemistry ; Proteins/metabolism ; Computational Biology/methods ; Databases, Protein ; Algorithms

11.

DeepSS2GO: protein function prediction from secondary structure.

Song, Fu V; Su, Jiaqi; Huang, Sixing; Zhang, Neng; Li, Kaiyue; Ni, Ming; Liao, Maofu.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein's three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model, DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.

Subjects

Algorithms ; Computational Biology ; Neural Networks, Computer ; Protein Structure, Secondary ; Proteins ; Proteins/chemistry ; Proteins/metabolism ; Proteins/genetics ; Computational Biology/methods ; Databases, Protein ; Gene Ontology ; Sequence Analysis, Protein/methods ; Software

12.

Scoring alignments by embedding vector similarity.

Ashrafzadeh, Sepehr; Golding, G Brian; Ilie, Silvana; Ilie, Lucian.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
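The E-score defined in this abstract is the cosine similarity between two residue embedding vectors. A minimal sketch of that formula follows; the function name and the three-dimensional toy vectors are assumptions for illustration only, since real embeddings from a protein language model have hundreds or thousands of dimensions.

```python
import math

def e_score(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy embeddings (made up): the same amino acid in two different contexts
# would get two different vectors, so its score against a third residue
# can differ from one alignment position to the next.
residue_a = [0.9, 0.1, 0.3]
residue_b = [0.8, 0.2, 0.4]
residue_c = [-0.5, 0.9, -0.2]
print(round(e_score(residue_a, residue_b), 3))
print(round(e_score(residue_a, residue_c), 3))
```

The contrast with a substitution matrix such as BLOSUM is that a matrix lookup depends only on the pair of amino acid types, whereas the score above depends on the embeddings, which in turn depend on the surrounding sequence.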

Subjects

Algorithms ; Computational Biology ; Sequence Alignment ; Sequence Alignment/methods ; Computational Biology/methods ; Software ; Sequence Analysis, Protein/methods ; Amino Acid Sequence ; Proteins/chemistry ; Proteins/genetics ; Deep Learning ; Databases, Protein

13.

Freeprotmap: waiting-free prediction method for protein distance map.

Huang, Jiajian; Li, Jinpeng; Chen, Qinchang; Wang, Xia; Chen, Guangyong; Tang, Jin.

BMC Bioinformatics ; 25(1): 176, 2024 May 04.

Article in English | MEDLINE | ID: mdl-38704533

ABSTRACT

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful design choices. Firstly, the model is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and a lightweight design. In addition, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for many areas of protein research. It is worth mentioning that FreeProtMap could be used to scan all proteins discovered each year to find structurally similar proteins in a short time, because structure similarity calculation based on distance maps is much less time-consuming than algorithms based on 3D structures.
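The triangular inequality constraint mentioned in this abstract says that, for any three residues i, j and k, the distance d(i, j) cannot exceed d(i, k) + d(k, j). The sketch below is a simple after-the-fact consistency check on a distance map; it is an assumption-laden illustration (function name, tolerance and toy maps are made up here) and is not how FreeProtMap builds the constraint into its network.

```python
def triangle_violations(dmap, tol=1e-9):
    """Return triples (i, j, k) where d(i, j) > d(i, k) + d(k, j) + tol.

    `dmap` is a symmetric square matrix of pairwise distances.
    """
    n = len(dmap)
    return [
        (i, j, k)
        for i in range(n)
        for j in range(i + 1, n)
        for k in range(n)
        if k not in (i, j) and dmap[i][j] > dmap[i][k] + dmap[k][j] + tol
    ]

# Distances between three collinear points at positions 0, 3 and 7: consistent.
valid_map = [[0.0, 3.0, 7.0], [3.0, 0.0, 4.0], [7.0, 4.0, 0.0]]
# A map no set of points in space could produce: 5 > 1 + 1.
invalid_map = [[0.0, 1.0, 5.0], [1.0, 0.0, 1.0], [5.0, 1.0, 0.0]]

print(triangle_violations(valid_map))
print(triangle_violations(invalid_map))
```

The check is cubic in the number of residues, so for long proteins it would normally be vectorized or sampled rather than run exhaustively.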

Subjects

Proteins ; Proteins/chemistry ; Computational Biology/methods ; Databases, Protein ; Protein Conformation ; Algorithms ; Sequence Analysis, Protein/methods ; Neural Networks, Computer

14.

DIRMC: a database of immunotherapy-related molecular characteristics.

Liu, Yue; Zhou, Yuhuan; Hu, Xiumei; Le-Ge, Wuri; Wang, Haoyan; Jiang, Tao; Li, Junyi; Hu, Yang; Wang, Yadong.

Database (Oxford) ; 2024, 2024 May 06.

Article in English | MEDLINE | ID: mdl-38713861

ABSTRACT

Cancer immunotherapy has brought about a revolutionary breakthrough in the field of cancer treatment. Immunotherapy has changed the treatment landscape for a variety of solid and hematologic malignancies. To assist researchers in efficiently uncovering valuable information related to cancer immunotherapy, we present a manually curated comprehensive database called DIRMC, which focuses on molecular features involved in cancer immunotherapy. All the content was collected manually from published literature, authoritative clinical trial data submitted by clinicians, databases for drug target prediction such as DrugBank, and experimentally confirmed high-throughput data sets for the characterization of immune-related molecular interactions in cancer, such as a curated database of T-cell receptor sequences with known antigen specificity (VDJdb) and a pathology-associated TCR database (McPAS-TCR), among others. By constructing a fully connected functional network, ranging from cancer-related gene mutations to target genes to translated target proteins to protein regions or sites that may specifically affect protein function, we aim to comprehensively characterize molecular features related to cancer immunotherapy. We have developed scoring criteria to assess the reliability of each MHC-peptide-T-cell receptor (TCR) interaction item to provide a reference for users. The database provides a user-friendly interface to browse and retrieve data by genes, target proteins, diseases and more. DIRMC also provides a download and submission page for researchers to access data of interest for further investigation or submit new interactions related to cancer immunotherapy targets. Furthermore, DIRMC provides a graphical interface to help users predict the binding affinity between their own peptide of interest and MHC or TCR. This database will provide researchers with a one-stop resource to understand cancer immunotherapy-related targets as well as data on MHC-peptide-TCR interactions. It aims to offer reliable molecular characteristics support for both the analysis of the current status of cancer immunotherapy and the development of new immunotherapy. Database URL: http://www.dirmc.tech/.

Subjects

Immunotherapy ; Neoplasms ; Immunotherapy/methods ; Humans ; Neoplasms/immunology ; Neoplasms/genetics ; Neoplasms/therapy ; Receptors, Antigen, T-Cell/immunology ; Receptors, Antigen, T-Cell/genetics ; Databases, Protein ; User-Computer Interface

15.

Evaluating large language models for annotating proteins.

Vitale, Rosario; Bugnon, Leandro A; Fenoy, Emilio Luis; Milone, Diego H; Stegmayer, Georgina.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.

Subjects

Databases, Protein; Proteins; Proteins/chemistry; Molecular Sequence Annotation/methods; Computational Biology/methods; Machine Learning

16.

GSScore: a novel Graphormer-based shell-like scoring method for protein-ligand docking.

Guo, Linyuan; Wang, Jianxin.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38706316

ABSTRACT

Protein-ligand interactions (PLIs) are essential for cellular activities and drug discovery. However, owing to the complexity and high cost of experimental methods, there is a great demand for computational approaches, such as protein-ligand docking, to recognize PLI patterns. In recent years, a growing number of machine learning models have been developed to directly predict the root mean square deviation (RMSD) of a ligand docking pose with reference to its native binding pose. However, new scoring methods are urgently needed for more accurate RMSD prediction. We present a new deep learning-based scoring method for RMSD prediction of protein-ligand docking poses based on a Graphormer method and a Shell-like graph architecture, named GSScore. To recognize near-native conformations from a set of poses, GSScore takes atoms as nodes and then represents the protein-ligand docking interface as multiple bipartite graphs within different shell ranges. Benefiting from the Graphormer and Shell-like graph architecture, GSScore can effectively capture the subtle differences between energetically favorable near-native conformations and unfavorable non-native poses without extra information. GSScore was extensively evaluated on diverse test sets including a subset of PDBBind version 2019, CASF2016 as well as DUD-E, and obtained significant improvements over existing methods in terms of RMSE, R (Pearson correlation coefficient), Spearman correlation coefficient and docking power.
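One possible reading of the "bipartite graphs within different shell ranges" idea is sketched below: every protein-ligand atom pair becomes an edge in the graph of the distance shell it falls into. The shell boundaries, the coordinates, and the function name `shell_bipartite_edges` are hypothetical; the abstract does not give the actual construction, so this is an illustration of distance binning only, not GSScore itself.

```python
import math


def shell_bipartite_edges(protein_atoms, ligand_atoms, shell_bounds):
    """Group protein-ligand atom pairs into distance shells.

    shell_bounds is an increasing list of distances; shell k covers
    [shell_bounds[k], shell_bounds[k + 1]). Each shell's edge list is a
    bipartite graph because every edge joins a protein atom index to a
    ligand atom index.
    """
    shells = [[] for _ in range(len(shell_bounds) - 1)]
    for i, p in enumerate(protein_atoms):
        for j, l in enumerate(ligand_atoms):
            d = math.dist(p, l)
            for k in range(len(shell_bounds) - 1):
                if shell_bounds[k] <= d < shell_bounds[k + 1]:
                    shells[k].append((i, j))
                    break  # a pair belongs to at most one shell
    return shells


# Toy coordinates and hypothetical shell boundaries (arbitrary units)
protein_atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ligand_atoms = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
shells = shell_bipartite_edges(protein_atoms, ligand_atoms, [0.0, 2.0, 4.0, 6.0])
for k, edges in enumerate(shells):
    print(k, edges)
```

Pairs farther apart than the last boundary are simply dropped, which is one way to restrict the graph to the docking interface.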

Subjects

Molecular Docking Simulation; Proteins; Ligands; Proteins/chemistry; Proteins/metabolism; Protein Binding; Software; Algorithms; Computational Biology/methods; Protein Conformation; Databases, Protein; Deep Learning

17.

Bioinformatics leading to conveniently accessible, helix enforcing, bicyclic ASX motif mimics (BAMMs).

Mi, Tianxiong; Nguyen, Duyen; Gao, Zhe; Burgess, Kevin.

Nat Commun ; 15(1): 4217, 2024 May 17.

Article in English | MEDLINE | ID: mdl-38760359

ABSTRACT

Helix mimicry provides probes to perturb protein-protein interactions (PPIs). Helical conformations can be stabilized by joining side chains of non-terminal residues (stapling) or via capping fragments. Nature exclusively uses capping, but synthetic helical mimics are heavily biased towards stapling. This study comprises: (i) creation of a searchable database of unique helical N-caps (ASX motifs, a protein structural motif with two intramolecular hydrogen bonds between aspartic acid/asparagine and following residues); (ii) testing trends observed in this database using linear peptides comprising only canonical L-amino acids; and (iii) novel synthetic N-caps for helical interface mimicry. Here we show that many natural ASX motifs comprise hydrophobic triangles, validate their effect in linear peptides, and further develop a biomimetic of them, Bicyclic ASX Motif Mimics (BAMMs). BAMMs are powerful helix-inducing motifs. They are synthetically accessible, and potentially useful to a broad section of the community studying disruption of PPIs using secondary structure mimics.

Subjects

Amino Acid Motifs; Computational Biology; Computational Biology/methods; Hydrogen Bonding; Peptides/chemistry; Peptides/metabolism; Hydrophobic and Hydrophilic Interactions; Protein Structure, Secondary; Models, Molecular; Amino Acid Sequence; Databases, Protein; Proteins/chemistry; Proteins/metabolism; Aspartic Acid/chemistry

18.

Analysis of AlphaMissense data in different protein groups and structural context.

Tordai, Hedvig; Torres, Odalys; Csepi, Máté; Padányi, Rita; Lukács, Gergely L; Hegedus, Tamás.

Sci Data ; 11(1): 495, 2024 May 14.

Article in English | MEDLINE | ID: mdl-38744964

ABSTRACT

Single amino acid substitutions can profoundly affect protein folding, dynamics, and function. The ability to discern between benign and pathogenic substitutions is pivotal for therapeutic interventions and research directions. Given the limitations in experimental examination of these variants, AlphaMissense has emerged as a promising predictor of the pathogenicity of missense variants. Since heterogeneous performance on different types of proteins can be expected, we assessed the efficacy of AlphaMissense across several protein groups (e.g. soluble, transmembrane, and mitochondrial proteins) and regions (e.g. intramembrane, membrane interacting, and high confidence AlphaFold segments) using ClinVar data for validation. Our comprehensive evaluation showed that AlphaMissense delivers outstanding performance, with MCC scores predominantly between 0.6 and 0.74. We observed low performance on disordered datasets and ClinVar data related to the CFTR ABC protein. However, superior performance was shown when benchmarked against the high quality CFTR2 database. Our results with CFTR emphasize AlphaMissense's potential in pinpointing functional hot spots, with its performance likely surpassing benchmarks calculated from ClinVar and ProteinGym datasets.
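For readers unfamiliar with the MCC (Matthews correlation coefficient) values quoted above, the sketch below computes the score from binary confusion-matrix counts. The counts in the usage line are hypothetical and are not taken from the study; returning 0.0 when the denominator vanishes is a common convention, assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a pathogenic-vs-benign classifier
print(round(mcc(tp=80, tn=85, fp=15, fn=20), 3))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), and it remains informative when the two classes are imbalanced.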

Subjects

Databases, Protein; Proteins; Humans; Amino Acid Substitution; Cystic Fibrosis Transmembrane Conductance Regulator/genetics; Cystic Fibrosis Transmembrane Conductance Regulator/chemistry; Mutation, Missense; Protein Folding; Proteins/chemistry; Proteins/genetics

19.

Mutual annotation-based prediction of protein domain functions with Domain2GO.

Ulusoy, Erva; Dogan, Tunca.

Protein Sci ; 33(6): e4988, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38757367

ABSTRACT

Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
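The co-annotation and propagation steps described in this abstract can be illustrated with a small counting sketch. All identifiers below (protein, domain, and GO term names, plus the `min_count` threshold) are hypothetical, and the statistical resampling that the authors use to filter reliable associations is not implemented; a plain count threshold is used in its place purely as an assumption for illustration.

```python
from collections import Counter
from itertools import product


def co_annotation_counts(protein_domains, protein_go_terms):
    """Count how often each (domain, GO term) pair is annotated on the same protein."""
    counts = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go_terms.get(protein, set())
        for pair in product(domains, go_terms):
            counts[pair] += 1
    return counts


def propagate(domains, counts, min_count=2):
    """Predict GO terms for a protein from its domains, keeping pairs seen often enough."""
    return {go for (dom, go), n in counts.items() if dom in domains and n >= min_count}


# Toy annotations with hypothetical identifiers
protein_domains = {"P1": {"dom_A"}, "P2": {"dom_A", "dom_B"}, "P3": {"dom_B"}}
protein_go_terms = {"P1": {"GO_x"}, "P2": {"GO_x", "GO_y"}, "P3": {"GO_y"}}

counts = co_annotation_counts(protein_domains, protein_go_terms)
print(sorted(propagate({"dom_A"}, counts)))
```

Because each prediction traces back to a specific domain and its count, the output stays interpretable, which matches the advantage the abstract highlights.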

Subjects

Molecular Sequence Annotation; Protein Domains; Proteins; Proteins/chemistry; Proteins/metabolism; Proteins/genetics; Databases, Protein; Computational Biology/methods; Gene Ontology; Humans; Software

20.

Peptriever: a Bi-Encoder approach for large-scale protein-peptide binding search.

Gurvich, Roni; Markel, Gal; Tanoli, Ziaurrehman; Meirson, Tomer.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38710496

ABSTRACT

MOTIVATION: Peptide therapeutics hinge on the precise interaction between a tailored peptide and its designated receptor; mitigating interactions with alternate receptors is equally indispensable. Existing methods primarily estimate the binding score between protein and peptide pairs. However, for a specific peptide without a corresponding protein, it is challenging to identify the proteins it could bind due to the sheer number of potential candidates. RESULTS: We propose a transformer-based protein embedding scheme in this study that can quickly identify and rank millions of interacting proteins. Furthermore, the proposed approach outperforms existing sequence- and structure-based methods, with a mean AUC-ROC and AUC-PR of 0.73. AVAILABILITY AND IMPLEMENTATION: Training data, scripts, and fine-tuned parameters are available at https://github.com/RoniGurvich/Peptriever. The proposed method is linked with a web application available for customized prediction at https://peptriever.app/.
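A bi-encoder search of the kind named in the title can be sketched as follows: peptides and proteins are embedded independently, so protein vectors can be computed once and a new peptide is ranked against all of them by vector similarity. The vectors, the protein names, and the choice of cosine similarity are assumptions for illustration; the abstract does not state which similarity measure Peptriever uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_proteins(peptide_vec, protein_vecs, top_k=3):
    """Rank precomputed protein embeddings by similarity to a peptide embedding."""
    scored = [(name, cosine(peptide_vec, vec)) for name, vec in protein_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings (toy values, not output of any real encoder)
protein_vecs = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.0, 1.0, 0.0],
    "protein_C": [0.7, 0.7, 0.1],
}
peptide_vec = [1.0, 0.2, 0.0]

for name, score in rank_proteins(peptide_vec, protein_vecs):
    print(name, round(score, 3))
```

Since the protein side never depends on the query, the comparison reduces to vector operations, which is what makes searching very large candidate sets practical.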

Subjects

Peptides; Protein Binding; Proteins; Software; Peptides/chemistry; Peptides/metabolism; Proteins/chemistry; Proteins/metabolism; Algorithms; Computational Biology/methods; Databases, Protein
