Results 1 - 20 of 17,898
1.
Nat Commun ; 15(1): 6601, 2024 Aug 04.
Article in English | MEDLINE | ID: mdl-39097570

ABSTRACT

Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts rely heavily on protein structural information to predict annotations of varying quality. Here, we present PhiGnet, a method that utilizes statistics-informed graph networks to predict a protein's functions solely from its sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.


Subjects
Computational Biology , Proteins , Proteins/metabolism , Proteins/chemistry , Computational Biology/methods , Deep Learning , Databases, Protein , Algorithms , Humans
2.
PLoS One ; 19(8): e0308425, 2024.
Article in English | MEDLINE | ID: mdl-39106255

ABSTRACT

Ligand binding site prediction is a crucial initial step in structure-based drug discovery. Although several methods have been proposed previously, including those using geometry-based and machine learning techniques, their accuracy is still considered insufficient. In this study, we introduce an approach that leverages a graph transformer neural network to rank the results of a geometry-based pocket detection method. We also created a larger training dataset than the conventionally used sc-PDB and investigated the correlation between dataset size and prediction performance. Our findings indicate that utilizing a graph transformer-based method alongside a larger training dataset could enhance the performance of ligand binding site prediction.


Subjects
Neural Networks, Computer , Ligands , Binding Sites , Proteins/chemistry , Proteins/metabolism , Machine Learning , Protein Binding , Drug Discovery/methods , Algorithms , Databases, Protein
3.
Nat Commun ; 15(1): 6699, 2024 Aug 07.
Article in English | MEDLINE | ID: mdl-39107330

ABSTRACT

Post-translational modifications (PTMs) are pivotal in modulating protein functions and influencing cellular processes like signaling, localization, and degradation. The complexity of these biological interactions necessitates efficient predictive methodologies. In this work, we introduce PTMGPT2, an interpretable protein language model that utilizes prompt-based fine-tuning to improve its accuracy in precisely predicting PTMs. Drawing inspiration from recent advancements in GPT-based architectures, PTMGPT2 adopts unsupervised learning to identify PTMs. It utilizes a custom prompt to guide the model through the subtle linguistic patterns encoded in amino acid sequences, generating tokens indicative of PTM sites. To provide interpretability, we visualize attention profiles from the model's final decoder layer to elucidate sequence motifs essential for molecular recognition and analyze the effects of mutations at or near PTM sites to offer deeper insights into protein functionality. Comparative assessments reveal that PTMGPT2 outperforms existing methods across 19 PTM types, underscoring its potential in identifying disease associations and drug targets.


Subjects
Protein Processing, Post-Translational , Proteins/metabolism , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Humans , Computational Biology/methods , Algorithms , Databases, Protein
4.
Biotechnol J ; 19(8): e2400203, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39115336

ABSTRACT

Through iterative rounds of mutation and selection, proteins can be engineered to enhance their desired biological functions. Nevertheless, identifying optimal mutation sites for directed evolution remains challenging due to the vastness of the protein sequence landscape and the epistatic mutational effects across residues. To address this challenge, we introduce MLSmut, a deep learning-based approach that leverages multi-level structural features of proteins. MLSmut extracts salient information from protein co-evolution, sequence semantics, and geometric features to predict the mutational effect. Extensive benchmark evaluations on 10 single-site and two multi-site deep mutational scanning datasets demonstrate that MLSmut surpasses existing methods in predicting mutational outcomes. To overcome the limited availability of training data, we employ a two-stage training strategy: initial coarse-tuning on a large corpus of unlabeled protein data followed by fine-tuning on a curated dataset of 40-100 experimental measurements. This approach enables our model to achieve satisfactory performance on downstream protein prediction tasks. Importantly, our model holds the potential to predict the mutational effects of any protein sequence. Collectively, these findings suggest that our approach can substantially reduce the reliance on laborious wet lab experiments and deepen our understanding of the intricate relationships between mutations and protein function.


Subjects
Deep Learning , Mutation , Proteins , Proteins/genetics , Proteins/chemistry , Computational Biology/methods , Databases, Protein , Protein Engineering/methods
5.
Protein Sci ; 33(9): e5097, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39145402

ABSTRACT

Disulfide bonds, covalently formed by sulfur atoms in cysteine residues, play a crucial role in protein folding and structure stability. Considering their significance, artificial disulfide bonds are often introduced to enhance protein thermostability. Although an increasing number of tools can assist with this task, significant amounts of time and resources are often wasted owing to inadequate consideration. To enhance the accuracy and efficiency of designing disulfide bonds for protein thermostability improvement, we first collected disulfide bond and protein thermostability data from extensive literature sources. Thereafter, we extracted various sequence- and structure-based features and constructed machine-learning models to predict whether disulfide bonds can improve protein thermostability. Among all models, the neighborhood context model based on the Adaboost-DT algorithm performed the best, yielding an area under the receiver operating characteristic curve of 0.773 and an accuracy of 0.714. Furthermore, we found that AlphaFold2 performed particularly well in predicting disulfide bonds and that, to some extent, the coevolutionary relationship between residue pairs potentially guided artificial disulfide bond design. Moreover, several mutants of imine reductase 89 (IR89) with artificially designed thermostable disulfide bonds were experimentally proven to be considerably efficient for substrate catalysis. The SS-bond data have been integrated into an online server, namely, ThermoLink, available at guolab.mpu.edu.mo/thermoLink.


Subjects
Disulfides , Machine Learning , Disulfides/chemistry , Databases, Protein , Enzyme Stability , Models, Molecular , Protein Folding
6.
Protein Sci ; 33(9): e5140, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39145441

ABSTRACT

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.


Subjects
Algorithms , Databases, Protein , Proteins , Proteins/chemistry , Proteins/metabolism , Cluster Analysis , Computational Biology/methods , Protein Domains
8.
J Chem Inf Model ; 64(15): 6216-6229, 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39092854

ABSTRACT

The critical importance of accurately predicting mutations in protein metal-binding sites for advancing drug discovery and enhancing disease diagnostic processes cannot be overstated. In response to this imperative, MetalTrans emerges as an accurate predictor for disease-associated mutations in protein metal-binding sites. The core innovation of MetalTrans lies in its seamless integration of multifeature splicing with the Transformer framework, a strategy that ensures exhaustive feature extraction. Central to MetalTrans's effectiveness is its deep feature combination strategy, which merges evolutionary-scale modeling amino acid embeddings with ProtTrans embeddings, thus shedding light on the biochemical properties of proteins. Employing the Transformer component, MetalTrans leverages the self-attention mechanism to delve into higher-level representations. Utilizing mutation site information for feature fusion not only enriches the feature set but also sidesteps the common pitfall of overestimation linked to protein sequence-based predictions. This nuanced approach to feature fusion is a key differentiator, enabling MetalTrans to outperform existing methods significantly, as evidenced by comparative analyses. Our evaluations across varied metal binding site data sets (specifically Zn, Ca, Mg, and Mix) underscore MetalTrans's superior performance, with average AUC values of 0.971, 0.965, 0.980, and 0.945, respectively, on multiple 5-fold cross-validation. Remarkably, against the multichannel convolutional neural network method on a benchmark independent test set, MetalTrans demonstrated unparalleled robustness and superiority, boasting an AUC score of 0.998 on multiple 5-fold cross-validation. Our comprehensive examination of the predicted outcomes further confirms the effectiveness of the model. The source code, data sets, and prediction results for MetalTrans can be accessed for academic usage at https://github.com/EduardWang/MetalTrans.


Subjects
Metals , Mutation , Binding Sites , Metals/chemistry , Metals/metabolism , Humans , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Models, Molecular , Computational Biology/methods , Databases, Protein
9.
Database (Oxford) ; 2024: 0, 2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39126203

ABSTRACT

A structural alteration in copper/zinc superoxide dismutase (SOD1) is one of the common features caused by amyotrophic lateral sclerosis (ALS)-linked mutations. Although a large number of SOD1 variants have been reported in ALS patients, the detailed structural properties of each variant are not well summarized. We present SoDCoD, a database of superoxide dismutase conformational diversity, collecting our comprehensive biochemical analyses of the structural changes in SOD1 caused by ALS-linked gene mutations and other perturbations. SoDCoD version 1.0 contains information about the properties of 188 types of SOD1 mutants, including structural changes and their binding to Derlin-1, as well as a set of genes contributing to the proteostasis of mutant-like wild-type SOD1. This database provides valuable insights into the diagnosis and treatment of ALS, particularly by targeting conformational alterations in SOD1. Database URL: https://fujisawagroup.github.io/SoDCoDweb/.


Subjects
Amyotrophic Lateral Sclerosis , Mutation , Superoxide Dismutase-1 , Amyotrophic Lateral Sclerosis/genetics , Amyotrophic Lateral Sclerosis/enzymology , Humans , Superoxide Dismutase-1/genetics , Superoxide Dismutase-1/chemistry , Superoxide Dismutase-1/metabolism , Databases, Protein , Protein Conformation , Databases, Genetic , Superoxide Dismutase/genetics , Superoxide Dismutase/chemistry , Superoxide Dismutase/metabolism
10.
BMC Bioinformatics ; 25(1): 256, 2024 Aug 04.
Article in English | MEDLINE | ID: mdl-39098908

ABSTRACT

BACKGROUND: Antioxidant proteins are involved in several biological processes and can protect DNA and cells from the damage of free radicals. These proteins regulate the body's oxidative stress and perform a significant role in many antioxidant-based drugs. The current in vitro approaches are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins. METHODS: In this work, we propose StackedEnC-AOP, an accurate prediction method to discriminate antioxidant proteins. The training sequences are encoded by incorporating a discrete wavelet transform (DWT) into the evolutionary matrix, decomposing the PSSM-based images via two levels of DWT to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect the structural and sequential descriptors. Then the combined vector of sequential features, evolutionary descriptors, and physicochemical properties is produced to cover the flaws of individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy and maximum relevance (mRMR). The optimal feature vector is used to train a stacking-based ensemble meta-model. RESULTS: Our StackedEnC-AOP method achieved a prediction accuracy of 98.40% and an AUC of 0.99 on the training sequences. On an independent set, the trained StackedEnC-AOP model achieved an accuracy of 96.92% and an AUC of 0.98. CONCLUSION: Our proposed StackedEnC-AOP strategy performed significantly better than current computational models, with ~5% and ~3% improved accuracy on the training and independent sets, respectively. The efficacy and consistency of our proposed StackedEnC-AOP make it a valuable tool for data scientists, and it can play a key role in academic research and drug design.


Subjects
Antioxidants , Proteins , Antioxidants/chemistry , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Machine Learning , Algorithms , Wavelet Analysis , Support Vector Machine , Databases, Protein , Position-Specific Scoring Matrices
11.
Nat Commun ; 15(1): 6510, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39095347

ABSTRACT

Shotgun proteomics analysis presents multifaceted challenges, demanding diverse tool integration for insights. Addressing this complexity, OmicScope emerges as an innovative solution for quantitative proteomics data analysis. Engineered to handle various data formats, it performs data pre-processing (including joining replicates, normalization, and data imputation) and conducts differential proteomics analysis for both static and longitudinal experimental designs. Empowered by Enrichr with over 224 databases, OmicScope performs Over Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA). Additionally, its Nebula module facilitates meta-analysis from independent datasets, providing a systems biology approach for enriched insights. Complete with a data visualization toolkit and accessible as a Python package and a web application, OmicScope democratizes proteomics analysis, offering an efficient and high-quality pipeline for researchers.


Subjects
Proteomics , Software , Proteomics/methods , Systems Biology/methods , Humans , Databases, Protein , Computational Biology/methods
12.
Proc Natl Acad Sci U S A ; 121(34): e2315005121, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39133858

ABSTRACT

The process of protein phase separation into liquid condensates has been implicated in the formation of membraneless organelles (MLOs), which selectively concentrate biomolecules to perform essential cellular functions. Although the importance of this process in health and disease is increasingly recognized, the experimental identification of proteins forming MLOs remains a complex challenge. In this study, we addressed this problem by harnessing the power of AlphaFold2 to perform computational predictions of the conformational properties of proteins from their amino acid sequences. We thus developed the CoDropleT (co-condensation into droplet transformer) method of predicting the propensity of co-condensation of protein pairs. The method was trained by combining experimental datasets of co-condensing proteins from the CD-CODE database with curated negative datasets of non-co-condensing proteins. To illustrate the performance of the method, we applied it to estimate the propensity of proteins to co-condense into MLOs. Our results suggest that CoDropleT could facilitate functional and therapeutic studies on protein condensation by predicting the composition of protein condensates.


Subjects
Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Organelles/metabolism , Protein Conformation , Databases, Protein , Amino Acid Sequence
13.
Bioinformatics ; 40(8), 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39078213

ABSTRACT

SUMMARY: Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) is a powerful protein characterization technique that provides insights into protein dynamics and flexibility at the peptide level. However, analyzing HDX-MS data presents a significant challenge due to the wealth of information it generates. Each experiment produces data for hundreds of peptides, often measured in triplicate across multiple time points. Comparisons between different protein states create distinct datasets containing thousands of peptides that require matching, rigorous statistical evaluation, and visualization. Our open-source R package, HDXBoxeR, is a comprehensive tool designed to facilitate statistical analysis and comparison of multiple sets among samples and time points for different protein states, along with data visualization. AVAILABILITY AND IMPLEMENTATION: HDXBoxeR is accessible as the R package (https://cran.r-project.org/web//packages/HDXBoxeR) and GitHub: mkajano/HDXBoxeR.


Subjects
Hydrogen Deuterium Exchange-Mass Spectrometry , Proteins , Software , Proteins/chemistry , Hydrogen Deuterium Exchange-Mass Spectrometry/methods , Peptides/chemistry , Databases, Protein , Mass Spectrometry/methods
14.
Brief Bioinform ; 25(5), 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39073830

ABSTRACT

The annotation of enzyme function is a fundamental challenge in industrial biotechnology and pathologies. Numerous computational methods have been proposed to predict enzyme function by annotating enzyme labels with Enzyme Commission numbers. However, the existing methods face difficulties in modelling the hierarchical structure of enzyme labels in a global view. Moreover, they do not fully leverage the mutual interactions between different levels of enzyme labels. In this paper, we formulate the hierarchy of enzyme labels as a directed enzyme graph and propose a hierarchy-GCN (Graph Convolutional Network) encoder to globally model enzyme label dependency on the enzyme graph. Based on the enzyme hierarchy encoder, we develop an end-to-end hierarchical-aware global model named GloEC to predict enzyme function. GloEC learns hierarchical-aware enzyme label embeddings via the hierarchy-GCN encoder and conducts deductive fusion of label-aware enzyme features to predict enzyme labels. Meanwhile, our hierarchy-GCN encoder is designed to compute bidirectionally, investigating the enzyme label correlation information in both bottom-up and top-down manners, which has not been explored in enzyme function prediction. Comparative experiments on three benchmark datasets show that GloEC achieves better predictive performance than the existing methods. The case studies also demonstrate that GloEC is capable of effectively predicting the function of isoenzymes. GloEC is available at: https://github.com/hyr0771/GloEC.


Subjects
Computational Biology , Enzymes , Enzymes/metabolism , Enzymes/chemistry , Computational Biology/methods , Algorithms , Databases, Protein
15.
Database (Oxford) ; 2024, 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39066515

ABSTRACT

Biological databases serve as critical foundations for modern research, and amid the dynamic landscape of biology, the COVID-19 database has emerged as an indispensable resource. The global outbreak of COVID-19, which began in December 2019, necessitates comprehensive databases to unravel the intricate connections between this novel virus and cancer. Despite existing databases, a crucial need persists for a centralized and accessible method to acquire precise information within the research community. The main aim of this work is to develop a database that makes all COVID-19-related data available in just one click, with automatic global notifications. This gap is addressed by the meticulously designed COVID-19 Pandemic Database (CO-19 PDB 2.0), positioned as a comprehensive resource for researchers navigating the complexities of COVID-19 and cancer. Between December 2019 and June 2024, the CO-19 PDB 2.0 systematically collected and organized 120 datasets into six distinct categories, each catering to specific functionalities. These categories encompass a chemical structure database, a digital image database, a visualization tool database, a genomic database, a social science database, and a literature database. Functionalities range from image analysis and gene sequence information to data visualization and updates on environmental events. CO-19 PDB 2.0 lets users choose either the search page for the database or the autonotification page, providing seamless retrieval of information. The dedicated page introduces six predefined charts, providing insights into crucial criteria such as 'number of cases and deaths', 'country-wise distribution', 'new cases and recovery', and 'rates of death and recovery'. The global impact of COVID-19 on cancer patients has led to extensive collaboration among research institutions, producing numerous articles and computational studies published in international journals. A key feature of this initiative is automatic daily notifications for standardized information updates. Users can easily navigate based on different categories or use a direct search option. The study offers up-to-date COVID-19 datasets and global statistics on COVID-19 and cancer, highlighting the top 10 cancers diagnosed in the USA in 2022. Breast and prostate cancers are the most common, representing 30% and 26% of new cases, respectively. The initiative also ensures the removal or replacement of dead links, providing a valuable resource for researchers, healthcare professionals, and individuals. The database has been implemented in PHP, HTML, CSS and MySQL and is available freely at https://www.co-19pdb.habdsk.org/. Database URL: https://www.co-19pdb.habdsk.org/.


Subjects
COVID-19 , Neoplasms , Pandemics , SARS-CoV-2 , COVID-19/epidemiology , COVID-19/virology , Humans , Neoplasms/epidemiology , Databases, Factual , Coronavirus Infections/epidemiology , Coronavirus Infections/virology , Pneumonia, Viral/epidemiology , Pneumonia, Viral/virology , Betacoronavirus , Databases, Protein
16.
Interdiscip Sci ; 16(2): 261-288, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38955920

ABSTRACT

Protein complexes perform diverse biological functions, and obtaining their three-dimensional structure is critical to understanding their functions. In many cases, it's not just two proteins interacting to form a dimer; instead, multiple proteins interact to form a multimer. Experimentally resolving protein complex structures can be quite challenging. Recently, there have been efforts and methods that build upon prior predictions of dimer structures to attempt to predict multimer structures. However, in comparison to monomeric protein structure prediction, the accuracy of protein complex structure prediction remains relatively low. This paper provides an overview of recent advancements in efficient computational models for predicting protein complex structures. We introduce protein-protein docking methods in detail and summarize their main ideas, applicable modes, and related information. To enhance prediction accuracy, other critical protein-related information is also integrated, such as predicting interchain residue contact, utilizing experimental data like cryo-EM experiments, and considering protein interactions and non-interactions. In addition, we comprehensively review computational approaches for end-to-end prediction of protein complex structures based on artificial intelligence (AI) technology and describe commonly used datasets and representative evaluation metrics in protein complexes. Finally, we analyze the formidable challenges faced in current protein complex structure prediction tasks, including the structure prediction of heteromeric complexes, disordered regions in complexes, antibody-antigen complexes, and RNA-related complexes, as well as the evaluation metrics for complex assessment. We hope that this work will provide comprehensive knowledge of complex structure predictions to contribute to future advanced predictions.


Subjects
Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Protein Conformation , Molecular Docking Simulation , Artificial Intelligence , Databases, Protein
17.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-39038936

ABSTRACT

Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component in most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND (one of the most popular tools for function prediction), under default search parameters. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function to derive GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes in their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should make an important contribution to the development of future protein function prediction algorithms.
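As a rough illustration of homology-based function transfer, the sketch below scores each GO term by the share of total bit score contributed by the hits that carry it. This is a generic bit-score-weighted vote chosen for clarity, not the new scoring function developed in the paper; the function name, the hit identifiers (P1 to P3), and the annotation table are hypothetical.

```python
from collections import defaultdict


def transfer_go_terms(hits, annotations):
    """Score GO terms for a query from its homologous hits.

    hits: list of (hit_id, bit_score) pairs from a sequence search.
    annotations: dict mapping hit_id to a set of GO term IDs.
    Returns a dict mapping each GO term to a score in [0, 1]: the
    fraction of the total bit score held by hits annotated with it.
    """
    total = sum(score for _, score in hits)
    if total <= 0:
        return {}
    term_score = defaultdict(float)
    for hit_id, score in hits:
        # Hits without annotations still count toward the denominator.
        for term in annotations.get(hit_id, ()):
            term_score[term] += score
    return {term: s / total for term, s in term_score.items()}


# Hypothetical search result and annotation table (illustrative only).
hits = [("P1", 200.0), ("P2", 100.0), ("P3", 100.0)]
annotations = {
    "P1": {"GO:0003824", "GO:0005524"},
    "P2": {"GO:0003824"},
    "P3": {"GO:0005515"},
}
scores = transfer_go_terms(hits, annotations)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, scores[term])
```

Because the denominator includes every hit, a term supported only by weak hits receives a low score even when few hits are returned, which is one simple way to make scores comparable across queries.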


Subjects
Databases, Protein , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Gene Ontology , Algorithms , Sequence Analysis, Protein/methods , Software , Machine Learning
18.
PLoS Comput Biol ; 20(7): e1012302, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39046952

ABSTRACT

Protein kinase function and interactions with drugs are controlled in part by the movement of the DFG and αC-helix motifs that are related to the catalytic activity of the kinase. Small molecule ligands elicit therapeutic effects with distinct selectivity profiles and residence times that often depend on the active or inactive kinase conformation(s) they bind. Modern AI-based structural modeling methods have the potential to expand upon the limited availability of experimentally determined kinase structures in inactive states. Here, we first explored the conformational space of kinases in the PDB and models generated by AlphaFold2 (AF2) and ESMFold, two prominent AI-based protein structure prediction methods. Our investigation of AF2's ability to explore the conformational diversity of the kinome at various multiple sequence alignment (MSA) depths showed a bias within the predicted structures of kinases in DFG-in conformations, particularly those controlled by the DFG motif, based on their overabundance in the PDB. We demonstrate that predicting kinase structures using AF2 at lower MSA depths explored these alternative conformations more extensively, including identifying previously unobserved conformations for 398 kinases. Ligand enrichment analyses for 23 kinases showed that, on average, docked models distinguished between active molecules and decoys better than random (average AUC (avgAUC) of 64.58), but select models perform well (e.g., avgAUCs for PTK2 and JAK2 were 79.28 and 80.16, respectively). Further analysis explained the ligand enrichment discrepancy between low- and high-performing kinase models as binding site occlusions that would preclude docking. The overall results of our analyses suggested that, although AF2 explored previously uncharted regions of the kinase conformational space and select models exhibited enrichment scores suitable for rational drug discovery, rigorous refinement of AF2 models is likely still necessary for drug discovery campaigns.


Subjects
Computational Biology , Protein Conformation , Protein Kinases , Protein Kinases/chemistry , Protein Kinases/metabolism , Models, Molecular , Ligands , Protein Kinase Inhibitors/chemistry , Protein Kinase Inhibitors/pharmacology , Databases, Protein , Humans , Sequence Alignment
19.
Sci Data ; 11(1): 783, 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39019896

ABSTRACT

Protein Data Bank (PDB) files list the relative spatial location of atoms in a protein structure as the final output of the process of fitting and refining to experimentally determined electron density measurements. Where experimental evidence exists for multiple conformations, atoms are modelled in alternate locations. Programs reading PDB files commonly ignore these alternate conformations by default, leaving users oblivious to the presence of alternate conformations in the structures they analyze. This has led to underappreciation of their prevalence and undercharacterisation of their features, and has limited the accessibility of this high-resolution data representing structural ensembles. We have trawled PDB files to extract structural features of residues with alternately located atoms. The output includes the distance between alternate conformations and identifies the location of these segments within the protein chain and in proximity of all other atoms within a defined radius. This dataset should be of use in efforts to predict multiple structures from a single sequence and support studies investigating protein flexibility and the association with protein function.
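To make the idea of alternate locations concrete, here is a minimal sketch that reads the alternate location indicator (column 17 of ATOM/HETATM records in the fixed-column PDB format) and computes the distance between alternate positions of the same atom. It is an illustration under simplifying assumptions, not the authors' extraction pipeline: the function names, the `atom_line` helper, and the sample serine coordinates are made up for the example.

```python
import math
from collections import defaultdict


def parse_altlocs(pdb_lines):
    """Collect alternate positions keyed by (chain, resSeq, iCode, atom name)."""
    sites = defaultdict(dict)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        altloc = line[16]  # column 17: alternate location indicator
        if altloc == " ":
            continue  # atom modelled in a single position
        key = (line[21], int(line[22:26]), line[26], line[12:16].strip())
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        sites[key][altloc] = xyz
    return sites


def altloc_distances(sites):
    """Distance in angstroms between every pair of alternate positions."""
    out = {}
    for key, positions in sites.items():
        labels = sorted(positions)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                out[key + (a, b)] = math.dist(positions[a], positions[b])
    return out


def atom_line(serial, name, altloc, resname, chain, resseq, x, y, z, occ):
    """Hypothetical helper: build one fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s}{altloc}{resname:>3s} {chain}"
            f"{resseq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{20.0:6.2f}")


# Made-up residue: the serine OG atom is modelled in two positions, A and B.
sample = [
    atom_line(1, "CA", " ", "SER", "A", 10, 0.0, 0.0, 0.0, 1.00),
    atom_line(2, "OG", "A", "SER", "A", 10, 1.0, 2.0, 3.0, 0.60),
    atom_line(3, "OG", "B", "SER", "A", 10, 1.0, 5.0, 7.0, 0.40),
]
distances = altloc_distances(parse_altlocs(sample))
for key, d in distances.items():
    print(key, round(d, 3))
```

Keying on chain, residue number, insertion code, and atom name keeps positions from different residues apart; a fuller treatment would also track the model number and occupancy.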


Subjects
Databases, Protein , Protein Conformation , Proteins , Proteins/chemistry , Crystallography, X-Ray , Models, Molecular
20.
Bioinformatics ; 40(7), 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38984796

ABSTRACT

MOTIVATION: The introduction of DeepMind's AlphaFold 2 enabled the prediction of protein structures at an unprecedented scale. AlphaFold Protein Structure Database and ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of terabytes of data, which hinders the effective use of predicted structures in large-scale analyses. RESULTS: Here, we present ProteStAr, a compressor dedicated to CIF/PDB, as well as supplementary PAE files. Its main contribution is a novel approach to predicting atom coordinates on the basis of the previously analyzed atoms. This allows efficient encoding of the coordinates, the largest component of the protein structure files. The compression is lossless by default, though a lossy mode with a controlled maximum error of coordinate reconstruction is also available. Compared to the competing packages, i.e. BinaryCIF, Foldcomp, PDC, our approach offers a superior compression ratio at established reconstruction accuracy. By the efficient use of threads at both compression and decompression stages, the algorithm takes advantage of the multicore architecture of current central processing units and operates with speeds of about 1 GB/s. The presence of Python and C++ APIs further increases the usability of the presented method. AVAILABILITY AND IMPLEMENTATION: The source code of ProteStAr is available at https://github.com/refresh-bio/protestar.
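The general principle of encoding coordinates as residuals against a prediction from earlier atoms can be sketched with the simplest possible predictor: "the next atom sits where the previous one did". This is a deliberately naive stand-in and not ProteStAr's predictor; the function names, the 0.001 Å quantization step (matching the three decimals of PDB coordinates), and the sample coordinates are assumptions for illustration.

```python
def encode(coords, scale=1000):
    """Quantize coordinates and store each atom as an offset from the previous one."""
    prev = (0, 0, 0)
    residuals = []
    for xyz in coords:
        q = tuple(round(v * scale) for v in xyz)  # integer units of 1/scale angstrom
        residuals.append(tuple(c - p for c, p in zip(q, prev)))
        prev = q
    return residuals


def decode(residuals, scale=1000):
    """Invert encode(): accumulate the offsets and rescale to angstroms."""
    prev = (0, 0, 0)
    coords = []
    for r in residuals:
        q = tuple(p + d for p, d in zip(prev, r))
        coords.append(tuple(v / scale for v in q))
        prev = q
    return coords


# Made-up coordinates for three consecutive backbone atoms.
coords = [(11.104, 6.134, -6.504), (11.639, 6.071, -5.147), (10.407, 6.045, -4.263)]
residuals = encode(coords)
decoded = decode(residuals)
print(residuals)
```

After the first atom, the residuals are small integers bounded by bond lengths, so an entropy coder can store them in fewer bits than the absolute coordinates; a better predictor shrinks them further.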


Subjects
Algorithms , Databases, Protein , Proteins , Software , Proteins/chemistry , Protein Conformation , Data Compression/methods , Computational Biology/methods