Results 1 - 20 of 25
1.
Mol Cell ; 83(22): 3950-3952, 2023 Nov 16.
Article in English | MEDLINE | ID: mdl-37977115

ABSTRACT

Two recent studies exploited ultra-fast structural aligners and deep-learning approaches to cluster the protein structure space in the AlphaFold Database. Barrio-Hernandez et al.¹ and Durairaj et al.² uncovered fascinating protein functions and structural features that were previously unknown.


Subject(s)
Cluster Analysis , Databases, Factual
2.
Trends Biochem Sci ; 48(4): 345-359, 2023 04.
Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Computational Biology/methods , Protein Conformation
3.
Bioinformatics ; 40(5), 2024 May 02.
Article in English | MEDLINE | ID: mdl-38718225

ABSTRACT

MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.


Subject(s)
Algorithms , Neural Networks, Computer , Protein Domains , Proteins , Proteins/chemistry , Databases, Protein , Computational Biology/methods , Software , Humans
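
The Chainsaw abstract above describes deriving domains from pairwise co-membership probabilities. The sketch below illustrates that general idea only. It is not Chainsaw's algorithm: the greedy merging strategy, the function names `log_odds` and `greedy_domains`, and the toy probabilities are all assumptions made for illustration. Each residue starts in its own group, and groups are merged whenever doing so raises the log-likelihood of the assignment under the pairwise probabilities.

```python
import math
from itertools import combinations


def log_odds(p, eps=1e-9):
    """Log-likelihood gain from placing a residue pair in the same domain."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return math.log(p) - math.log(1 - p)


def greedy_domains(prob):
    """Group residues given prob[i][j], the (symmetric) probability that
    residues i and j belong to the same domain.

    Hypothetical greedy search: repeatedly merge the two groups whose merge
    gives the largest positive log-likelihood gain, and stop when no merge helps.
    """
    clusters = [[i] for i in range(len(prob))]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(range(len(clusters)), 2):
            gain = sum(log_odds(prob[i][j])
                       for i in clusters[a] for j in clusters[b])
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves the likelihood
            break
        a, b = best_pair  # a < b, so deleting b keeps index a valid
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Toy input: residues 0-2 and residues 3-4 form two tightly linked groups.
toy = [[0.9 if (i < 3) == (j < 3) else 0.1 for j in range(5)] for i in range(5)]
print(greedy_domains(toy))
```

A greedy search like this can get stuck in local optima; it is shown only because it makes the link between pairwise probabilities and a domain assignment easy to follow.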
4.
Brief Bioinform ; 23(4), 2022 07 18.
Article in English | MEDLINE | ID: mdl-35641150

ABSTRACT

Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Data Bank. We noticed that the model quality was higher and the root mean square deviation (RMSD) between AlphaFold and RoseTTAFold models was lower for domains that could be assigned to CATH families than for those that could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, and predicted to be destabilizing and pathogenic. Using models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.


Subject(s)
Mutation, Missense , Proteins , Databases, Protein , Humans , Models, Molecular , Mutation , Proteins/chemistry , Proteins/genetics
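
The abstract above compares AlphaFold and RoseTTAFold models by RMSD. As a reminder of what that number measures, for N paired atoms RMSD = sqrt((1/N) * sum of squared distances between matched atoms). The sketch below is a minimal illustration, assuming the two coordinate sets are already superposed and listed in the same atom order; the function name `rmsd` and the toy coordinates are hypothetical, and the abstract does not describe how the authors superposed their models.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)
    coordinates. Assumes the structures are already optimally superposed and
    that atom k in coords_a corresponds to atom k in coords_b."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: two atoms, the second displaced by 2 units along z.
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model_1, model_2))
```

In practice the superposition step (for example with the Kabsch algorithm) matters as much as the formula, since RMSD computed on unaligned coordinates is not meaningful.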
5.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available at https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed at https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Proteins , Humans , Sequence Homology, Amino Acid , Proteins/chemistry , Databases, Protein
6.
Nucleic Acids Res ; 49(D1): D266-D273, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33237325

ABSTRACT

CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully classified domain structures (+15%), providing 500,238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web-pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH including SCOP, InterPro, Aquaria and 2DProt.


Subject(s)
Computational Biology/statistics & numerical data , Databases, Protein/statistics & numerical data , Protein Domains , Proteins/chemistry , Amino Acid Sequence , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Epidemics , Humans , Internet , Molecular Sequence Annotation , Proteins/genetics , Proteins/metabolism , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/physiology , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Viral Proteins/chemistry , Viral Proteins/genetics , Viral Proteins/metabolism
7.
J Cell Sci ; 133(16), 2020 08 17.
Article in English | MEDLINE | ID: mdl-32665322

ABSTRACT

The yeast Hansenula polymorpha contains four members of the Pex23 family of peroxins, which characteristically contain a DysF domain. Here we show that all four H. polymorpha Pex23 family proteins localize to the endoplasmic reticulum (ER). Pex24 and Pex32, but not Pex23 and Pex29, predominantly accumulate at peroxisome-ER contacts. Upon deletion of PEX24 or PEX32 - and to a much lesser extent, of PEX23 or PEX29 - peroxisome-ER contacts are lost, concomitant with defects in peroxisomal matrix protein import, membrane growth, and organelle proliferation, positioning and segregation. These defects are suppressed by the introduction of an artificial peroxisome-ER tether, indicating that Pex24 and Pex32 contribute to tethering of peroxisomes to the ER. Accumulation of Pex32 at these contact sites is lost in cells lacking the peroxisomal membrane protein Pex11, in conjunction with disruption of the contacts. This indicates that Pex11 contributes to Pex32-dependent peroxisome-ER contact formation. The absence of Pex32 has no major effect on pre-peroxisomal vesicles that occur in pex3 atg1 deletion cells.


Subject(s)
Peroxisomes , Saccharomyces cerevisiae Proteins , Endoplasmic Reticulum/genetics , Membrane Proteins/genetics , Organelle Biogenesis , Peroxins/genetics , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Saccharomycetales
8.
Bioinformatics ; 37(20): 3449-3455, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33978744

ABSTRACT

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
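
The abstract above clusters embedding vectors with DBSCAN and treats unclustered points as outliers. The following is a minimal, standard-library sketch of DBSCAN itself, included only to show how density-based clustering separates clusters from outliers. It is not the authors' code: the 2D toy points stand in for high-dimensional protein embeddings, and the `eps` and `min_pts` values are arbitrary assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point if at least min_pts points (itself included)
    lie within Euclidean distance eps of it.
    """
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def neighbours(i):
        return [j for j, q in enumerate(points) if distance(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels


# Toy "embeddings": two dense groups and one isolated outlier.
toy_points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
print(dbscan(toy_points, eps=0.5, min_pts=3))
```

The quadratic neighbour search is fine for a toy example; real embedding sets would normally use a library implementation with spatial indexing.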

9.
Mol Syst Biol ; 17(9): e10079, 2021 09.
Article in English | MEDLINE | ID: mdl-34519429

ABSTRACT

We modeled 3D structures of all SARS-CoV-2 proteins, generating 2,060 models that span 69% of the viral proteome and provide details not available elsewhere. We found that ~6% of the proteome mimicked human proteins, while ~7% was implicated in hijacking mechanisms that reverse post-translational modifications, block host translation, and disable host defenses; a further ~29% self-assembled into heteromeric states that provided insight into how the viral replication and translation complex forms. To make these 3D models more accessible, we devised a structural coverage map, a novel visualization method to show what is (and is not) known about the 3D structure of the viral proteome. We integrated the coverage map into an accompanying online resource (https://aquaria.ws/covid) that can be used to find and explore models corresponding to the 79 structural states identified in this work. The resulting Aquaria-COVID resource helps scientists use emerging structural data to understand the mechanisms underlying coronavirus infection and draws attention to the 31% of the viral proteome that remains structurally unknown or dark.


Subject(s)
Angiotensin-Converting Enzyme 2/metabolism , Host-Pathogen Interactions/genetics , Protein Processing, Post-Translational , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/metabolism , Amino Acid Transport Systems, Neutral/chemistry , Amino Acid Transport Systems, Neutral/genetics , Amino Acid Transport Systems, Neutral/metabolism , Angiotensin-Converting Enzyme 2/chemistry , Angiotensin-Converting Enzyme 2/genetics , Binding Sites , COVID-19/genetics , COVID-19/metabolism , COVID-19/virology , Computational Biology/methods , Coronavirus Envelope Proteins/chemistry , Coronavirus Envelope Proteins/genetics , Coronavirus Envelope Proteins/metabolism , Coronavirus Nucleocapsid Proteins/chemistry , Coronavirus Nucleocapsid Proteins/genetics , Coronavirus Nucleocapsid Proteins/metabolism , Humans , Mitochondrial Membrane Transport Proteins/chemistry , Mitochondrial Membrane Transport Proteins/genetics , Mitochondrial Membrane Transport Proteins/metabolism , Mitochondrial Precursor Protein Import Complex Proteins , Models, Molecular , Molecular Mimicry , Neuropilin-1/chemistry , Neuropilin-1/genetics , Neuropilin-1/metabolism , Phosphoproteins/chemistry , Phosphoproteins/genetics , Phosphoproteins/metabolism , Protein Binding , Protein Conformation, alpha-Helical , Protein Conformation, beta-Strand , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Protein Multimerization , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Viral Matrix Proteins/chemistry , Viral Matrix Proteins/genetics , Viral Matrix Proteins/metabolism , Viroporin Proteins/chemistry , Viroporin Proteins/genetics , Viroporin Proteins/metabolism , Virus Replication
10.
Bioinformatics ; 34(22): 3937-3938, 2018 11 15.
Article in English | MEDLINE | ID: mdl-29931249

ABSTRACT

Summary: We introduce ICBdocker, a Docker environment that allows the annotation of functional and structural features of proteomes through a Python/Perl pipeline. DataTables pages make it easy to set up a web-resource for research groups with a focus on the same organisms or datasets. The results are available as tab-separated values files and HTML, allowing data analysis and browsing. The pipeline focuses on modularity and scalability, and can be integrated with multi-processing and high-performance computing clusters. Availability and implementation: ICBdocker is freely available on DockerHub at https://hub.docker.com/r/bordin89/icb/. Source code and documentation are available on GitHub at https://github.com/bordin89/ICB_docker.


Subject(s)
Proteome , Software , Computational Biology , Proteomics
11.
Genomics ; 110(5): 231-238, 2018 09.
Article in English | MEDLINE | ID: mdl-29074368

ABSTRACT

Planctomycetes are bacteria with complex molecular and cellular biology. They have large genomes, some over 7 Mb, and complex life cycles that include motile cells and sessile cells. Some live on the complex biofilm of macroalgae. Factors governing their life in this environment were investigated at the genomic level. We analyzed the genomes of three planctomycetes isolated from algal surfaces. The genomes ranged from 6.6 Mbp to 8.1 Mbp in size. Genes for outer-membrane proteins, peptidoglycan and lipopolysaccharide biosynthesis were present. Rubripirellula obstinata LF1T, Roseimaritima ulvae UC8T and Mariniblastus fucicola FC18T shared with Rhodopirellula baltica and R. rubra SWK7 unique proteins related to metal binding systems, phosphate metabolism, chemotaxis, and stress response. These functions may contribute to their ecological success in such a complex environment. Exceptionally large proteins (6000 to 10,000 amino acids) with extracellular, periplasmic or membrane-associated locations were found, which may be involved in biofilm formation or cell adhesion.


Subject(s)
Genome, Bacterial , Planctomycetales/genetics , Bacterial Outer Membrane Proteins/genetics , Biofilms , Chlorophyta/microbiology , Lipopolysaccharides/biosynthesis , Lipopolysaccharides/genetics , Phaeophyceae/microbiology , Planctomycetales/pathogenicity , Planctomycetales/physiology , Proteoglycans/genetics
12.
BMC Microbiol ; 18(1): 133, 2018 10 16.
Article in English | MEDLINE | ID: mdl-30326838

ABSTRACT

BACKGROUND: Bacillus licheniformis GL174 is a culturable endophytic strain isolated from Vitis vinifera cultivar Glera, the grapevine mainly cultivated for the production of Prosecco wine. This strain was previously demonstrated to possess some specific plant growth promoting traits, but its endophytic behavior and its role in biocontrol were only partially explored. In this study, the potential biocontrol action of the strain was investigated in vitro and in vivo and, by genome sequence analyses, putative functions involved in biocontrol and plant-bacteria interaction were assessed. RESULTS: Firstly, to confirm the endophytic behavior of the strain, its ability to colonize grapevine tissues was demonstrated and its biocontrol properties were analyzed. Antagonism test results showed that the strain could reduce and inhibit the mycelium growth of diverse plant pathogens in vitro and in vivo. The strain was demonstrated to produce different molecules of the lipopeptide class; moreover, its genome was sequenced, and analysis of the sequences revealed the presence of many protein-coding genes involved in the biocontrol process, such as transporters, plant-cell lytic enzymes, siderophores and other secondary metabolites. CONCLUSIONS: This step-by-step analysis shows that Bacillus licheniformis GL174 may be a good biocontrol agent candidate, and describes some distinguishing traits and possible key elements involved in this process. The use of this strain could potentially help grapevine plants to cope with pathogen attacks and reduce the amount of chemicals used in the vineyard.


Subject(s)
Bacillus licheniformis/physiology , Biological Control Agents , Vitis/microbiology , Bacillus licheniformis/genetics , Biodiversity , Endophytes/genetics , Endophytes/physiology , Genome, Bacterial , Phylogeny , Plant Diseases/microbiology , Plant Leaves/microbiology , Plant Roots/microbiology , Sequence Analysis, DNA , Whole Genome Sequencing
13.
Protein Sci ; 33(9): e5140, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39145441

ABSTRACT

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.


Subject(s)
Algorithms , Databases, Protein , Proteins , Proteins/chemistry , Proteins/metabolism , Cluster Analysis , Computational Biology/methods , Protein Domains
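
The CATH-eMMA abstract above mentions forming relationship trees from distance matrices. The sketch below shows one generic way to do that, average-linkage agglomerative clustering, purely as an illustration. The abstract does not say which tree-building method CATH-eMMA uses, so the linkage choice, the function name `build_tree`, and the toy distances are all assumptions.

```python
def build_tree(names, dist):
    """Build a binary tree (nested tuples) from a symmetric distance matrix
    by repeatedly joining the two groups with the smallest average distance."""
    # Each entry pairs a subtree with the indices of the leaves it contains.
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def average_distance(members_a, members_b):
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        best = None  # (distance, index_x, index_y) of the closest pair so far
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x][1], clusters[y][1])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


# Toy distances (for example embedding or structural distances):
# A and B are close, C and D are close, and the two pairs are far apart.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 2],
    [10, 10, 2, 0],
]
print(build_tree(toy_names, toy_dist))
```

Because the tree is built directly from a precomputed distance matrix, the same routine works whether the distances come from embeddings or from structure comparison, which is the flexibility the abstract points to.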
14.
Sci Rep ; 14(1): 14208, 2024 06 20.
Article in English | MEDLINE | ID: mdl-38902252

ABSTRACT

The COVID-19 disease is an ongoing global health concern. Although vaccination provides some protection, people are still susceptible to re-infection. Ostensibly, certain populations or clinical groups may be more vulnerable. Factors causing these differences are unclear and whilst socioeconomic and cultural differences are likely to be important, human genetic factors could influence susceptibility. Experimental studies indicate SARS-CoV-2 uses innate immune suppression as a strategy to speed up entry into and replication within the host cell. Therefore, it is necessary to understand the impact of variants in immunity-associated human proteins on susceptibility to COVID-19. In this work, we analysed missense coding variants in several SARS-CoV-2 proteins and their human protein interactors that could enhance binding affinity to SARS-CoV-2. We curated a dataset of 19 SARS-CoV-2: human protein 3D-complexes, from the experimentally determined structures in the Protein Data Bank and models built using AlphaFold2-multimer, and analysed the impact of missense variants occurring in the protein-protein interface region. We analysed 468 missense variants from human proteins and 212 variants from SARS-CoV-2 proteins and computationally predicted their impacts on binding affinities for the human-viral protein complexes. We predicted a total of 26 affinity-enhancing variants from 13 human proteins implicated in increased binding affinity to SARS-CoV-2. These include key immunity-associated genes (TOMM70, ISG15, IFIH1, IFIT2, RPS3, PALS1, NUP98, AXL, ARF6, TRIMM, TRIM25) as well as important spike receptors (KREMEN1, AXL and ACE2). We report both common (e.g., Y13N in IFIH1) and rare variants in these proteins and discuss their likely structural and functional impact, using information on known and predicted functional sites. Potential mechanisms associated with immune suppression implicated by these variants are discussed. Occurrence of certain predicted affinity-enhancing variants should be monitored as they could lead to increased susceptibility and reduced immune response to SARS-CoV-2 infection in individuals/populations carrying them. Our analyses aid in understanding the potential impact of genetic variation in immunity-associated proteins on COVID-19 susceptibility and help guide drug-repurposing strategies.


Subject(s)
COVID-19 , Mutation, Missense , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/immunology , COVID-19/genetics , COVID-19/virology , COVID-19/immunology , Drug Repositioning , Viral Proteins/genetics , Viral Proteins/metabolism , Protein Binding , Genetic Predisposition to Disease , Disease Susceptibility , COVID-19 Drug Treatment
15.
J Mol Biol ; 436(17): 168551, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38548261

ABSTRACT

CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.


Subject(s)
Databases, Protein , Protein Folding , Proteins/chemistry , Proteins/metabolism , Protein Conformation , Models, Molecular , Computational Biology/methods , Protein Domains , Animals , Software , Humans
16.
Curr Opin Struct Biol ; 79: 102543, 2023 04.
Article in English | MEDLINE | ID: mdl-36807079

ABSTRACT

The function of proteins can often be inferred from their three-dimensional structures. Experimental structural biologists spent decades studying these structures, but the accelerated pace of protein sequencing continuously increases the gaps between sequences and structures. The early 2020s saw the advent of a new generation of deep learning-based protein structure prediction tools that offer the potential to predict structures based on any number of protein sequences. In this review, we give an overview of the impact of this new generation of structure prediction tools, with examples of the impacted fields in the life sciences. We discuss the novel opportunities and new scientific and technical challenges these tools present to the broader scientific community. Finally, we highlight some potential directions for the future of computational protein structure prediction.


Subject(s)
Deep Learning , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence
17.
Elife ; 12, 2023 10 03.
Article in English | MEDLINE | ID: mdl-37787768

ABSTRACT

Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe for broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of 'priority unstudied' proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through 'guilt by association' with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.


Subject(s)
Schizosaccharomyces pombe Proteins , Schizosaccharomyces , Humans , Phenomics , Schizosaccharomyces pombe Proteins/genetics , Phenotype , Schizosaccharomyces/genetics , Machine Learning
18.
Biomolecules ; 13(2), 2023 02 02.
Article in English | MEDLINE | ID: mdl-36830646

ABSTRACT

Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.


Subject(s)
Protein Kinases , Proteins , Humans , Protein Kinases/metabolism , Proteins/chemistry , Databases, Protein , Sequence Homology, Amino Acid
19.
Commun Biol ; 6(1): 160, 2023 02 08.
Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36% and will provide valuable insights on structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.


Subject(s)
Furylfuramide , Proteins , Humans , Databases, Protein , Proteins/chemistry
20.
NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has a better ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained good performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
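
Embedding-based annotation transfer, as described in the abstract above, replaces a sequence-similarity lookup with a nearest-neighbour lookup in embedding space. The sketch below illustrates that lookup only, under stated assumptions: it uses Euclidean distance, tiny 2D vectors in place of real pLM embeddings, made-up labels (`sf_A`, `sf_B`), and a hypothetical `max_distance` cut-off. It does not reproduce the ProtTucker training procedure or the API of the linked repository.

```python
import math


def transfer_annotation(query, lookup, max_distance=None):
    """Return (label, distance) of the annotated embedding closest to `query`.

    lookup: list of (embedding, label) pairs with known annotations.
    If max_distance is given and the nearest hit is farther away than that,
    no label is transferred and (None, distance) is returned.
    """
    best_label, best_distance = None, math.inf
    for embedding, label in lookup:
        d = math.sqrt(sum((q - e) ** 2 for q, e in zip(query, embedding)))
        if d < best_distance:
            best_label, best_distance = label, d
    if max_distance is not None and best_distance > max_distance:
        return None, best_distance
    return best_label, best_distance


# Toy lookup set: two annotated proteins with hypothetical superfamily labels.
annotated = [((0.0, 0.0), "sf_A"), ((1.0, 1.0), "sf_B")]
print(transfer_annotation((0.1, 0.0), annotated))
print(transfer_annotation((0.1, 0.0), annotated, max_distance=0.05))
```

No alignment is computed at any point, which is why a lookup of this kind can be much faster than alignment-based inference once the embeddings exist.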
