Search | Brasil - Virtual Health Library

1.

Predicting human and viral protein variants affecting COVID-19 susceptibility and repurposing therapeutics.

Waman, Vaishali P; Ashford, Paul; Lam, Su Datt; Sen, Neeladri; Abbasian, Mahnaz; Woodridge, Laurel; Goldtzvik, Yonathan; Bordin, Nicola; Wu, Jiaxin; Sillitoe, Ian; Orengo, Christine A.

Sci Rep ; 14(1): 14208, 2024 06 20.

Article in English | MEDLINE | ID: mdl-38902252

ABSTRACT

The COVID-19 disease is an ongoing global health concern. Although vaccination provides some protection, people are still susceptible to re-infection. Ostensibly, certain populations or clinical groups may be more vulnerable. Factors causing these differences are unclear and whilst socioeconomic and cultural differences are likely to be important, human genetic factors could influence susceptibility. Experimental studies indicate SARS-CoV-2 uses innate immune suppression as a strategy to speed up entry into, and replication in, the host cell. Therefore, it is necessary to understand the impact of variants in immunity-associated human proteins on susceptibility to COVID-19. In this work, we analysed missense coding variants in several SARS-CoV-2 proteins and their human protein interactors that could enhance binding affinity to SARS-CoV-2. We curated a dataset of 19 SARS-CoV-2:human protein 3D complexes, from the experimentally determined structures in the Protein Data Bank and models built using AlphaFold2-multimer, and analysed the impact of missense variants occurring in the protein-protein interface region. We analysed 468 missense variants from human proteins and 212 variants from SARS-CoV-2 proteins and computationally predicted their impacts on binding affinities for the human-viral protein complexes. We predicted a total of 26 affinity-enhancing variants from 13 human proteins implicated in increased binding affinity to SARS-CoV-2. These include key immunity-associated genes (TOMM70, ISG15, IFIH1, IFIT2, RPS3, PALS1, NUP98, AXL, ARF6, TRIMM, TRIM25) as well as important spike receptors (KREMEN1, AXL and ACE2). We report both common (e.g., Y13N in IFIH1) and rare variants in these proteins and discuss their likely structural and functional impact, using information on known and predicted functional sites. Potential mechanisms associated with immune suppression implicated by these variants are discussed.
Occurrence of certain predicted affinity-enhancing variants should be monitored as they could lead to increased susceptibility and reduced immune response to SARS-CoV-2 infection in individuals/populations carrying them. Our analyses aid in understanding the potential impact of genetic variation in immunity-associated proteins on COVID-19 susceptibility and help guide drug-repurposing strategies.

Subject(s)

COVID-19 , Mutation, Missense , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/immunology , COVID-19/genetics , COVID-19/virology , COVID-19/immunology , Drug Repositioning , Viral Proteins/genetics , Viral Proteins/metabolism , Protein Binding , Genetic Predisposition to Disease , Disease Susceptibility , COVID-19 Drug Treatment

2.

Chainsaw: protein domain segmentation with fully convolutional neural networks.

Wells, Jude; Hawkins-Hooker, Alex; Bordin, Nicola; Sillitoe, Ian; Paige, Brooks; Orengo, Christine.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38718225

ABSTRACT

MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.
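
The step from pairwise co-membership probabilities to a domain assignment can be illustrated with a toy sketch. This is not Chainsaw's implementation: the function names, the exhaustive search, and the four-residue probability matrix are all assumptions made for illustration, and a real method would need a far more efficient search than enumerating every labelling.

```python
import math
from itertools import product


def log_likelihood(assignment, prob):
    """Log-likelihood of a domain labelling given pairwise co-membership
    probabilities prob[i][j] (probability that residues i and j share a domain)."""
    eps = 1e-9  # clamp to avoid log(0)
    total = 0.0
    n = len(assignment)
    for i in range(n):
        for j in range(i + 1, n):
            p = min(max(prob[i][j], eps), 1 - eps)
            same = assignment[i] == assignment[j]
            total += math.log(p if same else 1 - p)
    return total


def best_assignment(prob, max_domains=2):
    """Exhaustive search over labellings; only feasible for toy inputs."""
    n = len(prob)
    best, best_ll = None, -math.inf
    for labels in product(range(max_domains), repeat=n):
        ll = log_likelihood(labels, prob)
        if ll > best_ll:
            best, best_ll = labels, ll
    return best


# Hypothetical probabilities: residues 0-1 and 2-3 look like two separate domains.
toy_prob = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
print(best_assignment(toy_prob))
```

On this toy matrix the highest-scoring labelling places the first two residues in one domain and the last two in another.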

Subject(s)

Algorithms , Neural Networks, Computer , Protein Domains , Proteins , Proteins/chemistry , Databases, Protein , Computational Biology/methods , Software , Humans

3.

CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds.

Waman, Vaishali P; Bordin, Nicola; Alcraft, Rachel; Vickerstaff, Robert; Rauer, Clemens; Chan, Qian; Sillitoe, Ian; Yamamori, Hazuki; Orengo, Christine.

J Mol Biol ; : 168551, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38548261

ABSTRACT

CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.

4.

Large-scale clustering of AlphaFold2 3D models shines light on the structure and function of proteins.

Bordin, Nicola; Lau, Andy M; Orengo, Christine.

Mol Cell ; 83(22): 3950-3952, 2023 Nov 16.

Article in English | MEDLINE | ID: mdl-37977115

ABSTRACT

Two recent studies exploited ultra-fast structural aligners and deep-learning approaches to cluster the protein structure space in the AlphaFold Database. Barrio-Hernandez et al. (1) and Durairaj et al. (2) uncovered fascinating new protein functions and structural features previously unknown.

Subject(s)

Cluster Analysis , Databases, Factual

5.

Broad functional profiling of fission yeast proteins using phenomics and machine learning.

Rodríguez-López, María; Bordin, Nicola; Lees, Jon; Scholes, Harry; Hassan, Shaimaa; Saintain, Quentin; Kamrad, Stephan; Orengo, Christine; Bähler, Jürg.

Elife ; 12, 2023 10 03.

Article in English | MEDLINE | ID: mdl-37787768

ABSTRACT

Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe for broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of 'priority unstudied' proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through 'guilt by association' with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.
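
The 'guilt by association' idea behind a phenotype-correlation network can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the gene names, fitness profiles, Pearson correlation as the similarity measure, and the 0.9 threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length fitness profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_edges(profiles, threshold=0.9):
    """Link mutants whose profiles across conditions correlate at or above threshold."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:
                edges.append((a, b, r))
    return edges


# Hypothetical fitness scores of three deletion mutants in four conditions.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_edges(toy_profiles)
print(edges)
```

In such a network, an uncharacterized gene linked to well-studied genes inherits a tentative functional hint from its neighbours.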

Subject(s)

Schizosaccharomyces pombe Proteins , Schizosaccharomyces , Humans , Phenomics , Schizosaccharomyces pombe Proteins/genetics , Phenotype , Schizosaccharomyces/genetics , Machine Learning

6.

KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units.

Adeyelu, Tolulope; Bordin, Nicola; Waman, Vaishali P; Sadlej, Marta; Sillitoe, Ian; Moya-Garcia, Aurelio A; Orengo, Christine A.

Biomolecules ; 13(2)2023 02 02.

Article in English | MEDLINE | ID: mdl-36830646

ABSTRACT

Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.

Subject(s)

Protein Kinases , Proteins , Humans , Protein Kinases/metabolism , Proteins/chemistry , Databases, Protein , Sequence Homology, Amino Acid

7.

The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors.

Varadi, Mihaly; Bordin, Nicola; Orengo, Christine; Velankar, Sameer.

Curr Opin Struct Biol ; 79: 102543, 2023 04.

Article in English | MEDLINE | ID: mdl-36807079

ABSTRACT

The function of proteins can often be inferred from their three-dimensional structures. Experimental structural biologists spent decades studying these structures, but the accelerated pace of protein sequencing continuously increases the gaps between sequences and structures. The early 2020s saw the advent of a new generation of deep learning-based protein structure prediction tools that offer the potential to predict structures based on any number of protein sequences. In this review, we give an overview of the impact of this new generation of structure prediction tools, with examples of the impacted field in the life sciences. We discuss the novel opportunities and new scientific and technical challenges these tools present to the broader scientific community. Finally, we highlight some potential directions for the future of computational protein structure prediction.

Subject(s)

Deep Learning , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence

8.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms.

Bordin, Nicola; Sillitoe, Ian; Nallapareddy, Vamsi; Rauer, Clemens; Lam, Su Datt; Waman, Vaishali P; Sen, Neeladri; Heinzinger, Michael; Littmann, Maria; Kim, Stephanie; Velankar, Sameer; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Commun Biol ; 6(1): 160, 2023 02 08.

Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67%, increase the number of unique 'global' folds by 36%, and will provide valuable insights on structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Subject(s)

Furylfuramide , Proteins , Humans , Databases, Protein , Proteins/chemistry

9.

CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models.

Nallapareddy, Vamsi; Bordin, Nicola; Sillitoe, Ian; Heinzinger, Michael; Littmann, Maria; Waman, Vaishali P; Sen, Neeladri; Rost, Burkhard; Orengo, Christine.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Proteins , Humans , Sequence Homology, Amino Acid , Proteins/chemistry , Databases, Protein

10.

Novel machine learning approaches revolutionize protein knowledge.

Bordin, Nicola; Dallago, Christian; Heinzinger, Michael; Kim, Stephanie; Littmann, Maria; Rauer, Clemens; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Trends Biochem Sci ; 48(4): 345-359, 2023 04.

Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Subject(s)

Machine Learning , Proteins , Proteins/chemistry , Computational Biology/methods , Protein Conformation

11.

Contrastive learning on protein embeddings enlightens midnight zone.

Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard.

NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has a better ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ensured good performance, comparable to or better than that of existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
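
Embedding-based annotation transfer can be pictured as a nearest-neighbour lookup in embedding space. The sketch below is an illustration only and does not reproduce the code in the Rostlab/EAT repository: the function name, the Euclidean distance, the three-dimensional vectors, and the placeholder superfamily labels are all assumptions.

```python
import math


def eat_transfer(query, lookup):
    """Return the annotation of the lookup entry closest to the query
    embedding, together with the distance to that entry.

    lookup is a list of (embedding, annotation) pairs."""
    best_embedding, best_annotation = min(
        lookup, key=lambda item: math.dist(query, item[0])
    )
    return best_annotation, math.dist(query, best_embedding)


# Hypothetical 3-dimensional embeddings with placeholder labels.
toy_lookup = [
    ((0.0, 0.0, 1.0), "superfamily_A"),
    ((1.0, 1.0, 0.0), "superfamily_B"),
]
annotation, distance = eat_transfer((0.9, 1.1, 0.1), toy_lookup)
print(annotation, round(distance, 3))
```

The returned distance can act as a crude confidence proxy: the further the nearest neighbour, the less trustworthy the transferred label.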

12.

Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs.

Sen, Neeladri; Anishchenko, Ivan; Bordin, Nicola; Sillitoe, Ian; Velankar, Sameer; Baker, David; Orengo, Christine.

Brief Bioinform ; 23(4)2022 07 18.

Article in English | MEDLINE | ID: mdl-35641150

ABSTRACT

Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Data Bank. We noticed that the model quality was higher and the root-mean-square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
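
The RMSD used to compare two models of the same domain can be computed as below. This is a minimal sketch that assumes the two coordinate sets are already optimally superposed and list equivalent atoms in the same order; the superposition step (for example, a Kabsch-style fit) is not shown, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized, pre-superposed
    lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical C-alpha coordinates: the second model is shifted by 1 along z.
model_one = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_two = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_one, model_two))
```

A low RMSD between independently predicted models indicates that the two methods agree on the fold, which is the sense in which agreement raises confidence.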

Subject(s)

Mutation, Missense , Proteins , Databases, Protein , Humans , Models, Molecular , Mutation , Proteins/chemistry , Proteins/genetics

13.

Detection and enumeration of Lak megaphages in microbiome samples by endpoint and quantitative PCR.

Crisci, Marco A; Corsini, Paula M; Bordin, Nicola; Chen, Lin-Xing; Banfield, Jillian F; Santini, Joanne M.

STAR Protoc ; 3(1): 101029, 2022 03 18.

Article in English | MEDLINE | ID: mdl-35059650

ABSTRACT

Lak megaphages are prevalent across diverse gut microbiomes and may impact animal and human health through lysis of Prevotella. Given their large genome size (up to 660 kbp), Lak megaphages are difficult to culture, and their identification relies on molecular techniques. Here, we present optimized protocols for identifying Lak phages in various microbiome samples, including procedures for DNA extraction, followed by detection and quantification of genes encoding Lak structural proteins using diagnostic endpoint and SYBR green-based quantitative PCR, respectively. For complete details on the use and execution of this protocol, please refer to Crisci et al. (2021).

Subject(s)

Bacteriophages , Gastrointestinal Microbiome , Microbiota , Animals , Bacteriophages/genetics , Microbiota/genetics , Prevotella/genetics , Real-Time Polymerase Chain Reaction/methods

14.

SARS-CoV-2 structural coverage map reveals viral protein assembly, mimicry, and hijacking mechanisms.

O'Donoghue, Seán I; Schafferhans, Andrea; Sikta, Neblina; Stolte, Christian; Kaur, Sandeep; Ho, Bosco K; Anderson, Stuart; Procter, James B; Dallago, Christian; Bordin, Nicola; Adcock, Matt; Rost, Burkhard.

Mol Syst Biol ; 17(9): e10079, 2021 09.

Article in English | MEDLINE | ID: mdl-34519429

ABSTRACT

We modeled 3D structures of all SARS-CoV-2 proteins, generating 2,060 models that span 69% of the viral proteome and provide details not available elsewhere. We found that ~6% of the proteome mimicked human proteins, while ~7% was implicated in hijacking mechanisms that reverse post-translational modifications, block host translation, and disable host defenses; a further ~29% self-assembled into heteromeric states that provided insight into how the viral replication and translation complex forms. To make these 3D models more accessible, we devised a structural coverage map, a novel visualization method to show what is, and is not, known about the 3D structure of the viral proteome. We integrated the coverage map into an accompanying online resource (https://aquaria.ws/covid) that can be used to find and explore models corresponding to the 79 structural states identified in this work. The resulting Aquaria-COVID resource helps scientists use emerging structural data to understand the mechanisms underlying coronavirus infection and draws attention to the 31% of the viral proteome that remains structurally unknown or dark.

Subject(s)

Angiotensin-Converting Enzyme 2/metabolism , Host-Pathogen Interactions/genetics , Protein Processing, Post-Translational , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/metabolism , Amino Acid Transport Systems, Neutral/chemistry , Amino Acid Transport Systems, Neutral/genetics , Amino Acid Transport Systems, Neutral/metabolism , Angiotensin-Converting Enzyme 2/chemistry , Angiotensin-Converting Enzyme 2/genetics , Binding Sites , COVID-19/genetics , COVID-19/metabolism , COVID-19/virology , Computational Biology/methods , Coronavirus Envelope Proteins/chemistry , Coronavirus Envelope Proteins/genetics , Coronavirus Envelope Proteins/metabolism , Coronavirus Nucleocapsid Proteins/chemistry , Coronavirus Nucleocapsid Proteins/genetics , Coronavirus Nucleocapsid Proteins/metabolism , Humans , Mitochondrial Membrane Transport Proteins/chemistry , Mitochondrial Membrane Transport Proteins/genetics , Mitochondrial Membrane Transport Proteins/metabolism , Mitochondrial Precursor Protein Import Complex Proteins , Models, Molecular , Molecular Mimicry , Neuropilin-1/chemistry , Neuropilin-1/genetics , Neuropilin-1/metabolism , Phosphoproteins/chemistry , Phosphoproteins/genetics , Phosphoproteins/metabolism , Protein Binding , Protein Conformation, alpha-Helical , Protein Conformation, beta-Strand , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Protein Multimerization , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Viral Matrix Proteins/chemistry , Viral Matrix Proteins/genetics , Viral Matrix Proteins/metabolism , Viroporin Proteins/chemistry , Viroporin Proteins/genetics , Viroporin Proteins/metabolism , Virus Replication

15.

Closely related Lak megaphages replicate in the microbiomes of diverse animals.

Crisci, Marco A; Chen, Lin-Xing; Devoto, Audra E; Borges, Adair L; Bordin, Nicola; Sachdeva, Rohan; Tett, Adrian; Sharrar, Allison M; Segata, Nicola; Debenedetti, Francesco; Bailey, Mick; Burt, Rachel; Wood, Rhiannon M; Rowden, Lewis J; Corsini, Paula M; van Winden, Steven; Holmes, Mark A; Lei, Shufei; Banfield, Jillian F; Santini, Joanne M.

iScience ; 24(8): 102875, 2021 Aug 20.

Article in English | MEDLINE | ID: mdl-34386733

ABSTRACT

Lak phages with alternatively coded ~540 kbp genomes were recently reported to replicate in Prevotella in microbiomes of humans that consume a non-Western diet, baboons, and pigs. Here, we explore Lak phage diversity and broader distribution using diagnostic polymerase chain reaction and genome-resolved metagenomics. Lak phages were detected in 13 animal types, including reptiles, and are particularly prevalent in pigs. Tracking Lak through the pig gastrointestinal tract revealed significant enrichment in the hindgut compared to the foregut. We reconstructed 34 new Lak genomes, including six curated complete genomes, all of which are alternatively coded. An anomalously large (~660 kbp) complete genome reconstructed for the most deeply branched Lak from a horse microbiome is also alternatively coded. From the Lak genomes, we identified proteins associated with specific animal species; notably, most have no functional predictions. The presence of closely related Lak phages in diverse animals indicates facile distribution coupled to host-specific adaptation.

16.

Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds.

Bordin, Nicola; Sillitoe, Ian; Lees, Jonathan G; Orengo, Christine.

Front Mol Biosci ; 8: 668184, 2021.

Article in English | MEDLINE | ID: mdl-34041266

ABSTRACT

This article is dedicated to the memory of Cyrus Chothia, who was a leading light in the world of protein structure evolution. His elegant analyses of protein families and their mechanisms of structural and functional evolution provided important evolutionary and biological insights and firmly established the value of structural perspectives. He was a mentor and supervisor to many other leading scientists who continued his quest to characterise structure and function space. He was also a generous and supportive colleague to those applying different approaches. In this article we review some of his accomplishments and the history of protein structure classifications, particularly SCOP and CATH. We also highlight some of the evolutionary insights these two classifications have brought. Finally, we discuss how the expansion and integration of protein sequence data into these structural families helps reveal the dark matter of function space and can inform the emergence of novel functions in Metazoa. Since we cover 25 years of structural classification, it has not been feasible to review all structure based evolutionary studies and hence we focus mainly on those undertaken by the SCOP and CATH groups and their collaborators.

17.

Clustering FunFams using sequence embeddings improves EC purity.

Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard.

Bioinformatics ; 37(20): 3449-3455, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33978744

ABSTRACT

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
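
Density-based clustering of embeddings with outlier detection can be sketched in a few lines. The code below is a compact, pure-Python DBSCAN written for illustration; it is not taken from the FunFamsClustering repository, and the two-dimensional points, `eps`, and `min_samples` values are hypothetical stand-ins for real sequence embeddings and tuned parameters.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks outliers (noise).

    A point is a core point if at least min_samples points (itself included)
    lie within distance eps of it."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


# Hypothetical 2-D "embeddings": two tight groups and one isolated point.
toy_points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
    (10.0, 10.0),
]
labels = dbscan(toy_points, eps=0.5, min_samples=2)
print(labels)
```

Points labelled -1 correspond to the outliers mentioned above; the remaining labels define candidate sub-families.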

18.

CATH: increased structural coverage of functional space.

Sillitoe, Ian; Bordin, Nicola; Dawson, Natalie; Waman, Vaishali P; Ashford, Paul; Scholes, Harry M; Pang, Camilla S M; Woodridge, Laurel; Rauer, Clemens; Sen, Neeladri; Abbasian, Mahnaz; Le Cornu, Sean; Lam, Su Datt; Berka, Karel; Varekova, Ivana Hutarová; Svobodova, Radka; Lees, Jon; Orengo, Christine A.

Nucleic Acids Res ; 49(D1): D266-D273, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33237325

ABSTRACT

CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65 351 fully classified domain structures (+15%), providing 500 238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web-pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH including SCOP, InterPro, Aquaria and 2DProt.

Subject(s)

Computational Biology/statistics & numerical data , Databases, Protein/statistics & numerical data , Protein Domains , Proteins/chemistry , Amino Acid Sequence , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Epidemics , Humans , Internet , Molecular Sequence Annotation , Proteins/genetics , Proteins/metabolism , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/physiology , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Viral Proteins/chemistry , Viral Proteins/genetics , Viral Proteins/metabolism

19.

Pex24 and Pex32 are required to tether peroxisomes to the ER for organelle biogenesis, positioning and segregation in yeast.

Wu, Fei; de Boer, Rinse; Krikken, Arjen M; Aksit, Arman; Bordin, Nicola; Devos, Damien P; van der Klei, Ida J.

J Cell Sci ; 133(16), 2020 08 17.

Article in English | MEDLINE | ID: mdl-32665322

ABSTRACT

The yeast Hansenula polymorpha contains four members of the Pex23 family of peroxins, which characteristically contain a DysF domain. Here we show that all four H. polymorpha Pex23 family proteins localize to the endoplasmic reticulum (ER). Pex24 and Pex32, but not Pex23 and Pex29, predominantly accumulate at peroxisome-ER contacts. Upon deletion of PEX24 or PEX32 - and to a much lesser extent, of PEX23 or PEX29 - peroxisome-ER contacts are lost, concomitant with defects in peroxisomal matrix protein import, membrane growth, and organelle proliferation, positioning and segregation. These defects are suppressed by the introduction of an artificial peroxisome-ER tether, indicating that Pex24 and Pex32 contribute to tethering of peroxisomes to the ER. Accumulation of Pex32 at these contact sites is lost in cells lacking the peroxisomal membrane protein Pex11, in conjunction with disruption of the contacts. This indicates that Pex11 contributes to Pex32-dependent peroxisome-ER contact formation. The absence of Pex32 has no major effect on pre-peroxisomal vesicles that occur in pex3 atg1 deletion cells.

Subject(s)

Peroxisomes , Saccharomyces cerevisiae Proteins , Endoplasmic Reticulum/genetics , Membrane Proteins/genetics , Organelle Biogenesis , Peroxins/genetics , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Saccharomycetales

20.

Hansenula polymorpha Pex37 is a peroxisomal membrane protein required for organelle fission and segregation.

Singh, Ritika; Manivannan, Selvambigai; Krikken, Arjen M; de Boer, Rinse; Bordin, Nicola; Devos, Damien P; van der Klei, Ida J.

FEBS J ; 287(9): 1742-1757, 2020 05.

Article in English | MEDLINE | ID: mdl-31692262

ABSTRACT

Here, we describe a novel peroxin, Pex37, in the yeast Hansenula polymorpha. H. polymorpha Pex37 is a peroxisomal membrane protein, which belongs to a protein family that includes, among others, the Neurospora crassa Woronin body protein Wsc, the human peroxisomal membrane protein PXMP2, the Saccharomyces cerevisiae mitochondrial inner membrane protein Sym1, and its mammalian homologue MPV17. We show that deletion of H. polymorpha PEX37 does not appear to have a significant effect on peroxisome biogenesis or proliferation in cells grown under peroxisome-inducing conditions (methanol). However, the absence of Pex37 results in a reduction in peroxisome numbers and a defect in peroxisome segregation in cells grown under peroxisome-repressing conditions (glucose). Conversely, overproduction of Pex37 in glucose-grown cells results in an increase in peroxisome numbers in conjunction with a decrease in their size. The increase in numbers in PEX37-overexpressing cells depends on the dynamin-related protein Dnm1. Together, our data suggest that Pex37 is involved in peroxisome fission in glucose-grown cells. Introduction of human PXMP2 in H. polymorpha pex37 cells partially restored the peroxisomal phenotype, indicating that PXMP2 represents a functional homologue of Pex37. H. polymorpha pex37 cells did not show aberrant growth on any of the tested carbon and nitrogen sources that are metabolized by peroxisomal enzymes, suggesting that Pex37 may not fulfill an essential function in transport of these substrates, or of compounds required for their metabolism, across the peroxisomal membrane.

Subject(s)

Fungal Proteins/metabolism , Membrane Proteins/metabolism , Organelles/metabolism , Peroxisomes/metabolism , Saccharomycetales/chemistry , Fungal Proteins/chemistry , Membrane Proteins/chemistry , Organelles/chemistry , Peroxisomes/chemistry , Saccharomycetales/cytology , Saccharomycetales/metabolism
