ABSTRACT
Proteins and RNA functionally and physically intersect in multiple biological processes; however, no universal method is currently available to purify protein-RNA complexes. Here, we introduce XRNAX, a method for the generic purification of protein-crosslinked RNA, and demonstrate its versatility to study the composition and dynamics of protein-RNA interactions by various transcriptomic and proteomic approaches. We show that XRNAX captures all RNA biotypes and use this to characterize the sub-proteomes that interact with coding and non-coding RNAs (ncRNAs) and to identify hundreds of protein-RNA interfaces. Exploiting the quantitative nature of XRNAX, we observe drastic remodeling of the RNA-bound proteome during arsenite-induced stress, distinct from autophagy-related changes in the total proteome. In addition, we combine XRNAX with crosslinking immunoprecipitation sequencing (CLIP-seq) to validate the interaction of ncRNA with lamin B1 and EXOSC2. Thus, XRNAX is a resourceful approach to study structural and compositional aspects of protein-RNA interactions to address fundamental questions in RNA biology.
Subject(s)
High-Throughput Nucleotide Sequencing/methods , RNA-Binding Proteins/isolation & purification , RNA/isolation & purification , Binding Sites , Exosome Multienzyme Ribonuclease Complex/metabolism , Humans , Immunoprecipitation/methods , Lamin Type B/metabolism , Protein Binding/genetics , Protein Binding/physiology , Protein Biosynthesis/genetics , Protein Biosynthesis/physiology , Post-Translational Protein Processing , Proteins/isolation & purification , Proteins/metabolism , Proteome/metabolism , Proteomics/methods , RNA/genetics , RNA/metabolism , Messenger RNA/metabolism , Untranslated RNA/metabolism , RNA-Binding Proteins/metabolism , Transcriptome
ABSTRACT
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database [1]. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database [2] and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Subject(s)
Protein Databases , Deep Learning , Molecular Sequence Annotation , Protein Folding , Proteins , Protein Structural Homology , Amino Acid Sequence , Internet , Proteins/chemistry , Proteins/classification , Proteins/metabolism
ABSTRACT
Retrons are prokaryotic genetic retroelements encoding a reverse transcriptase that produces multi-copy single-stranded DNA (msDNA) [1]. Despite decades of research on the biosynthesis of msDNA [2], the function and physiological roles of retrons have remained unknown. Here we show that Retron-Sen2 of Salmonella enterica serovar Typhimurium encodes an accessory toxin protein, STM14_4640, which we renamed RcaT. RcaT is neutralized by the reverse transcriptase-msDNA antitoxin complex, and becomes active upon perturbation of msDNA biosynthesis. The reverse transcriptase is required for binding to RcaT, and the msDNA is required for the antitoxin activity. The highly prevalent RcaT-containing retron family constitutes a new type of tripartite DNA-containing toxin-antitoxin system. To understand the physiological roles of such toxin-antitoxin systems, we developed toxin activation-inhibition conjugation (TAC-TIC), a high-throughput reverse genetics approach that identifies the molecular triggers and blockers of toxin-antitoxin systems. By applying TAC-TIC to Retron-Sen2, we identified multiple trigger and blocker proteins of phage origin. We demonstrate that phage-related triggers directly modify the msDNA, thereby activating RcaT and inhibiting bacterial growth. By contrast, prophage proteins circumvent retrons by directly blocking RcaT. Consistently, retron toxin-antitoxin systems act as abortive infection anti-phage defence systems, in line with recent reports [3,4]. Thus, RcaT retrons are tripartite DNA-regulated toxin-antitoxin systems, which use the reverse transcriptase-msDNA complex both as an antitoxin and as a sensor of phage protein activities.
Subject(s)
Antitoxins , Bacteriophages , Retroelements , Salmonella typhimurium , Toxin-Antitoxin Systems , Antitoxins/genetics , Bacteriophages/metabolism , Bacterial DNA/genetics , Single-Stranded DNA/genetics , Nucleic Acid Conformation , Prophages/metabolism , RNA-Directed DNA Polymerase/metabolism , Retroelements/genetics , Salmonella typhimurium/genetics , Salmonella typhimurium/growth & development , Salmonella typhimurium/virology , Toxin-Antitoxin Systems/genetics
ABSTRACT
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure [1]. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold2, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
Subject(s)
Computational Biology/standards , Deep Learning/standards , Molecular Models , Protein Conformation , Proteome/chemistry , Datasets as Topic/standards , Diacylglycerol O-Acyltransferase/chemistry , Glucose-6-Phosphatase/chemistry , Humans , Membrane Proteins/chemistry , Protein Folding , Reproducibility of Results
ABSTRACT
The protein structure prediction problem has been solved for many types of proteins by AlphaFold. Recently, there has been considerable excitement to build off the success of AlphaFold and predict the 3D structures of RNAs. RNA prediction methods use a variety of techniques, from physics-based to machine learning approaches. We believe that there are challenges preventing the successful development of deep learning-based methods like AlphaFold for RNA in the short term. Broadly speaking, the challenges are the limited number of structures and alignments making data-hungry deep learning methods unlikely to succeed. Additionally, there are several issues with the existing structure and sequence data, as they are often of insufficient quality, highly biased and missing key information. Here, we discuss these challenges in detail and suggest some steps to remedy the situation. We believe that it is possible to create an accurate RNA structure prediction method, but it will require solving several data quality and volume issues, usage of data beyond simple sequence alignments, or the development of new less data-hungry machine learning methods.
ABSTRACT
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the status of services that EMBL-EBI data resources provide to scientific communities globally. The scale, openness, rich metadata and extensive curation of EMBL-EBI added-value databases make them particularly well-suited as training sets for deep learning, machine learning and artificial intelligence applications, a selection of which are described here. The data resources at EMBL-EBI can catalyse such developments because they offer sustainable, high-quality data, collected in some cases over decades and made openly available to any researcher, globally. Our aim is for EMBL-EBI data resources to keep providing the foundations for tools and research insights that transform fields across the life sciences.
Subject(s)
Artificial Intelligence , Computational Biology , Data Management , Factual Databases , Genome , Internet
ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Subject(s)
Protein Databases , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software
ABSTRACT
The European Bioinformatics Institute (EMBL-EBI) maintains a comprehensive range of freely available and up-to-date molecular data resources, which includes over 40 resources covering every major data type in the life sciences. This year's service update for EMBL-EBI includes new resources, PGS Catalog and AlphaFold DB, and updates on existing resources, including the COVID-19 Data Platform, trRosetta and RoseTTAfold models introduced in Pfam and InterPro, and the launch of Genome Integrations with Function and Sequence by UniProt and Ensembl. Furthermore, we highlight projects through which EMBL-EBI has contributed to the development of community-driven data standards and guidelines, including the Recommended Metadata for Biological Images (REMBI), and the BioModels Reproducibility Scorecard. Training is one of EMBL-EBI's core missions and a key component of the provision of bioinformatics services to users: this year's update includes many of the improvements that have been made to EMBL-EBI's online training offering.
Subject(s)
Computational Biology/education , Computational Biology/methods , Factual Databases , Academies and Institutes , Artificial Intelligence , COVID-19 , Factual Databases/economics , Factual Databases/statistics & numerical data , Pharmaceutical Databases , Protein Databases , Europe , Human Genome , Humans , Information Storage and Retrieval , Untranslated RNA/genetics , SARS-CoV-2/genetics
ABSTRACT
Changes at the cell surface enable bacteria to survive in dynamic environments, such as diverse niches of the human host. Here, we reveal "Periscope Proteins" as a widespread mechanism of bacterial surface alteration mediated through protein length variation. Tandem arrays of highly similar folded domains can form an elongated rod-like structure; thus, variation in the number of domains determines how far an N-terminal host ligand binding domain projects from the cell surface. Supported by newly available long-read genome sequencing data, we propose that this class could contain over 50 distinct proteins, including those implicated in host colonization and biofilm formation by human pathogens. In large multidomain proteins, sequence divergence between adjacent domains appears to reduce interdomain misfolding. Periscope Proteins break this "rule," suggesting that their length variability plays an important role in regulating bacterial interactions with host surfaces, other bacteria, and the immune system.
Subject(s)
Bacterial Proteins , Membrane Proteins , Streptococcus gordonii , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Membrane Proteins/chemistry , Membrane Proteins/genetics , Membrane Proteins/metabolism , Streptococcus gordonii/chemistry , Streptococcus gordonii/genetics , Streptococcus gordonii/metabolism
ABSTRACT
Tandem Repeat Proteins (TRPs) are a class of proteins with repetitive amino acid sequences that have been studied extensively for over two decades. Different features at the level of sequence, structure, function and evolution have been attributed to them by various authors. Yet many of their salient features appear only when looking at specific subclasses of protein tandem repeats. Here, we attempt to rationalize the existing knowledge on TRPs by pointing out several dichotomies. The emerging picture is more nuanced than generally assumed and allows us to draw some boundaries of what is not a "proper" TRP. We conclude with an operational definition of a specific subset, which we have named STRPs (Structural Tandem Repeat Proteins), which separates a subclass of tandem repeats with distinctive features from several other less well-defined types of repeats. We believe that this definition will help researchers in the field to better characterize the biological meaning of this large yet largely understudied group of proteins.
Subject(s)
Proteins , Tandem Repeat Sequences , Proteins/genetics , Proteins/chemistry , Tandem Repeat Sequences/genetics , Amino Acid Sequence
ABSTRACT
Bacterial fibrillar adhesins are specialized extracellular polypeptides that promote the attachment of bacteria to the surfaces of other cells or materials. Adhesin-mediated interactions are critical for the establishment and persistence of stable bacterial populations within diverse environmental niches and are important determinants of virulence. The fibronectin (Fn)-binding fibrillar adhesin CshA, and its paralogue CshB, play important roles in host colonization by the oral commensal and opportunistic pathogen Streptococcus gordonii. As paralogues are often catalysts for functional diversification, we have probed the early stages of structural and functional divergence in Csh proteins by determining the X-ray crystal structure of the CshB adhesive domain NR2 and characterizing its Fn-binding properties in vitro. Despite sharing a common fold, CshB_NR2 displays an ~1.7-fold reduction in Fn-binding affinity relative to CshA_NR2. This correlates with reduced electrostatic charge in the Fn-binding cleft. Complementary bioinformatic studies reveal that homologues of CshA/B_NR2 domains are widely distributed in both Gram-positive and Gram-negative bacteria, where they are found housed within functionally cryptic multi-domain polypeptides. Our findings are consistent with the classification of Csh adhesins and their relatives as members of the recently defined polymer adhesin domain (PAD) family of bacterial proteins.
Subject(s)
Anti-Bacterial Agents , Membrane Proteins , Ligands , Membrane Proteins/chemistry , Gram-Negative Bacteria/metabolism , Gram-Positive Bacteria/metabolism , Bacterial Adhesins/genetics , Bacterial Adhesins/chemistry , Bacterial Adhesins/metabolism , Bacterial Proteins/chemistry
ABSTRACT
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.
Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Viral Genome , Humans , Pandemics , SARS-CoV-2/genetics
ABSTRACT
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ~45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
Subject(s)
Proteins , Protein Databases , Proteins/genetics , Cluster Analysis , Amino Acid Sequence , Protein Domains
ABSTRACT
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Subject(s)
Computational Biology/statistics & numerical data , Protein Databases , Proteins/metabolism , Proteome/metabolism , Animals , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Epidemics , Humans , Internet , Molecular Models , Tertiary Protein Structure , Proteins/chemistry , Proteins/genetics , Proteome/classification , Proteome/genetics , Amino Acid Repetitive Sequences/genetics , SARS-CoV-2/genetics , SARS-CoV-2/physiology , Protein Sequence Analysis/methods
ABSTRACT
Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.
Subject(s)
Nucleic Acid Databases , Metagenome , MicroRNAs/genetics , Bacterial RNA/genetics , Untranslated RNA/genetics , Viral RNA/genetics , Bacteria/genetics , Bacteria/metabolism , Base Pairing , Base Sequence , Humans , Internet , MicroRNAs/classification , MicroRNAs/metabolism , Molecular Sequence Annotation , Nucleic Acid Conformation , Bacterial RNA/classification , Bacterial RNA/metabolism , Untranslated RNA/classification , Untranslated RNA/metabolism , Viral RNA/classification , Viral RNA/metabolism , Sequence Alignment , RNA Sequence Analysis , Software , Viruses/genetics , Viruses/metabolism
ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Subject(s)
Protein Databases , Proteins/chemistry , Amino Acid Sequence , COVID-19/metabolism , Internet , Molecular Sequence Annotation , Protein Domains , Protein Interaction Maps , SARS-CoV-2/metabolism , Sequence Alignment
ABSTRACT
Fibrillar adhesins are bacterial cell surface proteins that mediate interactions with the environment, including host cells during colonization or other bacteria during biofilm formation. These proteins are characterized by a stalk that projects the adhesive domain closer to the binding target. Fibrillar adhesins evolve quickly and thus can be difficult to computationally identify, yet they represent an important component for understanding bacterium-host interactions. To detect novel fibrillar adhesins, we developed a random forest prediction approach based on common characteristics we identified for this protein class. We applied this approach to Firmicutes and Actinobacteria proteomes, yielding over 6,500 confidently predicted fibrillar adhesins. To verify the approach, we investigated predicted fibrillar adhesins that lacked a known adhesive domain. Based on these proteins, we identified 24 sequence clusters representing potential novel members of adhesive domain families. We used AlphaFold to verify that 15 clusters showed structural similarity to known adhesive domains, such as the TED domain. Overall, our study has made a significant contribution to the number of known fibrillar adhesins and has enabled us to identify novel members of adhesive domain families involved in bacterial pathogenesis. IMPORTANCE Fibrillar adhesins are a class of bacterial cell surface proteins that enable bacteria to interact with their environment. We developed a machine learning approach to identify fibrillar adhesins and applied this classification approach to the Firmicutes and Actinobacteria Reference Proteomes database. This method allowed us to detect a high number of novel fibrillar adhesins and also novel members of adhesive domain families. To confirm our predictions of these potential adhesin protein domains, we predicted their structure using the AlphaFold tool.
Subject(s)
Adhesives , Proteome , Bacterial Adhesins/metabolism , Bacteria/genetics , Bacteria/metabolism , Bacterial Adhesion , Humans , Membrane Proteins/chemistry , Protein Domains
ABSTRACT
Mobile genetic elements (MGEs) sequester and mobilize antibiotic resistance genes across bacterial genomes. Efficient and reliable identification of such elements is necessary to follow resistance spreading. However, automated tools for MGE identification are missing. Tyrosine recombinase (YR) proteins drive MGE mobilization and could provide markers for MGE detection, but they constitute a diverse family also involved in housekeeping functions. Here, we conducted a comprehensive survey of YRs from bacterial, archaeal, and phage genomes and developed a sequence-based classification system that dissects the characteristics of MGE-borne YRs. We revealed that MGE-related YRs evolved from non-mobile YRs by acquisition of a regulatory arm-binding domain that is essential for their mobility function. Based on these results, we further identified numerous unknown MGEs. This work provides a resource for comparative analysis and functional annotation of YRs and aids the development of computational tools for MGE annotation. Additionally, we reveal how YRs adapted to drive gene transfer across species and provide a tool to better characterize antibiotic resistance dissemination.
Subject(s)
Archaea/genetics , Bacteria/genetics , Fungi/genetics , Recombinases/metabolism , Protein Sequence Analysis/methods , Archaea/enzymology , Bacteria/enzymology , Microbial Drug Resistance , Molecular Evolution , Fungi/enzymology , Interspersed Repetitive Sequences , Molecular Sequence Annotation , Systems Biology
ABSTRACT
Streptococcus groups A and B cause serious infections, including early onset sepsis and meningitis in newborns. Rib domain-containing surface proteins are found associated with invasive strains and elicit protective immunity in animal models. Yet, despite their apparent importance in infection, the structure of the Rib domain was previously unknown. Structures of single Rib domains of differing length reveal a rare case of domain atrophy through deletion of 2 core antiparallel strands, resulting in the loss of an entire sheet of the β-sandwich from an immunoglobulin-like fold. Previously, observed variation in the number of Rib domains within these bacterial cell wall-attached proteins has been suggested as a mechanism of immune evasion. Here, the structure of tandem domains, combined with molecular dynamics simulations and small angle X-ray scattering, suggests that variability in Rib domain number would result in differential projection of an N-terminal host-colonization domain from the bacterial surface. The identification of 2 further structures where the typical B-D-E immunoglobulin β-sheet is replaced with an α-helix further confirms the extensive structural malleability of the Rib domain.
ABSTRACT
BACKGROUND: Fibrillar adhesins are long multidomain proteins that form filamentous structures at the cell surface of bacteria. They are an important yet understudied class of proteins composed of adhesive and stalk domains that mediate interactions of bacteria with their environment. This study aims to characterize fibrillar adhesins in a wide range of bacterial phyla and to identify new fibrillar adhesin-like proteins to improve our understanding of host-bacteria interactions. RESULTS: Through careful literature and computational searches, we identified 82 stalk and 27 adhesive domain families in fibrillar adhesins. Based on the presence of these domains in the UniProt Reference Proteomes database, we identified and analysed 3,542 fibrillar adhesin-like proteins across species of the most common bacterial phyla. We further enumerate the adhesive and stalk domain combinations found in nature and demonstrate that fibrillar adhesins have complex and variable domain architectures, which differ across species. By analysing the domain architecture of fibrillar adhesins, we show that in Gram positive bacteria, adhesive domains are mostly positioned at the N-terminus and cell surface anchors at the C-terminus of the protein, while their positions are more variable in Gram negative bacteria. We provide an open repository of fibrillar adhesin-like proteins and domains to enable further studies of this class of bacterial surface proteins. CONCLUSION: This study provides a domain-based characterization of fibrillar adhesins and demonstrates that they are widely found in species across the main bacterial phyla. We have discovered numerous novel fibrillar adhesins and improved our understanding of pathogenic adhesion and invasion mechanisms.