RESUMEN
SUMMARY: Biological data is ever-increasing in amount and complexity. The mapping of this data to biological entities such as nucleotide and amino acid sequences supports biological data analysis, classification and prediction. Sequence alignments and comparison allow the transfer of knowledge to evolutionary-related entities, the mapping of functional domains, the identification of binding and modification sites. To support these types of studies, we developed ProSeqViewer, a tool to visualize annotation on single sequences and multiple sequence alignments. This state-of-the-art multifunctional library was developed as a modular component to be integrated into static or dynamic web resources and support intuitive visualization of sequence features. ProseSeqViewer is extremely lightweight, fast, interactive, dynamic, responsive and works at any screen size. It generates pure HTML which is compatible with any browser and operating system. ProSeqViewer can exchange events with other visualization components and is already used by multiple biological databases. AVAILABILITY AND IMPLEMENTATION: ProSeqViewer is an open-source TypeScript library compatible with state-of-the-art website environments. The source code and an extensive documentation including use cases are available from the URL: https://github.com/BioComputingUP/ProSeqViewer.
Asunto(s)
Programas Informáticos , Biblioteca de Genes , Alineación de Secuencia , Secuencia de AminoácidosRESUMEN
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Asunto(s)
Biología Computacional/estadística & datos numéricos , Bases de Datos de Proteínas , Proteínas/metabolismo , Proteoma/metabolismo , Animales , COVID-19/epidemiología , COVID-19/prevención & control , COVID-19/virología , Biología Computacional/métodos , Epidemias , Humanos , Internet , Modelos Moleculares , Estructura Terciaria de Proteína , Proteínas/química , Proteínas/genética , Proteoma/clasificación , Proteoma/genética , Secuencias Repetitivas de Aminoácido/genética , SARS-CoV-2/genética , SARS-CoV-2/fisiología , Análisis de Secuencia de Proteína/métodosRESUMEN
The MobiDB database (URL: https://mobidb.org/) provides predictions and annotations for intrinsically disordered proteins. Here, we report recent developments implemented in MobiDB version 4, regarding the database format, with novel types of annotations and an improved update process. The new website includes a re-designed user interface, a more effective search engine and advanced API for programmatic access. The new database schema gives more flexibility for the users, as well as simplifying the maintenance and updates. In addition, the new entry page provides more visualisation tools including customizable feature viewer and graphs of the residue contact maps. MobiDB v4 annotates the binding modes of disordered proteins, whether they undergo disorder-to-order transitions or remain disordered in the bound state. In addition, disordered regions undergoing liquid-liquid phase separation or post-translational modifications are defined. The integrated information is presented in a simplified interface, which enables faster searches and allows large customized datasets to be downloaded in TSV, Fasta or JSON formats. An alternative advanced interface allows users to drill deeper into features of interest. A new statistics page provides information at database and proteome levels. The new MobiDB version presents state-of-the-art knowledge on disordered proteins and improves data accessibility for both computational and experimental users.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Algoritmos , Internet , Anotación de Secuencia Molecular , Procesamiento Proteico-Postraduccional , Programas InformáticosRESUMEN
The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increasing the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new levels (Clan > Family) requiring sequence similarity and describing repeat motifs in collaboration with Pfam. Data growth has been addressed with improved mechanisms for browsing the classification hierarchy. A new UniProt-centric view unifies the increasingly frequent annotation of structures from identical or similar sequences. This update of RepeatsDB aligns with our commitment to develop a resource that extracts, organizes and distributes specialized information on tandem repeat protein structures.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/química , Secuencias Repetitivas de Aminoácido , Secuencias Repetidas en Tándem , Ontología de Genes , Células HEK293 , Células HeLa , Humanos , Reproducibilidad de los Resultados , Estadística como Asunto , Interfaz Usuario-ComputadorRESUMEN
There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, and more generally the overlaps between different properties related to LCRs, using examples. We argue that statistical measures alone cannot capture all structural aspects of LCRs and recommend the combined usage of a variety of predictive tools and measurements. While the methodologies available to study LCRs are already very advanced, we foresee that a more comprehensive annotation of sequences in the databases will enable the improvement of predictions and a better understanding of the evolution and the connection between structure and function of LCRs. This will require the use of standards for the generation and exchange of data describing all aspects of LCRs. SHORT ABSTRACT: There are multiple definitions for low complexity regions (LCRs) in protein sequences. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, plus overlaps between different properties related to LCRs, using examples.
Asunto(s)
Proteínas/química , Algoritmos , Secuencia de Aminoácidos , Bases de Datos de Proteínas , Evolución Molecular , Conformación Proteica , Dominios ProteicosRESUMEN
Low complexity regions (LCRs) in protein sequences are characterized by a less diverse amino acid composition compared to typically observed sequence diversity. Recent studies have shown that LCRs may co-occur with intrinsically disordered regions, are highly conserved in many organisms, and often play important roles in protein functions and in diseases. In previous decades, several methods have been developed to identify regions with LCRs or amino acid bias, but most of them as stand-alone applications and currently there is no web-based tool which allows users to explore LCRs in protein sequences with additional functional annotations. We aim to fill this gap by providing PlaToLoCo - PLAtform of TOols for LOw COmplexity-a meta-server that integrates and collects the output of five different state-of-the-art tools for discovering LCRs and provides functional annotations such as domain detection, transmembrane segment prediction, and calculation of amino acid frequencies. In addition, the union or intersection of the results of the search on a query sequence can be obtained. By developing the PlaToLoCo meta-server, we provide the community with a fast and easily accessible tool for the analysis of LCRs with additional information included to aid the interpretation of the results. The PlaToLoCo platform is available at: http://platoloco.aei.polsl.pl/.
Asunto(s)
Proteínas/química , Programas Informáticos , Aminoácidos/análisis , Gráficos por Computador , Humanos , Proteínas de la Membrana/química , Anotación de Secuencia Molecular , Dominios Proteicos , Análisis de Secuencia de ProteínaRESUMEN
The Database of Protein Disorder (DisProt, URL: https://disprot.org) provides manually curated annotations of intrinsically disordered proteins from the literature. Here we report recent developments with DisProt (version 8), including the doubling of protein entries, a new disorder ontology, improvements of the annotation format and a completely new website. The website includes a redesigned graphical interface, a better search engine, a clearer API for programmatic access and a new annotation interface that integrates text mining technologies. The new entry format provides a greater flexibility, simplifies maintenance and allows the capture of more information from the literature. The new disorder ontology has been formalized and made interoperable by adopting the OWL format, as well as its structure and term definitions have been improved. The new annotation interface has made the curation process faster and more effective. We recently showed that new DisProt annotations can be effectively used to train and validate disorder predictors. We believe the growth of DisProt will accelerate, contributing to the improvement of function and disorder predictors and therefore to illuminate the 'dark' proteome.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Ontologías Biológicas , Curaduría de Datos , Anotación de Secuencia MolecularRESUMEN
SUMMARY: The Feature-Viewer is a lightweight library for the visualization of biological data mapped to a protein or nucleotide sequence. It is designed for ease of use while allowing for a full customization. The library is already used by several biological data resources and allows intuitive visual mapping of a full spectra of sequence features for different usages. AVAILABILITY AND IMPLEMENTATION: The Feature-Viewer is open source, compatible with state-of-the-art development technologies and responsive, also for mobile viewing. Documentation and usage examples are available online.
Asunto(s)
Computadores , Programas InformáticosRESUMEN
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/clasificación , Anotación de Secuencia Molecular , Dominios Proteicos , Proteínas/química , Secuencias Repetitivas de AminoácidoRESUMEN
Tandem Repeat Proteins (TRPs) are ubiquitous in cells and are enriched in eukaryotes. They contributed to the evolution of organism complexity, specializing for functions that require quick adaptability such as immunity-related functions. To investigate the hypothesis of repeat protein evolution through exon duplication and rearrangement, we designed a tool to analyze the relationships between exon/intron patterns and structural symmetries. The tool allows comparison of the structure fragments as defined by exon/intron boundaries from Ensembl against the structural element repetitions from RepeatsDB. The all-against-all pairwise structural alignment between fragments and comparison of the two definitions (structural units and exons) are visualized in a single matrix, the "repeat/exon plot". An analysis of different repeat protein families, including the solenoids Leucine-Rich, Ankyrin, Pumilio, HEAT repeats and the ß propellers Kelch-like, WD40 and RCC1, shows different behaviors, illustrated here through examples. For each example, the analysis of the exon mapping in homologous proteins supports the conservation of their exon patterns. We propose that when a clear-cut relationship between exon and structural boundaries can be identified, it is possible to infer a specific "evolutionary pattern" which may improve TRPs detection and classification.
Asunto(s)
Exones/genética , Proteínas/genética , Secuencias Repetidas en Tándem/genética , Animales , Evolución Molecular , Humanos , Intrones/genéticaRESUMEN
RepeatsDB-lite (http://protein.bio.unipd.it/repeatsdb-lite) is a web server for the prediction of repetitive structural elements and units in tandem repeat (TR) proteins. TRs are a widespread but poorly annotated class of non-globular proteins carrying heterogeneous functions. RepeatsDB-lite extends the prediction to all TR types and strongly improves the performance both in terms of computational time and accuracy over previous methods, with precision above 95% for solenoid structures. The algorithm exploits an improved TR unit library derived from the RepeatsDB database to perform an iterative structural search and assignment. The web interface provides tools for analyzing the evolutionary relationships between units and manually refine the prediction by changing unit positions and protein classification. An all-against-all structure-based sequence similarity matrix is calculated and visualized in real-time for every user edit. Reviewed predictions can be submitted to RepeatsDB for review and inclusion.
Asunto(s)
Algoritmos , Proteínas/química , Secuencias Repetitivas de Aminoácido , Programas Informáticos , Homología Estructural de Proteína , Sitios de Unión , Bases de Datos de Proteínas , Humanos , Internet , Ligandos , Modelos Moleculares , Anotación de Secuencia Molecular , Unión Proteica , Dominios Proteicos , Estructura Secundaria de Proteína , Factores de TiempoRESUMEN
The MobiDB (URL: mobidb.bio.unipd.it) database of protein disorder and mobility annotations has been significantly updated and upgraded since its last major renewal in 2014. Several curated datasets for intrinsic disorder and folding upon binding have been integrated from specialized databases. The indirect evidence has also been expanded to better capture information available in the PDB, such as high temperature residues in X-ray structures and overall conformational diversity. Novel nuclear magnetic resonance chemical shift data provides an additional experimental information layer on conformational dynamics. Predictions have been expanded to provide new types of annotation on backbone rigidity, secondary structure preference and disordered binding regions. MobiDB 3.0 contains information for the complete UniProt protein set and synchronization has been improved by covering all UniParc sequences. An advanced search function allows the creation of a wide array of custom-made datasets for download and further analysis. A large amount of information and cross-links to more specialized databases are intended to make MobiDB the central resource for the scientific community working on protein intrinsic disorder and mobility.
Asunto(s)
Bases de Datos de Proteínas , Proteínas Intrínsecamente Desordenadas/química , Anotación de Secuencia Molecular , Programas Informáticos , Secuencia de Aminoácidos , Sitios de Unión , Conjuntos de Datos como Asunto , Ontología de Genes , Humanos , Internet , Proteínas Intrínsecamente Desordenadas/genética , Proteínas Intrínsecamente Desordenadas/metabolismo , Modelos Moleculares , Unión Proteica , Pliegue de Proteína , Dominios y Motivos de Interacción de Proteínas , Alineación de SecuenciaRESUMEN
Solubility is an important, albeit not well understood, feature determining protein behavior. It is of paramount importance in protein engineering, where similar folded proteins may behave in very different ways in solution. Here we present SODA, a novel method to predict the changes of protein solubility based on several physico-chemical properties of the protein. SODA uses the propensity of the protein sequence to aggregate as well as intrinsic disorder, plus hydrophobicity and secondary structure preferences to estimate changes in solubility. It has been trained and benchmarked on two different datasets. The comparison to other recently published methods shows that SODA has state-of-the-art performance and is particularly well suited to predict mutations decreasing solubility. The method is fast, returning results for single mutations in seconds. A usage example estimating the full repertoire of mutations for a human germline antibody highlights several solubility hotspots on the surface. The web server, complete with RESTful interface and extensive help, can be accessed from URL: http://protein.bio.unipd.it/soda.
Asunto(s)
Mutación , Proteínas/química , Análisis de Secuencia de Proteína/métodos , Programas Informáticos , Algoritmos , Anticuerpos/química , Anticuerpos/genética , Interacciones Hidrofóbicas e Hidrofílicas , Internet , Proteínas Intrínsecamente Desordenadas/química , Agregado de Proteínas , Estructura Secundaria de Proteína , SolubilidadRESUMEN
RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for â¼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality is guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity.
Asunto(s)
Bases de Datos de Proteínas , Secuencias Repetitivas de Aminoácido , Animales , Bases de Datos de Proteínas/estadística & datos numéricos , Humanos , Proteínas/clasificación , Programas Informáticos , Relación Estructura-ActividadRESUMEN
Over the last decade, numerous studies have demonstrated the fundamental importance of tandem repeat (TR) proteins in many biological processes. A plethora of new repeat structures have also been solved. The recently published RepeatsDB provides information on TR proteins. However, a detailed structural characterization of repetitive elements is largely missing, as repeat unit annotation is manually curated and currently covers only 3 % of the bona fide TR proteins. Repeat Protein Unit Predictor (ReUPred) is a novel method for the fast automatic prediction of repeat units and repeat classification using an extensive Structure Repeat Unit Library (SRUL) derived from RepeatsDB. ReUPred uses an iterative structural search against the SRUL to find repetitive units. On a test set of solenoid proteins, ReUPred is able to correctly detect 92 % of the proteins. Unlike previous methods, it is also able to correctly classify solenoid repeats in 89 % of cases. It also outperforms two recent state-of-the-art methods for the repeat unit identification problem. The accurate prediction of repeat units increases the number of annotated repeat units by an order of magnitude compared to the sequence-based Pfam classification. ReUPred is implemented in Python for Linux and freely available from the URL: http://protein.bio.unipd.it/reupred/ .
Asunto(s)
Biblioteca de Péptidos , Lenguajes de Programación , Secuencias Repetitivas de Aminoácido/genética , Análisis de Secuencia de Proteína/métodosRESUMEN
Tandem repeats (TR) in proteins are common in nature and have several unique functions. They come in various forms that are frequently difficult to recognize from a sequence. A previously proposed structural classification has been recently implemented in the RepeatsDB database. This defines five main classes, mainly based on repeat unit length, with subclasses representing specific folds. Sequence-based classifications, such as Pfam, provide an alternative classification based on evolutionarily conserved repeat families. Here, we discuss a detailed comparison between the structural classes in RepeatsDB and the corresponding Pfam repeat families and clans. Most instances are found to map one-to-one between structure and sequence. Some notable exceptions such as leucine-rich repeats (LRRs) and α-solenoids are discussed.
Asunto(s)
Modelos Moleculares , Conformación Proteica , Proteómica/métodos , Secuencias Repetitivas de Aminoácido , Secuencias Repetidas en Tándem , Animales , Secuencia Conservada , Bases de Datos de Proteínas , Humanos , Pliegue de Proteína , Estructura Secundaria de ProteínaRESUMEN
Science, technology, engineering, mathematics, and medicine (STEMM) fields change rapidly and are increasingly interdisciplinary. Commonly, STEMM practitioners use short-format training (SFT) such as workshops and short courses for upskilling and reskilling, but unaddressed challenges limit SFT's effectiveness and inclusiveness. Education researchers, students in SFT courses, and organizations have called for research and strategies that can strengthen SFT in terms of effectiveness, inclusiveness, and accessibility across multiple dimensions. This paper describes the project that resulted in a consensus set of 14 actionable recommendations to systematically strengthen SFT. A diverse international group of 30 experts in education, accessibility, and life sciences came together from 10 countries to develop recommendations that can help strengthen SFT globally. Participants, including representation from some of the largest life science training programs globally, assembled findings in the educational sciences and encompassed the experiences of several of the largest life science SFT programs. The 14 recommendations were derived through a Delphi method, where consensus was achieved in real time as the group completed a series of meetings and tasks designed to elicit specific recommendations. Recommendations cover the breadth of SFT contexts and stakeholder groups and include actions for instructors (e.g., make equity and inclusion an ethical obligation), programs (e.g., centralize infrastructure for assessment and evaluation), as well as organizations and funders (e.g., professionalize training SFT instructors; deploy SFT to counter inequity). Recommendations are aligned with a purpose-built framework-"The Bicycle Principles"-that prioritizes evidenced-based teaching, inclusiveness, and equity, as well as the ability to scale, share, and sustain SFT. We also describe how the Bicycle Principles and recommendations are consistent with educational change theories and can overcome systemic barriers to delivering consistently effective, inclusive, and career-spanning SFT.
Asunto(s)
Estudiantes , Tecnología , Humanos , Consenso , IngenieríaRESUMEN
Prion-like behavior has been in the spotlight since it was first associated with the onset of mammalian neurodegenerative diseases. However, a growing body of evidence suggests that this mechanism could be behind the regulation of processes such as transcription and translation in multiple species. Here, we perform a stringent computational survey to identify prion-like proteins in the human proteome. We detected 242 candidate polypeptides and computationally assessed their function, protein-protein interaction networks, tissular expression, and their link to disease. Human prion-like proteins constitute a subset of modular polypeptides broadly expressed across different cell types and tissues, significantly associated with disease, embedded in highly connected interaction networks, and involved in the flow of genetic information in the cell. Our analysis suggests that these proteins might play a relevant role not only in neurological disorders, but also in different types of cancer and viral infections.
RESUMEN
Regional Student Groups (RSGs) of the International Society for Computational Biology Student Council (ISCB-SC) have been instrumental to connect computational biologists globally and to create more awareness about bioinformatics education. This article highlights the initiatives carried out by the RSGs both nationally and internationally to strengthen the present and future of the bioinformatics community. Moreover, we discuss the future directions the organization will take and the challenges to advance further in the ISCB-SC main mission: "Nurture the new generation of computational biologists".