ABSTRACT
There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, and more generally the overlaps between different properties related to LCRs, using examples. We argue that statistical measures alone cannot capture all structural aspects of LCRs and recommend the combined usage of a variety of predictive tools and measurements. While the methodologies available to study LCRs are already very advanced, we foresee that a more comprehensive annotation of sequences in the databases will enable the improvement of predictions and a better understanding of the evolution and the connection between structure and function of LCRs. This will require the use of standards for the generation and exchange of data describing all aspects of LCRs.

SHORT ABSTRACT: There are multiple definitions for low complexity regions (LCRs) in protein sequences. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, plus overlaps between different properties related to LCRs, using examples.
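As an illustration of the kind of statistical measure the review refers to, the following is a minimal sketch of a sliding-window Shannon entropy over amino acid composition. It is not the method of any specific tool discussed in the review; the function names, the window length of 12 and the entropy threshold of 2.2 bits are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return (start, entropy) for every window whose entropy is at or
    below the threshold. Window and threshold are illustrative assumptions."""
    hits = []
    for start in range(len(sequence) - window + 1):
        h = shannon_entropy(sequence[start:start + window])
        if h <= threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence with a glutamine-rich stretch in the middle.
    seq = "MKTAYIAKQRQQQQQQQQQQQQLSDERTVWG"
    for start, h in low_complexity_windows(seq):
        print(start, seq[start:start + 12], round(h, 2))
```

A measure like this captures composition bias only. In line with the abstract's argument, it cannot tell whether a flagged window is disordered or forms an ordered repeat, so it would normally be combined with other predictors.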
Subject(s)
Proteins/chemistry, Algorithms, Amino Acid Sequence, Protein Databases, Molecular Evolution, Protein Conformation, Protein Domains
ABSTRACT
The Database of Protein Disorder (DisProt, URL: https://disprot.org) provides manually curated annotations of intrinsically disordered proteins from the literature. Here we report recent developments with DisProt (version 8), including the doubling of protein entries, a new disorder ontology, improvements of the annotation format and a completely new website. The website includes a redesigned graphical interface, a better search engine, a clearer API for programmatic access and a new annotation interface that integrates text mining technologies. The new entry format provides greater flexibility, simplifies maintenance and allows the capture of more information from the literature. The new disorder ontology has been formalized and made interoperable by adopting the OWL format, and its structure and term definitions have been improved. The new annotation interface has made the curation process faster and more effective. We recently showed that new DisProt annotations can be effectively used to train and validate disorder predictors. We believe the growth of DisProt will accelerate, contributing to the improvement of function and disorder predictors and thereby helping to illuminate the 'dark' proteome.
Subject(s)
Protein Databases, Intrinsically Disordered Proteins/chemistry, Biological Ontologies, Data Curation, Molecular Sequence Annotation
ABSTRACT
Protein-protein interactions (PPIs) formed between short linear motifs and globular domains play important roles in many regulatory and signaling processes but are highly underrepresented in current protein-protein interaction databases. These types of interactions are usually characterized by a specific binding motif that captures the key amino acids shared among the interaction partners. However, the computational proteome-level identification of interaction partners based on the known motif is hindered by the huge number of randomly occurring matches from which biologically relevant motif hits need to be extracted. In this work, we established a novel bioinformatic filtering protocol to efficiently explore the interaction network of a hub protein. We introduced a novel measure that enabled the optimization of the elements and parameter settings of the pipeline, which was built from multiple sequence-based prediction methods. In addition, data collected from PPI databases and evolutionary analyses were also incorporated to further increase the biological relevance of the identified motif hits. The approach was applied to the dynein light chain LC8, a ubiquitous eukaryotic hub protein that has been suggested to be involved in motor-related functions as well as promoting the dimerization of various proteins by recognizing linear motifs in its partners. From the list of putative binding motifs collected by our protocol, several novel peptides were experimentally verified to bind LC8. Altogether 71 potential new motif instances were identified. The expanded list of LC8 binding partners revealed the evolutionary plasticity of binding partners despite the highly conserved binding interface. In addition, it also highlighted a novel, conserved function of LC8 in the upstream regulation of the Hippo signaling pathway.
Beyond the LC8 system, our work also provides general guidelines that can be applied to explore the interaction network of other linear motif binding proteins or protein domains.
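The first step of a motif-based search of this kind, scanning a set of sequences for all matches to a known motif, can be sketched as below. This is only an illustration of the raw matching step that produces the many candidate hits the abstract mentions, not the authors' filtering protocol. The function name, the example sequences and the pattern `[KR].TQT` are hypothetical placeholders and should not be read as the motif definition used in the study.

```python
import re


def find_motif_hits(proteins, pattern):
    """Scan each sequence for every (possibly overlapping) match to a
    regular-expression motif and return (protein, start, matched_text)."""
    # A lookahead wrapper lets the regex report overlapping matches.
    regex = re.compile(f"(?=({pattern}))")
    hits = []
    for name, seq in proteins.items():
        for m in regex.finditer(seq):
            hits.append((name, m.start(), m.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical toy "proteome"; real input would come from a FASTA file.
    proteome = {
        "protA": "MSSAAKATQTDDLE",
        "protB": "MGGGSGGGSGGG",
    }
    # Placeholder motif for illustration only (an assumption, not from the source).
    for hit in find_motif_hits(proteome, "[KR].TQT"):
        print(hit)
```

Every hit returned by such a scan is only a candidate. As the abstract describes, further sequence-based predictions, PPI database evidence and evolutionary analyses are needed to separate biologically relevant instances from random matches.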
Subject(s)
Cytoplasmic Dyneins/chemistry, Cytoplasmic Dyneins/metabolism, Protein Serine-Threonine Kinases/chemistry, Protein Serine-Threonine Kinases/metabolism, Computational Biology, Conserved Sequence, Cytoplasmic Dyneins/genetics, Protein Databases/statistics & numerical data, Molecular Evolution, Hippo Signaling Pathway, Humans, Phylogeny, Protein Binding, Protein Interaction Domains and Motifs, Protein Interaction Maps, Protein Serine-Threonine Kinases/genetics, Signal Transduction
ABSTRACT
Many proteins contain intrinsically disordered regions (IDRs) which carry out important functions without relying on a single well-defined conformation. IDRs are increasingly recognized as critical elements of regulatory networks and have also been associated with cancer. However, it is unknown whether mutations targeting IDRs represent a distinct class of driver events associated with specific molecular and system-level properties, cancer types and treatment options. Here, we used an integrative computational approach to explore the direct role of intrinsically disordered protein regions in driving cancer. We showed that around 20% of cancer drivers are primarily targeted through a disordered region. These IDRs can function in multiple ways which are distinct from the functional mechanisms of ordered drivers. Disordered drivers play a central role in context-dependent interaction networks and are enriched in specific biological processes such as transcription, gene expression regulation and protein degradation. Furthermore, their modulation represents an alternative mechanism for the emergence of all known cancer hallmarks. Importantly, in certain cancer patients, mutations of disordered drivers represent key driving events. However, treatment options for such patients are currently severely limited. The presented study highlights a largely overlooked class of cancer drivers associated with specific cancer types that need novel therapeutic options.