ABSTRACT
Gene fusions are common cancer-causing mutations, but the molecular principles by which fusion protein products affect interaction networks and cause disease are not well understood. Here, we perform an integrative analysis of the structural, interactomic, and regulatory properties of thousands of putative fusion proteins. We demonstrate that genes that form fusions (i.e., parent genes) tend to be highly connected hub genes, whose protein products are enriched in structured and disordered interaction-mediating features. Fusion often results in the loss of these parental features and the depletion of regulatory sites such as post-translational modifications. Fusion products disproportionately connect proteins that did not previously interact in the protein interaction network. In this manner, fusion products can escape cellular regulation and constitutively rewire protein interaction networks. We suggest that the deregulation of central, interaction-prone proteins may represent a widespread mechanism by which fusion proteins alter the topology of cellular signaling pathways and promote cancer.
Subject(s)
Gene Fusion , Neoplasm Proteins/genetics , Neoplasm Proteins/metabolism , Neoplasms/genetics , Neoplasms/metabolism , Protein Interaction Maps , Computational Biology , Databases, Protein , Humans , Protein Interaction Mapping , Protein Processing, Post-Translational , Signal Transduction , Transcription Factors/genetics , Transcription Factors/metabolism , Ubiquitination
ABSTRACT
The alignment between the boundaries of protein domains and the boundaries of exons could provide evidence for the evolution of proteins via domain shuffling, but literature in the field has so far struggled to conclusively show this. Here, on larger data sets than previously possible, we do finally show that this phenomenon is indisputably found widely across the eukaryotic tree. In contrast, the alignment between exons and the boundaries of intrinsically disordered regions of proteins is not a general property of eukaryotes. Most interesting of all is the discovery that domain-exon alignment is much more common in recently evolved protein sequences than older ones.
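One way to picture the domain-exon alignment described above is as a count of domain boundaries that fall close to an exon boundary. The following is a minimal sketch under stated assumptions, not the authors' method: the function name `aligned_fraction`, the residue-coordinate inputs and the `tolerance` window are all hypothetical choices made for illustration.

```python
from bisect import bisect_left


def aligned_fraction(domain_boundaries, exon_boundaries, tolerance=5):
    """Fraction of domain boundaries lying within `tolerance` residues
    of the nearest exon boundary (all positions in protein coordinates).

    The tolerance window is an assumption for illustration only.
    """
    exons = sorted(exon_boundaries)
    if not domain_boundaries or not exons:
        return 0.0
    hits = 0
    for pos in domain_boundaries:
        i = bisect_left(exons, pos)
        # The nearest exon boundary is the one just before or just after pos.
        candidates = exons[max(i - 1, 0): i + 1]
        if any(abs(pos - e) <= tolerance for e in candidates):
            hits += 1
    return hits / len(domain_boundaries)


if __name__ == "__main__":
    # Toy coordinates, invented for the example.
    print(aligned_fraction([48, 125, 160], [50, 120, 200]))
```

A real analysis would also need a null model (for example, shuffled boundary positions) to judge whether the observed fraction exceeds what chance alone would produce.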
Subject(s)
Eukaryotic Cells/metabolism , Exons/genetics , Introns/genetics , Proteins/genetics , Animals , Evolution, Molecular , Genome/genetics , Humans
ABSTRACT
Here, we present a major update to the SUPERFAMILY database and the webserver. We describe the addition of a new SUPERFAMILY 2.0 profile HMM library containing a total of 27 623 HMMs. The database now includes Superfamily domain annotations for millions of protein sequences taken from the Universal Protein Resource Knowledgebase (UniProtKB) and the National Center for Biotechnology Information (NCBI). This addition constitutes about 51 and 45 million distinct protein sequences obtained from UniProtKB and NCBI respectively. Currently, the database contains annotations for 63 244 and 102 151 complete genomes taken from UniProtKB and NCBI respectively. The current sequence collection and genome update is the biggest so far in the history of SUPERFAMILY updates. In order to deal with the massive wealth of information, here we introduce a new SUPERFAMILY 2.0 webserver (http://supfam.org). Currently, the webserver mainly focuses on the search, retrieval and display of Superfamily annotation for the entire sequence and genome collection in the database.
Subject(s)
Databases, Protein , Protein Domains , Proteome/chemistry , Genome , Internet , Markov Chains , Protein Domains/genetics , Sequence Analysis, Protein
ABSTRACT
Of the three classes of enzymes involved in ubiquitination, ubiquitin-conjugating enzymes (E2) have often been incorrectly considered to play merely an auxiliary role in the process, and few E2 enzymes have been investigated in plants. To reveal the role of E2 in plant innate immunity, we identified and cloned 40 tomato genes encoding ubiquitin E2 proteins. Thioester assays indicated that the majority of the genes encode enzymatically active E2. Phylogenetic analysis classified the 40 tomato E2 enzymes into 13 groups, of which members of group III were found to interact and act specifically with AvrPtoB, a Pseudomonas syringae pv tomato effector that uses its ubiquitin ligase (E3) activity to suppress host immunity. Knocking down the expression of group III E2 genes in Nicotiana benthamiana diminished the AvrPtoB-promoted degradation of the Fen kinase and the AvrPtoB suppression of host immunity-associated programmed cell death. Importantly, silencing group III E2 genes also resulted in reduced pattern-triggered immunity (PTI). By contrast, programmed cell death induced by several effector-triggered immunity elicitors was not affected in group III-silenced plants. Functional characterization suggested redundancy among group III members for their role in the suppression of plant immunity by AvrPtoB and in PTI and identified UBIQUITIN-CONJUGATING11 (UBC11), UBC28, UBC29, UBC39, and UBC40 as playing a more significant role in PTI than other group III members. Our work builds a foundation for the further characterization of E2s in plant immunity and reveals that AvrPtoB has evolved a strategy for suppressing host immunity that is difficult for the plant to thwart.
Subject(s)
Plant Immunity/physiology , Plant Proteins/immunology , Solanum lycopersicum/genetics , Ubiquitin-Conjugating Enzymes/immunology , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Cell Death , Gene Silencing , Genome, Plant , Host-Pathogen Interactions/immunology , Solanum lycopersicum/cytology , Solanum lycopersicum/immunology , Solanum lycopersicum/microbiology , Phylogeny , Plant Proteins/genetics , Plant Proteins/metabolism , Plants, Genetically Modified , Protein Serine-Threonine Kinases/genetics , Protein Serine-Threonine Kinases/metabolism , Pseudomonas syringae/pathogenicity , Nicotiana/genetics , Nicotiana/metabolism , Ubiquitin-Conjugating Enzymes/genetics , Ubiquitin-Conjugating Enzymes/metabolism , Ubiquitination
ABSTRACT
The most diverse marine ecosystems, coral reefs, depend upon a functional symbiosis between a cnidarian animal host (the coral) and intracellular photosynthetic dinoflagellate algae. The molecular and cellular mechanisms underlying this endosymbiosis are not well understood, in part because of the difficulties of experimental work with corals. The small sea anemone Aiptasia provides a tractable laboratory model for investigating these mechanisms. Here we report on the assembly and analysis of the Aiptasia genome, which will provide a foundation for future studies and has revealed several features that may be key to understanding the evolution and function of the endosymbiosis. These features include genomic rearrangements and taxonomically restricted genes that may be functionally related to the symbiosis, aspects of host dependence on alga-derived nutrients, a novel and expanded cnidarian-specific family of putative pattern-recognition receptors that might be involved in the animal-algal interactions, and extensive lineage-specific horizontal gene transfer. Extensive integration of genes of prokaryotic origin, including genes for antimicrobial peptides, presumably reflects an intimate association of the animal-algal pair also with its prokaryotic microbiome.
Subject(s)
Anthozoa/physiology , Genome/genetics , Sea Anemones/genetics , Symbiosis/genetics , Animals , Chromosomes/genetics , Evolution, Molecular , Gene Expression Profiling , Gene Transfer, Horizontal/genetics , Genome Size , Microbial Interactions/genetics , Models, Biological , Molecular Sequence Annotation , Phylogeny , Repetitive Sequences, Nucleic Acid/genetics , Synteny/genetics
ABSTRACT
We have discovered that positions of splice junctions in genes are constrained by the tolerance for disorder-promoting amino acids in the translated protein region. It is known that efficient splicing requires nucleotide bias at the splice junction; the preferred usage produces a distribution of amino acids that is disorder-promoting. We observe that efficiency of splicing, as seen in the amino-acid distribution, is not compromised to accommodate globular structure. Thus we infer that it is the positions of splice junctions in the gene that must be under constraint by the local protein environment. Examining exonic splicing enhancers (short DNA motifs) found near the splice junction in the gene reveals that they are more prevalent in exons that encode disordered protein regions than in exons encoding structured regions. Thus we also conclude that local protein features constrain efficient splicing more in structure than in disorder.
Subject(s)
Intrinsically Disordered Proteins/genetics , RNA Splice Sites , Amino Acids/analysis , Animals , Eukaryota/genetics , Exons , Nucleotide Motifs , Nucleotides/analysis
ABSTRACT
We present updates to the SUPERFAMILY 1.75 (http://supfam.org) online resource and protein sequence collection. The hidden Markov model library that provides sequence homology to SCOP structural domains remains unchanged at version 1.75. In the last 4 years SUPERFAMILY has more than doubled its holding of curated complete proteomes over all cellular life, from 1400 proteomes reported previously in 2010 up to 3258 at present. Outside of the main sequence collection, SUPERFAMILY continues to provide domain annotation for sequences provided by other resources such as: UniProt, Ensembl, PDB, much of JGI Phytozome and selected subcollections of NCBI RefSeq. Despite this growth in data volume, SUPERFAMILY now provides users with an expanded and daily updated phylogenetic tree of life (sTOL). This tree is built with genomic-scale domain annotation data as before, but constantly updated when new species are introduced to the sequence library. Our Gene Ontology and other functional and phenotypic annotations previously reported have stood up to critical assessment by the function prediction community. We have now introduced these data in an integrated manner online at the level of an individual sequence, and--in the case of whole genomes--with enrichment analysis against a taxonomically defined background.
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Gene Ontology , Molecular Sequence Annotation , Phylogeny , Proteins/classification , Proteins/genetics , Proteome/chemistry , Sequence Analysis, Protein
ABSTRACT
Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.
Subject(s)
Databases, Protein , Molecular Sequence Annotation , Protein Structure, Tertiary , Algorithms , Genomics , Internet , Models, Molecular , Protein Structure, Tertiary/genetics , Sequence Analysis, Protein
ABSTRACT
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36,766 member database signatures integrated into 26,238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
Subject(s)
Databases, Protein , Proteins/classification , Bacteria/metabolism , Gene Ontology , Protein Structure, Tertiary , Proteins/genetics , Sequence Analysis, Protein , Software
ABSTRACT
Humans are composed of hundreds of cell types. As the genomic DNA of each somatic cell is identical, cell type is determined by what is expressed and when. Until recently, little has been reported about the determinants of human cell identity, particularly from the joint perspective of gene evolution and expression. Here, we chart the evolutionary past of all documented human cell types via the collective histories of proteins, the principal product of gene expression. FANTOM5 data provide cell-type-specific digital expression of human protein-coding genes and the SUPERFAMILY resource is used to provide protein domain annotation. The evolutionary epoch in which each protein was created is inferred by comparison with domain annotation of all other completely sequenced genomes. Studying the distribution across epochs of genes expressed in each cell type reveals insights into human cellular evolution in terms of protein innovation. For each cell type, its history of protein innovation is charted based on the genes it expresses. Combining the histories of all cell types enables us to create a timeline of cell evolution. This timeline identifies the possibility that our common ancestor Coelomata (cavity-forming animals) provided the innovation required for the innate immune system, whereas cells that now form the human brain have followed a trajectory of continually accumulating novel proteins since Opisthokonta (boundary of animals and fungi). We conclude that exaptation of existing domain architectures into new contexts is the dominant source of cell-type-specific domain architectures.
Subject(s)
Evolution, Molecular , Phylogeny , Proteins/chemistry , Proteins/genetics , Eukaryotic Cells , Humans , Immunity, Innate , Protein Structure, Tertiary , Sequence Analysis, Protein , Transcriptome
ABSTRACT
We present the Proteome Quality Index (PQI; http://pqi-list.org), a much-needed resource for users of bacterial and eukaryotic proteomes. Completely sequenced genomes for which there is an available set of protein sequences (the proteome) are given a one- to five-star rating supported by 11 different metrics of quality. The database indexes over 3000 proteomes at the time of writing and is provided via a website for browsing, filtering and downloading. Prior to this work, there was no systematic way to account for the large variability in quality of the thousands of proteomes, and this is likely to have profoundly influenced the outcome of many published studies, in particular large-scale comparative analyses. The lack of a measure of proteome quality is likely due to the difficulty in producing one, a problem that we have approached by integrating multiple metrics. The continued development and improvement of the index will require the contribution of additional metrics by us and by others; the PQI provides a useful point of reference for the scientific community, but it is only the first step towards a 'standard' for the field.
Subject(s)
Databases, Protein , Proteome/standards , Genome , Internet
ABSTRACT
We present the Database of Disordered Protein Prediction (D(2)P(2)), available at http://d2p2.pro (including website source code). A battery of disorder predictors and their variants, VL-XT, VSL2b, PrDOS, PV2, Espritz and IUPred, were run on all protein sequences from 1765 complete proteomes (to be updated as more genomes are completed). Integrated with these results are all of the predicted (mostly structured) SCOP domains using the SUPERFAMILY predictor. These disorder/structure annotations together enable comparison of the disorder predictors with each other and examination of the overlap between disordered predictions and SCOP domains on a large scale. D(2)P(2) will increase our understanding of the interplay between disorder and structure, the genomic distribution of disorder, and its evolutionary history. The parsed data are made available in a unified format for download as flat files or SQL tables either by genome, by predictor, or for the complete set. An interactive website provides a graphical view of each protein annotated with the SCOP domains and disordered regions from all predictors overlaid (or shown as a consensus). There are statistics and tools for browsing and comparing genomes and their disorder within the context of their position on the tree of life.
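The consensus view and the disorder/domain overlap mentioned above can be illustrated with a short sketch. It assumes that each predictor reports disordered regions as 1-based inclusive residue ranges and that consensus means a strict majority of predictors; both are assumptions for illustration, as are the function names and the placeholder predictor names. It does not reproduce the D(2)P(2) data format or its actual consensus rule.

```python
def consensus_disorder(length, predictions, min_agree=None):
    """Per-residue consensus disorder call.

    predictions: dict mapping a predictor name to a list of
    (start, end) ranges, 1-based and inclusive (assumed format).
    Returns a list of booleans, one per residue.
    """
    votes = [0] * length
    for ranges in predictions.values():
        called = set()
        for start, end in ranges:
            called.update(range(max(start, 1), min(end, length) + 1))
        # Each predictor votes at most once per residue.
        for pos in called:
            votes[pos - 1] += 1
    if min_agree is None:
        min_agree = len(predictions) // 2 + 1  # strict majority (assumption)
    return [v >= min_agree for v in votes]


def overlap_with_domains(consensus, domains):
    """Number of consensus-disordered residues that fall inside domains."""
    domain_positions = set()
    for start, end in domains:
        domain_positions.update(range(start, end + 1))
    return sum(
        1 for i, disordered in enumerate(consensus, start=1)
        if disordered and i in domain_positions
    )


if __name__ == "__main__":
    # Placeholder predictor names and toy ranges, invented for the example.
    preds = {
        "predictor_a": [(1, 4)],
        "predictor_b": [(3, 6)],
        "predictor_c": [(4, 8)],
    }
    calls = consensus_disorder(10, preds)
    print(calls)
    print(overlap_with_domains(calls, [(5, 10)]))
```

Lowering `min_agree` to 1 gives the union of all predictors, and raising it to the number of predictors gives their intersection, which is one simple way to compare how much the predictors agree.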
Subject(s)
Databases, Protein , Protein Conformation , Genome , Internet , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein
ABSTRACT
Cohort-wide sequencing studies have revealed that the largest category of variants is those deemed 'rare', even for the subset located in coding regions (99% of known coding variants are seen in less than 1% of the population). Associative methods give some understanding of how rare genetic variants influence disease and organism-level phenotypes. But here we show that additional discoveries can be made through a knowledge-based approach using protein domains and ontologies (function and phenotype) that considers all coding variants regardless of allele frequency. We describe an ab initio, genetics-first method making molecular knowledge-based interpretations for exome-wide non-synonymous variants for phenotypes at the organism and cellular level. By using this reverse approach, we identify plausible genetic causes for developmental disorders that have eluded other established methods and present molecular hypotheses for the causal genetics of 40 phenotypes generated from a direct-to-consumer genotype cohort. This system offers a chance to extract further discovery from genetic data after standard tools have been applied.
Subject(s)
Exome , Genetic Predisposition to Disease , Humans , Phenotype , Genotype , Gene Frequency
ABSTRACT
We have identified that the collagen helix has the potential to be disruptive to analyses of intrinsically disordered proteins. The collagen helix is an extended fibrous structure that is both promiscuous and repetitive. Whilst its sequence is predicted to be disordered, this type of protein structure is not typically considered as intrinsic disorder. Here, we show that collagen-encoding proteins skew the distribution of exon lengths in genes. We find that previous results, demonstrating that exons encoding disordered regions are more likely to be symmetric, are due to the abundance of the collagen helix. Other related results, showing increased levels of alternative splicing in disorder-encoding exons, still hold after considering collagen-containing proteins. Aside from analyses of exons, we find that the set of proteins that contain collagen significantly alters the amino acid composition of regions predicted as disordered. We conclude that research in this area should be conducted in the light of the collagen helix.
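The exon-symmetry comparison described above can be sketched as follows. The sketch assumes the usual definition in which an exon is symmetric when its length in nucleotides is a multiple of three, so that both ends share the same phase. The `Exon` record, its field names and the toy values are hypothetical and invented for illustration; they are not the authors' data or code.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    length_nt: int          # exon length in nucleotides
    encodes_disorder: bool  # exon maps to a predicted disordered region
    in_collagen_gene: bool  # parent protein contains a collagen helix


def symmetric_fraction(exons):
    """Fraction of exons whose length is a multiple of three."""
    exons = list(exons)
    if not exons:
        return 0.0
    return sum(1 for e in exons if e.length_nt % 3 == 0) / len(exons)


def disorder_symmetry(exons, exclude_collagen=False):
    """Symmetric fraction among disorder-encoding exons,
    optionally leaving out exons from collagen-containing proteins."""
    selected = [
        e for e in exons
        if e.encodes_disorder and not (exclude_collagen and e.in_collagen_gene)
    ]
    return symmetric_fraction(selected)


if __name__ == "__main__":
    # Toy values, invented for the example.
    toy = [
        Exon(54, True, True),
        Exon(54, True, True),
        Exon(45, True, True),
        Exon(100, True, False),
        Exon(99, True, False),
        Exon(121, False, False),
    ]
    print(disorder_symmetry(toy))
    print(disorder_symmetry(toy, exclude_collagen=True))
```

Comparing the two numbers shows, on toy data, how a small set of highly regular exons can dominate a statistic computed over all disorder-encoding exons.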
Subject(s)
Alternative Splicing , Collagen/chemistry , Collagen/genetics , Exons , Amino Acid Sequence , Genome, Human , Humans , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/genetics , Protein Conformation , Protein Structure, Secondary
ABSTRACT
To progress our understanding of molecular evolution from a collection of well-studied genes toward the level of the cell, we must consider whole systems. Here, we reveal the evolution of an important intracellular signaling system. The calcium-signaling toolkit is made up of different multidomain proteins that have undergone duplication, recombination, sequence divergence, and selection. The picture of evolution, considering the repertoire of proteins in the toolkit of both extant organisms and ancestors, is radically different from that of other systems. In eukaryotes, the repertoire increased in both abundance and diversity at a far greater rate than general genomic expansion. We describe how calcium-based intracellular signaling evolution differs not only in rate but in nature, and how this correlates with the disparity of plants and animals.
Subject(s)
Calcium Signaling/genetics , Calcium-Binding Proteins/genetics , Evolution, Molecular , Animals , Calcium-Binding Proteins/chemistry , Calcium-Binding Proteins/metabolism , Eukaryota/genetics
ABSTRACT
Transdifferentiation, the process of converting from one cell type to another without going through a pluripotent state, has great promise for regenerative medicine. The identification of key transcription factors for reprogramming is currently limited by the cost of exhaustive experimental testing of plausible sets of factors, an approach that is inefficient and unscalable. Here we present a predictive system (Mogrify) that combines gene expression data with regulatory network information to predict the reprogramming factors necessary to induce cell conversion. We have applied Mogrify to 173 human cell types and 134 tissues, defining an atlas of cellular reprogramming. Mogrify correctly predicts the transcription factors used in known transdifferentiations. Furthermore, we validated two new transdifferentiations predicted by Mogrify. We provide a practical and efficient mechanism for systematically implementing novel cell conversions, facilitating the generalization of reprogramming of human cells. Predictions are made available to help rapidly further the field of cell conversion.
Subject(s)
Cell Differentiation/genetics , Cell Transdifferentiation/genetics , Cellular Reprogramming/genetics , Gene Regulatory Networks , Fibroblasts , Humans , Induced Pluripotent Stem Cells , Regenerative Medicine , Transcription Factors/biosynthesis , Transcription Factors/genetics
ABSTRACT
BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
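The abstract above does not name its assessment metrics. As an illustration of how scored ontology-term predictions can be evaluated, the sketch below computes a protein-centric maximum F-measure, a metric commonly used for this kind of assessment. The function name `f_max`, the input layout and the threshold grid are assumptions for the example, and the sketch ignores propagation of terms up the ontology.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure over score thresholds.

    predictions: dict protein -> dict term -> score in [0, 1]
    truth:       dict protein -> non-empty set of true terms
    Precision is averaged over proteins with at least one prediction
    at the threshold; recall is averaged over all proteins in `truth`.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:
                precisions.append(tp / len(predicted))
            recall_sum += tp / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Toy proteins, terms and scores, invented for the example.
    truth = {"P1": {"a", "b"}, "P2": {"c"}}
    preds = {"P1": {"a": 0.9, "x": 0.4}, "P2": {"c": 0.6, "y": 0.6}}
    print(f_max(preds, truth))
```

Because the best threshold is chosen per method, this metric rewards a well-ranked score list rather than any particular cutoff, which is one reason different metrics can rank the same methods differently.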
Subject(s)
Computational Biology , Proteins/chemistry , Software , Structure-Activity Relationship , Algorithms , Databases, Protein , Gene Ontology , Humans , Molecular Sequence Annotation , Proteins/genetics
ABSTRACT
To help evaluate how protein function impacts genome evolution, we introduce a new concept of 'architecture plasticity potential', the capacity to form distinct domain architectures, either for an individual domain or, more generally, for a set of domains grouped by shared function. We devise a scoring metric to measure the plasticity potential for these domain sets, and evaluate how function has changed over time for different species. Applying this metric to a phylogenetic tree of eukaryotic genomes, we find that the involvement of each function is not random but highly selective. For certain lineages there is strong bias for evolution to involve domains related to certain functions. In general, eukaryotic genomes, particularly those of animals, expand complex functional activities such as signalling and regulation, but at the cost of reducing metabolic processes. We also observe differential evolution of transcriptional regulation and a unique evolutionary role of channel regulators; crucially, this is only observable in terms of the architecture plasticity potential. Our findings provide a new layer of information to understand the significance of function in eukaryotic genome evolution. A web search tool, available at http://supfam.org/Pevo, offers a wide spectrum of options for exploring functional importance in eukaryotic genome evolution.
Subject(s)
Eukaryota/genetics , Evolution, Molecular , Genome , Genomics/methods , Models, Genetic , Proteome/chemistry , Animals , Cell Lineage , Cell Plasticity , Databases, Genetic , Databases, Protein , Eukaryota/cytology , Eukaryota/metabolism , Humans , Internet , Phylogeny , Protein Structure, Tertiary , Proteome/genetics , Proteome/metabolism , Search Engine , Structural Homology, Protein
ABSTRACT
The seven-transmembrane (7TM) helix fold of G-protein coupled receptors (GPCRs) has been adapted for a wide variety of physiologically important signaling functions. Here, we discuss the diversity in the structured and disordered regions of GPCRs based on the recently published crystal structures and sequence analysis of all human GPCRs. A comparison of the structures of rhodopsin-like receptors (class A), secretin-like receptors (class B), metabotropic receptors (class C) and frizzled receptors (class F) shows that the relative arrangement of the transmembrane helices is conserved across all four GPCR classes although individual receptors can be activated by ligand binding at varying positions within and around the transmembrane helical bundle. A systematic analysis of GPCR sequences reveals the presence of disordered segments in the cytoplasmic side, abundant post-translational modification sites, evidence for alternative splicing and several putative linear peptide motifs that have the potential to mediate interactions with cytosolic proteins. While the structured regions permit the receptor to bind diverse ligands, the disordered regions appear to have an underappreciated role in modulating downstream signaling in response to the cellular state. An integrated paradigm combining the knowledge of structured and disordered regions is imperative for gaining a holistic understanding of the GPCR (un)structure-function relationship.
Subject(s)
Receptors, G-Protein-Coupled/chemistry , Animals , Cell Membrane/chemistry , Cell Membrane/metabolism , Humans , Receptors, G-Protein-Coupled/metabolism
ABSTRACT
We report a daily-updated sequenced/species Tree Of Life (sTOL) as a reference for the increasing number of cellular organisms with their genomes sequenced. The sTOL builds on a likelihood-based weight calibration algorithm to consolidate NCBI taxonomy information in concert with unbiased sampling of molecular characters from whole genomes of all sequenced organisms. Via quantifying the extent of agreement between taxonomic and molecular data, we observe there are many potential improvements that can be made to the status quo classification, particularly in the Fungi kingdom; we also see that the current state of many animal genomes is rather poor. To augment the use of sTOL in providing evolutionary contexts, we integrate an ontology infrastructure and demonstrate its utility for evolutionary understanding on: nuclear receptors, stem cells and eukaryotic genomes. The sTOL (http://supfam.org/SUPERFAMILY/sTOL) provides a binary tree of (sequenced) life, and contributes to an analytical platform linking genome evolution, function and phenotype.