ABSTRACT
Respiratory chain complexes can super-assemble into quaternary structures called supercomplexes that optimize cellular metabolism. The interaction between complexes III (CIII) and IV (CIV) is modulated by supercomplex assembly factor 1 (SCAF1, also known as COX7A2L). The discovery of SCAF1 represented strong genetic evidence that supercomplexes exist in vivo. SCAF1 is present as a long isoform (113 amino acids) or a short isoform (111 amino acids) in different mouse strains. Only the long isoform can induce the super-assembly of CIII and CIV, but it is not clear whether SCAF1 is required for the formation of the respirasome (a supercomplex of CI, CIII2 and CIV). Here we show, by combining deep proteomics and immunodetection analysis, that SCAF1 is always required for the interaction between CIII and CIV and that the respirasome is absent from most tissues of animals containing the short isoform of SCAF1, with the exception of heart and skeletal muscle. We used directed mutagenesis to characterize SCAF1 regions that interact with CIII and CIV and discovered that this interaction requires the correct orientation of a histidine residue at position 73 that is altered in the short isoform of SCAF1, explaining its inability to interact with CIV. Furthermore, we find that the CIV subunit COX7A2 is replaced by SCAF1 in supercomplexes containing CIII and CIV and by COX7A1 in CIV dimers, and that dimers seem to be more stable when they include COX6A2 rather than the COX6A1 isoform.
Subjects
Mitochondrial Membranes/metabolism , Protein Isoforms/metabolism , Animals , Electron Transport Complex IV/chemistry
ABSTRACT
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
ABSTRACT
SUMMARY: Mass spectrometry-based proteomics has had a formidable development in recent years, increasing the amount of data handled and the complexity of the statistical resources needed. Here we present SanXoT, an open-source, standalone software package for the statistical analysis of high-throughput, quantitative proteomics experiments. SanXoT is based on our previously developed weighted spectrum, peptide and protein statistical model and has been specifically designed to be modular, scalable and user-configurable. SanXoT allows limitless workflows that adapt to most experimental setups, including quantitative protein analysis in multiple experiments, systems biology, quantification of post-translational modifications and comparison and merging of experimental data from technical or biological replicates. AVAILABILITY AND IMPLEMENTATION: Download links for the SanXoT Software Package, source code and documentation are available at https://wikis.cnic.es/proteomica/index.php/SSP. CONTACT: jvazquez@cnic.es or ebonzon@cnic.es. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
Subjects
Proteomics , Software , Mass Spectrometry , Peptides , Proteins
ABSTRACT
Rab8 is a small Ras-related GTPase that regulates polarized membrane transport to the plasma membrane. Here, we developed a high-content analysis (HCA) tool to dissect Rab8-mediated actin and focal adhesion reorganization that revealed that Rab8 activation significantly induced Rac1 and Tiam1 to mediate cortical actin polymerization and RhoA-dependent stress fibre disassembly. Rab8 activation increased Rac1 activity, whereas its depletion activated RhoA, which led to reorganization of the actin cytoskeleton. Rab8 was also associated with focal adhesions, promoting their disassembly in a microtubule-dependent manner. This Rab8 effect involved calpain, MT1-MMP (also known as MMP14) and Rho GTPases. Moreover, we demonstrate the role of Rab8 in the cell migration process. Indeed, Rab8 is required for EGF-induced cell polarization and chemotaxis, as well as for the directional persistency of intrinsic cell motility. These data reveal that Rab8 drives cell motility by mechanisms both dependent and independent of Rho GTPases, thereby regulating the establishment of cell polarity, turnover of focal adhesions and actin cytoskeleton rearrangements, thus determining the directionality of cell migration.
Subjects
Calpain/metabolism , Focal Adhesions/metabolism , Guanine Nucleotide Exchange Factors/metabolism , Matrix Metalloproteinase 14/metabolism , rab GTP-Binding Proteins/metabolism , rac1 GTP-Binding Protein/metabolism , rho GTP-Binding Proteins/metabolism , Actin Cytoskeleton/metabolism , Cell Movement , Cell Polarity , HeLa Cells , Humans , Small Interfering RNA/genetics , Stress Fibers/metabolism , T-Lymphoma Invasion and Metastasis-inducing Protein 1 , rab GTP-Binding Proteins/genetics , rhoA GTP-Binding Protein/metabolism
ABSTRACT
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Subjects
Proteins/genetics , Computational Biology , Human Genome , Humans , Open Reading Frames , Peptides/genetics , Proteins/metabolism , Proteomics
ABSTRACT
Alternative splicing of messenger RNA can generate a wide variety of mature RNA transcripts, and these transcripts may produce protein isoforms with diverse cellular functions. While there is much supporting evidence for the expression of alternative transcripts, the same is not true for the alternatively spliced protein products. Large-scale mass spectrometry experiments have identified evidence of alternative splicing at the protein level, but with conflicting results. Here we carried out a rigorous analysis of the peptide evidence from eight large-scale proteomics experiments to assess the scale of alternative splicing that is detectable by high-resolution mass spectrometry. We find fewer splice events than would be expected: we identified peptides for almost 64% of human protein-coding genes, but detected just 282 splice events. These data suggest that most genes have a single dominant isoform at the protein level. Many of the alternative isoforms that we could identify were only subtly different from the main splice isoform. Very few of the splice events identified at the protein level disrupted functional domains, in stark contrast to the two-thirds of splice events annotated in the human genome that would lead to the loss or damage of functional domains. The most striking result was that more than 20% of the splice isoforms we identified were generated by substituting one homologous exon for another. This is significantly more than would be expected from the frequency of these events in the genome. These homologous exon substitution events were remarkably conserved (all the homologous exons we identified evolved over 460 million years ago), and eight of the fourteen tissue-specific splice isoforms we identified were generated from homologous exons. The combination of proteomics evidence, ancient origin and tissue-specific splicing indicates that isoforms generated from homologous exons may have important cellular roles.
Subjects
Alternative Splicing/genetics , Exons/genetics , Protein Isoforms/genetics , Amino Acid Sequence , Animals , Computational Biology , Genetic Databases , Humans , Mice , Molecular Models , Molecular Sequence Data , Organ Specificity/genetics , Peptides/chemistry , Peptides/genetics , Peptides/metabolism , Protein Conformation , Protein Isoforms/chemistry , Protein Isoforms/metabolism , Sequence Alignment , DNA Sequence Analysis
ABSTRACT
Although eukaryotic cells express a wide range of alternatively spliced transcripts, it is not clear whether genes tend to express a range of transcripts simultaneously across cells, or produce dominant isoforms in a manner that is either tissue-specific or regardless of tissue. To date, large-scale investigations into the pattern of transcript expression across distinct tissues have produced contradictory results. Here, we attempt to determine whether genes express a dominant splice variant at the protein level. We interrogate peptides from eight large-scale human proteomics experiments and databases and find that there is a single dominant protein isoform, irrespective of tissue or cell type, for the vast majority of the protein-coding genes in these experiments, in partial agreement with the conclusions from the most recent large-scale RNAseq study. Remarkably, the dominant isoforms from the experimental proteomics analyses coincided overwhelmingly with the reference isoforms selected by two completely orthogonal sources, the consensus coding sequence variants, which are agreed upon by separate manual genome curation teams, and the principal isoforms from the APPRIS database, predicted automatically from the conservation of protein sequence, structure, and function.
Subjects
Open Reading Frames/genetics , Peptides/genetics , Protein Isoforms/genetics , Proteomics/methods , Computational Biology , Protein Databases , Humans
ABSTRACT
Chimeric RNAs comprise exons from two or more different genes and have the potential to encode novel proteins that alter cellular phenotypes. To date, numerous putative chimeric transcripts have been identified among the ESTs isolated from several organisms and using high throughput RNA sequencing. The few corresponding protein products that have been characterized mostly result from chromosomal translocations and are associated with cancer. Here, we systematically establish that some of the putative chimeric transcripts are genuinely expressed in human cells. Using high throughput RNA sequencing, mass spectrometry experimental data, and functional annotation, we studied 7424 putative human chimeric RNAs. We confirmed the expression of 175 chimeric RNAs in 16 human tissues, with an abundance varying from 0.06 to 17 RPKM (Reads Per Kilobase per Million mapped reads). We show that these chimeric RNAs are significantly more tissue-specific than non-chimeric transcripts. Moreover, we present evidence that chimeras tend to incorporate highly expressed genes. Despite the low expression level of most chimeric RNAs, we show that 12 novel chimeras are translated into proteins detectable in multiple shotgun mass spectrometry experiments. Furthermore, we confirm the expression of three novel chimeric proteins using targeted mass spectrometry. Finally, based on our functional annotation of exon organization and preserved domains, we discuss the potential features of chimeric proteins with illustrative examples and suggest that chimeras significantly exploit signal peptides and transmembrane domains, which can alter the cellular localization of cognate proteins. Taken together, these findings establish that some chimeric RNAs are translated into potentially functional proteins in humans.
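The abundance unit used above, RPKM, has a standard definition: the read count is normalized by transcript length in kilobases and by sequencing depth in millions of mapped reads. The following is a minimal sketch of that arithmetic only; the function name and the input numbers are illustrative assumptions and are not taken from the study.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example_value = rpkm(100, 2_000, 10_000_000)
print(example_value)
```

Because both normalizations are linear, doubling either the transcript length or the library size halves the RPKM for the same raw read count.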
Subjects
Human Genome , Mutant Chimeric Proteins/genetics , Protein Biosynthesis , Amino Acid Sequence , Cell Membrane/genetics , Cell Membrane/metabolism , Nucleic Acid Databases , Exons , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Humans , Mass Spectrometry/methods , Molecular Sequence Annotation , Molecular Sequence Data , Mutant Chimeric Proteins/metabolism , Organ Specificity , Protein Sorting Signals , Protein Secondary Structure , Protein Tertiary Structure , Proteomics/methods , Messenger RNA/genetics , Messenger RNA/metabolism , RNA Sequence Analysis/methods , Structure-Activity Relationship
ABSTRACT
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Subjects
Genetic Databases , Human Genome , Genomics/methods , Molecular Sequence Annotation , Animals , Computational Biology/methods , Complementary DNA/chemistry , Complementary DNA/genetics , Molecular Evolution , Exons , Genetic Loci , Humans , Internet , Molecular Models , Open Reading Frames , Pseudogenes , Quality Control , RNA Splice Sites , Long Noncoding RNA , Reproducibility of Results , Untranslated Regions
ABSTRACT
The authors have carried out an investigation of the two "draft maps of the human proteome" published in 2014 in Nature. The findings include an abundance of poor spectra, low-scoring peptide-spectrum matches and incorrectly identified proteins in both these studies, highlighting clear issues with the application of false discovery rates. This noise means that the claims made by the two papers (the identification of high numbers of protein-coding genes, the detection of novel coding regions and the draft tissue maps themselves) should be treated with considerable caution. The authors recommend that clinicians and researchers do not use the unfiltered data from these studies. Despite this, these studies will inspire further investigation into tissue-based proteomics. As long as this future work has proper quality controls, it could help produce a consensus map of the human proteome and improve our understanding of the processes that underlie health and disease.
Subjects
Protein Databases , Proteome/genetics , Humans , Peptides , Proteomics
ABSTRACT
Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.
Subjects
Alternative Splicing , Protein Databases , Molecular Sequence Annotation , Protein Isoforms/chemistry , Protein Isoforms/genetics , Humans , Internet , Protein Isoforms/metabolism
ABSTRACT
This letter analyzes two large-scale proteomics studies published in the same issue of Nature. At the time of the release, both studies were portrayed as draft maps of the human proteome and great advances in the field. As with the initial publication of the human genome, these papers have broad appeal and will no doubt lead to a great deal of further analysis by the scientific community. However, we were intrigued by the number of protein-coding genes detected by the two studies, numbers that far exceeded what has been reported for the multinational Human Proteome Project effort. We carried out a simple quality test on the data using the olfactory receptor family. A high-quality proteomics experiment that does not specifically analyze nasal tissues should not expect to detect many peptides for olfactory receptors. Neither of the studies carried out experiments on nasal tissues, yet we found peptide evidence for more than 100 olfactory receptors in the two studies. These results suggest that the two studies are substantially overestimating the number of protein coding genes they identify. We conclude that the experimental data from these two studies should be used with caution.
Subjects
Protein Databases , Mass Spectrometry , Proteome/analysis , Proteome/chemistry , Proteome/metabolism , Proteomics , Humans
ABSTRACT
Advances in high-throughput mass spectrometry are making proteomics an increasingly important tool in genome annotation projects. Peptides detected in mass spectrometry experiments can be used to validate gene models and verify the translation of putative coding sequences (CDSs). Here, we have identified peptides that cover 35% of the genes annotated by the GENCODE consortium for the human genome as part of a comprehensive analysis of experimental spectra from two large publicly available mass spectrometry databases. We detected the translation to protein of "novel" and "putative" protein-coding transcripts as well as transcripts annotated as pseudogenes and nonsense-mediated decay targets. We provide a detailed overview of the population of alternatively spliced protein isoforms that are detectable by peptide identification methods. We found that 150 genes expressed multiple alternative protein isoforms. This constitutes the largest set of reliably confirmed alternatively spliced proteins yet discovered. Three groups of genes were highly overrepresented. We detected alternative isoforms for 10 of the 25 possible heterogeneous nuclear ribonucleoproteins, proteins with a key role in the splicing process. Alternative isoforms generated from interchangeable homologous exons and from short indels were also significantly enriched, both in human experiments and in parallel analyses of mouse and Drosophila proteomics experiments. Our results show that a surprisingly high proportion (almost 25%) of the detected alternative isoforms are only subtly different from their constitutive counterparts. Many of the alternative splicing events that give rise to these alternative isoforms are conserved in mouse. It was striking that very few of these conserved splicing events broke Pfam functional domains or would damage globular protein structures. 
This evidence of a strong bias toward subtle differences in CDS and likely conserved cellular function and structure is remarkable and strongly suggests that the translation of alternative transcripts may be subject to selective constraints.
Subjects
Alternative Splicing , Proteins/chemistry , Proteins/genetics , Proteomics , Amino Acid Sequence , Animals , Catalytic Domain , Drosophila , Genome , Humans , Mice , Molecular Models , Molecular Sequence Annotation , Molecular Sequence Data , Nonsense-Mediated mRNA Decay , Peptides/chemistry , Peptides/genetics , Proteasome Endopeptidase Complex/chemistry , Protein Biosynthesis , Protein Conformation , Protein Interaction Domains and Motifs , Protein Isoforms , Proteins/metabolism , Sequence Alignment
ABSTRACT
The identification of protein-protein interaction sites is an essential intermediate step for mutant design and the prediction of protein networks. In recent years a significant number of methods have been developed to predict these interface residues and here we review the current status of the field. Progress in this area requires a clear view of the methodology applied, the data sets used for training and testing the systems, and the evaluation procedures. We have analysed the impact of a representative set of features and algorithms and highlighted the problems inherent in generating reliable protein data sets and in the posterior analysis of the results. Although it is clear that there have been some improvements in methods for predicting interacting sites, several major bottlenecks remain. Proteins in complexes are still under-represented in the structural databases and in particular many proteins involved in transient complexes are still to be crystallized. We provide suggestions for effective feature selection, and make it clear that community standards for testing, training and performance measures are necessary for progress in the field.
Subjects
Protein Conformation , Protein Interaction Mapping , Proteins/chemistry , Proteins/metabolism , Algorithms , Binding Sites , Protein Databases , Multiprotein Complexes/chemistry , Multiprotein Complexes/metabolism , Protein Interaction Mapping/methods , Proteins/genetics , Static Electricity , Surface Properties
ABSTRACT
The EcID database (Escherichia coli Interaction Database) provides a framework for the integration of information on functional interactions extracted from the following sources: EcoCyc (metabolic pathways, protein complexes and regulatory information), KEGG (metabolic pathways), MINT and IntAct (protein interactions). It also includes information on protein complexes from the two E. coli high-throughput pull-down experiments and potential interactions extracted from the literature using the web services associated with the iHOP text-mining system. Additionally, EcID incorporates results of various prediction methods, including two protein interaction prediction methods based on genomic information (Phylogenetic Profiles and Gene Neighbourhoods) and three methods based on the analysis of co-evolution (Mirror Tree, In Silico 2 Hybrid and Context Mirror). EcID assigns each prediction a specifically developed confidence score. The two main features that make EcID different from other systems are the combination of co-evolution-based predictions with the experimental data, and the introduction of E. coli-specific information, such as gene regulation information from EcoCyc. The possibilities offered by the combination of the EcID database information are illustrated with a prediction of potential functions for a group of poorly characterized genes related to yeaG. EcID is available online at http://ecid.bioinfo.cnio.es.
Subjects
Protein Databases , Escherichia coli Proteins/metabolism , Escherichia coli/metabolism , Protein Interaction Mapping , Computational Biology , Escherichia coli/genetics , Systems Integration , User-Computer Interface
ABSTRACT
To be successful, CASP experiments require experimentally determined protein structures. These structures form the basis of the experiment. Structural genomics groups have provided the vast majority of these structures in recent editions of CASP. Before the structure prediction assessment can begin, these target structures must be divided into structural domains for assessment purposes, and each assessment unit must be assigned to one or more tertiary structure prediction categories. In CASP8, target domain boundaries were based on visual inspection of targets and their experimental data, and on superpositions of the target structures with related template structures. As in CASP7, target domains were broadly classified into two different categories: "template-based modeling" and "free modeling." Assessment categories were determined by structural similarity between the target domain and the nearest structural templates in the PDB and by whether or not related structural templates were used to build the models. The vast majority of the 164 assessment units in CASP8 were classified as template-based modeling. Just 10 target domains were defined as free modeling. In addition, three targets were assessed in both the free modeling and template-based categories, and a subset of 50 template-based models was evaluated as part of the "high accuracy" subset. The targets submitted for CASP8 confirmed a trend that has been apparent since CASP5: targets submitted to the CASP experiments are becoming easier to predict.
Subjects
Computational Biology/methods , Proteins/chemistry , Proteins/classification , Protein Sequence Analysis/methods , Amino Acid Sequence , Molecular Models , Protein Tertiary Structure
ABSTRACT
Here we detail the assessment process for the binding site prediction category of the eighth Critical Assessment of Protein Structure Prediction experiment (CASP8). Predictions were only evaluated for those targets that bound biologically relevant ligands and were assessed using the Matthews Correlation Coefficient. The results of the analysis clearly demonstrate that three predictors from two groups (Lee and Sternberg) stand out from the rest. A further two groups perform well over subsets of metal binding or nonmetal ligand binding targets. The best methods were able to make consistently reliable predictions based on model structures, though it was noticeable that the two targets that were not well predicted were also the hardest targets. The number of predictors that submitted new methods in this category was highly encouraging and suggests that current technology is at the level that experimental biochemists and structural biologists could benefit from what is clearly a growing field.
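The Matthews Correlation Coefficient named above has a standard closed form over the four confusion-matrix counts. The sketch below applies it to a per-residue binding site prediction; the helper names, the set-based residue representation and the example numbers are assumptions for illustration, and the exact way the CASP8 assessors counted residues is not stated in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard MCC from confusion-matrix counts; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def binding_site_mcc(predicted: set, observed: set, n_residues: int) -> float:
    """MCC for residue-level predictions (hypothetical helper).

    predicted / observed are sets of residue indices labelled as binding;
    every other residue in the chain counts as non-binding.
    """
    tp = len(predicted & observed)
    fp = len(predicted - observed)
    fn = len(observed - predicted)
    tn = n_residues - tp - fp - fn
    return matthews_corrcoef(tp, tn, fp, fn)


# Hypothetical 10-residue chain: two of three predicted residues are correct.
score = binding_site_mcc({1, 2, 3}, {2, 3, 4}, n_residues=10)
print(round(score, 3))
```

MCC ranges from -1 to 1 and stays informative when binding residues are a small minority of the chain, which is one reason it suits this kind of imbalanced classification.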
Subjects
Computational Biology/methods , Proteins/chemistry , Protein Sequence Analysis/methods , Binding Sites , Ligands , Molecular Models , Protein Conformation , Sequence Alignment
ABSTRACT
This article details the assessment process and evaluation results for two categories in the 8th Critical Assessment of Protein Structure Prediction experiment (CASP8). The domain prediction category was evaluated with a range of scores including the Normalized Domain Overlap score and a domain boundary distance measure. Residue-residue contact predictions were evaluated with standard CASP measures, prediction accuracy, and Xd. In the domain boundary prediction category, prediction methods still make reliable predictions for targets that have structural templates, but continue to struggle to make good predictions for the few ab initio targets in CASP. There was little indication of improvement in the domain prediction category. The contact prediction category demonstrated that there was renewed interest among predictors and, despite the small sample size, the results suggested that there had been an increase in prediction accuracy. In contrast to CASP7, contact specialists predicted contacts more accurately than the majority of tertiary structure predictors. Despite this small success, the lack of free modeling targets makes it unlikely that either category will be included in its present form in CASP9.