Results 1 - 20 of 60
1.
Nucleic Acids Res ; 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38953162

ABSTRACT

Ribosome profiling experiments support the translation of a range of novel human open reading frames. By contrast, most peptides from large-scale proteomics experiments derive from just one source, 5' untranslated regions. Across the human genome we find evidence for 192 translated upstream regions, most of which would produce protein isoforms with extended N-terminal ends. Almost all of these N-terminal extensions are from highly abundant genes, which suggests that the novel regions we detect are just the tip of the iceberg. These upstream regions have characteristics that are not typical of coding exons. Their GC-content is remarkably high, even higher than 5' regions in other genes, and a large majority have non-canonical start codons. Although some novel upstream regions have cross-species conservation - five have orthologues in invertebrates for example - the reading frames of two thirds are not conserved beyond simians. These non-conserved regions also have no evidence of purifying selection, which suggests that much of this translation is not functional. In addition, non-conserved upstream regions have significantly more peptides in cancer cell lines than would be expected, a strong indication that an aberrant or noisy translation initiation process may play an important role in translation from upstream regions.

2.
Nucleic Acids Res ; 50(D1): D54-D59, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34755885

ABSTRACT

APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.


Subject(s)
Databases, Protein; Protein Isoforms/genetics; Proteins/genetics; Proteomics; Animals; Cattle; Chickens/genetics; Humans; Protein Conformation; Protein Isoforms/classification; Proteins/chemistry; Proteins/classification
3.
Bioinformatics ; 38(Suppl_2): ii89-ii94, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124785

ABSTRACT

MOTIVATION: Selecting the splice variant that best represents a coding gene is a crucial first step in many experimental analyses, and vital for mapping clinically relevant variants. This study compares the longest isoforms, MANE Select transcripts, APPRIS principal isoforms, and expression data, and aims to determine which method is best for selecting biologically important reference splice variants for large-scale analyses. RESULTS: Proteomics analyses and human genetic variation data suggest that most coding genes have a single main protein isoform. We show that APPRIS principal isoforms and MANE Select transcripts best describe these main cellular isoforms, and find that using the longest splice variant as the representative is a poor strategy. Exons unique to the longest splice isoforms are not under selective pressure, and so are unlikely to be functionally relevant. Expression data are also a poor means of selecting the main splice variant. APPRIS principal and MANE Select exons are under purifying selection, while exons specific to alternative transcripts are not. There are MANE and APPRIS representatives for almost 95% of genes, and where they agree they are particularly effective, coinciding with the main proteomics isoform for over 98.2% of genes. AVAILABILITY AND IMPLEMENTATION: APPRIS principal isoforms for human, mouse and other model species can be downloaded from the APPRIS database (https://appris.bioinfo.cnio.es), GENCODE genes (https://www.gencodegenes.org/) and the Ensembl website (https://www.ensembl.org). MANE Select transcripts for the human reference set are available from the Ensembl, GENCODE and RefSeq databases (https://www.ncbi.nlm.nih.gov/refseq/). Lists of splice variants where MANE and APPRIS coincide are available from the APPRIS database. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteomics; Animals; Exons; Humans; Mice; Mutation; Protein Isoforms/genetics; Protein Isoforms/metabolism
4.
Nucleic Acids Res ; 49(14): 8232-8246, 2021 08 20.
Article in English | MEDLINE | ID: mdl-34302486

ABSTRACT

Most coding genes in the human genome are annotated with multiple alternative transcripts. However, clear evidence for the functional relevance of the protein isoforms produced by these alternative transcripts is often hard to find. Alternative isoforms generated from tandem exon duplication-derived substitutions are an exception. These splice events are rare, but have important functional consequences. Here, we have catalogued the 236 tandem exon duplication-derived substitutions annotated in the GENCODE human reference set. We find that more than 90% of the events have a last common ancestor in teleost fish, so are at least 425 million years old, and twenty-one can be traced back to the Bilateria clade. Alternative isoforms generated from tandem exon duplication-derived substitutions also have significantly more clinical impact than other alternative isoforms. Tandem exon duplication-derived substitutions have >25 times as many pathogenic and likely pathogenic mutations as other alternative events. Tandem exon duplication-derived substitutions appear to have vital functional roles in the cell and may have played a prominent part in metazoan evolution.


Subject(s)
Evolution, Molecular; Fishes/genetics; Genome, Human/genetics; Protein Isoforms/genetics; Alternative Splicing/genetics; Animals; Exons/genetics; Gene Duplication/genetics; Humans; Molecular Sequence Annotation; Sequence Alignment
5.
Genet Med ; 24(11): 2351-2366, 2022 11.
Article in English | MEDLINE | ID: mdl-36083290

ABSTRACT

PURPOSE: Germline loss-of-function variants in CTNNB1 cause neurodevelopmental disorder with spastic diplegia and visual defects (NEDSDV; OMIM 615075) and are the most frequent, recurrent monogenic cause of cerebral palsy (CP). We investigated the range of clinical phenotypes owing to disruptions of CTNNB1 to determine the association between NEDSDV and CP. METHODS: Genetic information from 404 individuals with collectively 392 pathogenic CTNNB1 variants was ascertained for the study. From these, detailed phenotypes for 52 previously unpublished individuals were collected and combined with 68 previously published individuals with comparable clinical information. The functional effects of selected CTNNB1 missense variants were assessed using the TOPFlash assay. RESULTS: The phenotypes associated with pathogenic CTNNB1 variants were similar. A diagnosis of CP was not significantly associated with any set of traits that defined a specific phenotypic subgroup, indicating that CP is not additional to NEDSDV. Two CTNNB1 missense variants were dominant negative regulators of WNT signaling, highlighting the utility of the TOPFlash assay to functionally assess variants. CONCLUSION: NEDSDV is a clinically homogeneous disorder irrespective of initial clinical diagnoses, including CP, or entry points for genetic testing.


Subject(s)
Intellectual Disability; Neurodevelopmental Disorders; Humans; Phenotype; Neurodevelopmental Disorders/genetics; Wnt Signaling Pathway/genetics; Intellectual Disability/genetics; Genomics; beta Catenin/genetics
6.
Trends Biochem Sci ; 42(2): 98-110, 2017 02.
Article in English | MEDLINE | ID: mdl-27712956

ABSTRACT

Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.


Subject(s)
Alternative Splicing/genetics; Proteome/genetics; Exons; Protein Isoforms/genetics; Proteomics
7.
Genome Res ; 2018 Feb 09.
Article in English | MEDLINE | ID: mdl-29440222

ABSTRACT

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.

8.
PLoS Comput Biol ; 16(10): e1008287, 2020 10.
Article in English | MEDLINE | ID: mdl-33017396

ABSTRACT

The role of alternative splicing is one of the great unanswered questions in cellular biology. There is strong evidence for alternative splicing at the transcript level, and transcriptomics experiments show that many splice events are tissue specific. It has been suggested that alternative splicing evolved in order to remodel tissue-specific protein-protein networks. Here we investigated the evidence for tissue-specific splicing among splice isoforms detected in a large-scale proteomics analysis. Although the data supporting alternative splicing is limited at the protein level, clear patterns emerged among the small numbers of alternative splice events that we could detect in the proteomics data. More than a third of these splice events were tissue-specific and most were ancient: over 95% of splice events that were tissue-specific in both proteomics and RNAseq analyses evolved prior to the ancestors of lobe-finned fish, at least 400 million years ago. By way of contrast, three in four alternative exons in the human gene set arose in the primate lineage, so our results cannot be extrapolated to the whole genome. Tissue-specific alternative protein forms in the proteomics analysis were particularly abundant in nervous and muscle tissues and their genes had roles related to the cytoskeleton and either the structure of muscle fibres or cell-cell connections. Our results suggest that this conserved tissue-specific alternative splicing may have played a role in the development of the vertebrate brain and heart.


Subject(s)
Alternative Splicing/genetics; Organ Specificity/genetics; Protein Isoforms; Animals; Computational Biology; Genome/genetics; Humans; Protein Isoforms/chemistry; Protein Isoforms/classification; Protein Isoforms/genetics; Proteomics
9.
Nucleic Acids Res ; 46(D1): D213-D217, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29069475

ABSTRACT

The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the 'principal' isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein-coding genes, and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants.


Subject(s)
Databases, Genetic; Protein Isoforms/genetics; Alternative Splicing; Amino Acid Sequence; Animals; Humans; Models, Molecular; Molecular Sequence Annotation; Protein Conformation; Protein Isoforms/chemistry; Proteome/genetics; Reproducibility of Results; Sequence Alignment
10.
Nucleic Acids Res ; 46(14): 7070-7084, 2018 08 21.
Article in English | MEDLINE | ID: mdl-29982784

ABSTRACT

Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.


Subject(s)
Genes; Antibodies; DNA Copy Number Variations; Genetic Variation; Genome, Human; Humans; Molecular Sequence Annotation; Proteins/genetics; Proteins/immunology; Proteins/metabolism; Pseudogenes
11.
Nucleic Acids Res ; 43(W1): W455-9, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25990727

ABSTRACT

This paper introduces the APPRIS WebServer (http://appris.bioinfo.cnio.es) and WebServices (http://apprisws.bioinfo.cnio.es). Both the web server and the web services are based around the APPRIS Database, a database that presently houses annotations of splice isoforms for five different vertebrate genomes. The APPRIS WebServer and WebServices provide access to the computational methods implemented in the APPRIS Database, while the APPRIS WebServices also allow retrieval of the annotations. The APPRIS WebServer and WebServices annotate splice isoforms with protein structural and functional features, and with data from cross-species alignments. In addition, they can use the annotations of structure, function and conservation to select a single reference isoform for each protein-coding gene (the principal protein isoform). APPRIS principal isoforms have been shown to agree overwhelmingly with the main protein isoform detected in proteomics experiments. The APPRIS WebServer allows for the annotation of splice isoforms for individual genes, and provides a range of visual representations and tools to allow researchers to identify the likely effect of splicing events. The APPRIS WebServices permit users to generate annotations automatically in high-throughput mode and to interrogate the annotations in the APPRIS Database. The APPRIS WebServices have been implemented using REST architecture to be flexible, modular and automatic.


Subject(s)
Molecular Sequence Annotation; Protein Isoforms/genetics; Software; Alternative Splicing; Animals; Cats; Cattle; Dogs; Humans; Internet; Mice; Protein Isoforms/chemistry; Protein Isoforms/metabolism; Rats
13.
Hum Mol Genet ; 23(22): 5866-78, 2014 Nov 15.
Article in English | MEDLINE | ID: mdl-24939910

ABSTRACT

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.


Subject(s)
Proteins/genetics; Computational Biology; Genome, Human; Humans; Open Reading Frames; Peptides/genetics; Proteins/metabolism; Proteomics
15.
Bioinformatics ; 31(14): 2257-61, 2015 Jul 15.
Article in English | MEDLINE | ID: mdl-25735770

ABSTRACT

Transposable elements constitute a large fraction of vertebrate genomes and, during evolution, may be co-opted for new functions. Exonization of transposable elements inserted within or close to host genes is one possible way to generate new genes, and alternative splicing of the new exons may represent an intermediate step in this process. The genes TMPO and ZNF451 are present in all vertebrate lineages. Although they are not evolutionarily related, mammalian TMPO and ZNF451 do have something in common: they both code for splice isoforms that contain LAP2alpha domains. We found that these LAP2alpha domains have sequence similarity to repetitive sequences in non-mammalian genomes, which are in turn related to the first ORF from a DIRS1-like retrotransposon. This retrotransposon domestication happened separately and resulted in proteins that combine retrotransposon and host protein domains. The alternative splicing of the retrotransposed sequence allowed the production of both the new and the untouched original isoforms, which may have contributed to the success of the colonization process. The LAP2alpha-specific isoform of TMPO (LAP2α) has been co-opted for important roles in the cell, whereas the ZNF451 LAP2alpha isoform is evolving under strong purifying selection but remains uncharacterized.


Subject(s)
Alternative Splicing; DNA-Binding Proteins/genetics; Mammals/genetics; Membrane Proteins/genetics; Retroelements; Transcription Factors/genetics; Aminoacyltransferases; Animals; Evolution, Molecular; Exons; Genome; Humans; Protein Isoforms/genetics; Vertebrates/genetics
16.
PLoS Comput Biol ; 11(6): e1004325, 2015 Jun.
Article in English | MEDLINE | ID: mdl-26061177

ABSTRACT

Alternative splicing of messenger RNA can generate a wide variety of mature RNA transcripts, and these transcripts may produce protein isoforms with diverse cellular functions. While there is much supporting evidence for the expression of alternative transcripts, the same is not true for the alternatively spliced protein products. Large-scale mass spectrometry experiments have identified evidence of alternative splicing at the protein level, but with conflicting results. Here we carried out a rigorous analysis of the peptide evidence from eight large-scale proteomics experiments to assess the scale of alternative splicing that is detectable by high-resolution mass spectrometry. We find fewer splice events than would be expected: we identified peptides for almost 64% of human protein-coding genes, but detected just 282 splice events. These data suggest that most genes have a single dominant isoform at the protein level. Many of the alternative isoforms that we could identify were only subtly different from the main splice isoform. Very few of the splice events identified at the protein level disrupted functional domains, in stark contrast to the two thirds of splice events annotated in the human genome that would lead to the loss or damage of functional domains. The most striking result was that more than 20% of the splice isoforms we identified were generated by substituting one homologous exon for another. This is significantly more than would be expected from the frequency of these events in the genome. These homologous exon substitution events were remarkably conserved (all the homologous exons we identified evolved over 460 million years ago), and eight of the fourteen tissue-specific splice isoforms we identified were generated from homologous exons. The combination of proteomics evidence, ancient origin and tissue-specific splicing indicates that isoforms generated from homologous exons may have important cellular roles.


Subject(s)
Alternative Splicing/genetics; Exons/genetics; Protein Isoforms/genetics; Amino Acid Sequence; Animals; Computational Biology; Databases, Genetic; Humans; Mice; Models, Molecular; Molecular Sequence Data; Organ Specificity/genetics; Peptides/chemistry; Peptides/genetics; Peptides/metabolism; Protein Conformation; Protein Isoforms/chemistry; Protein Isoforms/metabolism; Sequence Alignment; Sequence Analysis, DNA
17.
Nucleic Acids Res ; 42(Database issue): D267-72, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24243844

ABSTRACT

FireDB (http://firedb.bioinfo.cnio.es) is a curated inventory of catalytic and biologically relevant small ligand-binding residues culled from the protein structures in the Protein Data Bank. Here we present the important new additions since the publication of FireDB in 2007. The database now contains an extensive list of manually curated biologically relevant compounds. Biologically relevant compounds are informative because of their role in protein function, but they are only a small fraction of the entire ligand set. For the remaining ligands, FireDB provides cross-references to the annotations from publicly available biological, chemical and pharmacological compound databases. FireDB now has external references for 95% of contacting small ligands, making FireDB a more complete database and providing the scientific community with easy access to the pharmacological annotations of PDB ligands. In addition to the manual curation of ligands, FireDB also provides insights into the biological relevance of individual binding sites. Here, biological relevance is calculated from the multiple sequence alignments of related binding sites that are generated from an all-against-all comparison of each FireDB binding site. The database can be accessed by RESTful web services and is available for download via MySQL.


Subject(s)
Catalytic Domain; Databases, Protein; Proteins/chemistry; Binding Sites; Evolution, Molecular; Internet; Ligands; Molecular Sequence Annotation; Pharmaceutical Preparations/chemistry; Proteins/genetics
18.
J Proteome Res ; 14(4): 1880-7, 2015 Apr 03.
Article in English | MEDLINE | ID: mdl-25732134

ABSTRACT

Although eukaryotic cells express a wide range of alternatively spliced transcripts, it is not clear whether genes tend to express a range of transcripts simultaneously across cells, or produce dominant isoforms in a manner that is either tissue-specific or regardless of tissue. To date, large-scale investigations into the pattern of transcript expression across distinct tissues have produced contradictory results. Here, we attempt to determine whether genes express a dominant splice variant at the protein level. We interrogate peptides from eight large-scale human proteomics experiments and databases and find that there is a single dominant protein isoform, irrespective of tissue or cell type, for the vast majority of the protein-coding genes in these experiments, in partial agreement with the conclusions from the most recent large-scale RNAseq study. Remarkably, the dominant isoforms from the experimental proteomics analyses coincided overwhelmingly with the reference isoforms selected by two completely orthogonal sources, the consensus coding sequence variants, which are agreed upon by separate manual genome curation teams, and the principal isoforms from the APPRIS database, predicted automatically from the conservation of protein sequence, structure, and function.


Subject(s)
Open Reading Frames/genetics; Peptides/genetics; Protein Isoforms/genetics; Proteomics/methods; Computational Biology; Databases, Protein; Humans
19.
Genome Res ; 22(7): 1231-42, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22588898

ABSTRACT

Chimeric RNAs comprise exons from two or more different genes and have the potential to encode novel proteins that alter cellular phenotypes. To date, numerous putative chimeric transcripts have been identified among the ESTs isolated from several organisms and using high throughput RNA sequencing. The few corresponding protein products that have been characterized mostly result from chromosomal translocations and are associated with cancer. Here, we systematically establish that some of the putative chimeric transcripts are genuinely expressed in human cells. Using high throughput RNA sequencing, mass spectrometry experimental data, and functional annotation, we studied 7424 putative human chimeric RNAs. We confirmed the expression of 175 chimeric RNAs in 16 human tissues, with an abundance varying from 0.06 to 17 RPKM (Reads Per Kilobase per Million mapped reads). We show that these chimeric RNAs are significantly more tissue-specific than non-chimeric transcripts. Moreover, we present evidence that chimeras tend to incorporate highly expressed genes. Despite the low expression level of most chimeric RNAs, we show that 12 novel chimeras are translated into proteins detectable in multiple shotgun mass spectrometry experiments. Furthermore, we confirm the expression of three novel chimeric proteins using targeted mass spectrometry. Finally, based on our functional annotation of exon organization and preserved domains, we discuss the potential features of chimeric proteins with illustrative examples and suggest that chimeras significantly exploit signal peptides and transmembrane domains, which can alter the cellular localization of cognate proteins. Taken together, these findings establish that some chimeric RNAs are translated into potentially functional proteins in humans.


Subject(s)
Genome, Human; Mutant Chimeric Proteins/genetics; Protein Biosynthesis; Amino Acid Sequence; Cell Membrane/genetics; Cell Membrane/metabolism; Databases, Nucleic Acid; Exons; Gene Expression Regulation; High-Throughput Nucleotide Sequencing; Humans; Mass Spectrometry/methods; Molecular Sequence Annotation; Molecular Sequence Data; Mutant Chimeric Proteins/metabolism; Organ Specificity; Protein Sorting Signals; Protein Structure, Secondary; Protein Structure, Tertiary; Proteomics/methods; RNA, Messenger/genetics; RNA, Messenger/metabolism; Sequence Analysis, RNA/methods; Structure-Activity Relationship
20.
Genome Res ; 22(9): 1760-74, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955987

ABSTRACT

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.


Subject(s)
Databases, Genetic; Genome, Human; Genomics/methods; Molecular Sequence Annotation; Animals; Computational Biology/methods; DNA, Complementary/chemistry; DNA, Complementary/genetics; Evolution, Molecular; Exons; Genetic Loci; Humans; Internet; Models, Molecular; Open Reading Frames; Pseudogenes; Quality Control; RNA Splice Sites; RNA, Long Noncoding; Reproducibility of Results; Untranslated Regions