ABSTRACT
BACKGROUND: Although transcription in mammalian genomes can initiate from various genomic positions (e.g., 3'UTR, coding exons, etc.), most locations on genomes are not prone to transcription initiation. It is of practical and theoretical interest to be able to estimate such collections of non-TSS locations (NTLs). The identification of large portions of NTLs can help focus the search for TSS locations and thus contribute to promoter and gene finding. It can also help in the assessment of 5' completeness of expressed sequences and contribute to more successful experimental designs and more accurate gene annotation. METHODOLOGY: Using comprehensive collections of Cap Analysis of Gene Expression (CAGE) and other transcript data from mouse and human genomes, we developed a methodology that allows us, by performing computational TSS prediction with very high sensitivity, to annotate, with high accuracy and in a strand-specific manner, locations of mammalian genomes that are highly unlikely to harbor transcription start sites (TSSs). The properties of the immediate genomic neighborhood of 98,682 accurately determined mouse and 113,814 human TSSs are used to determine features that distinguish genomic transcription initiation locations from those that are not likely to initiate transcription. In our algorithm we utilize various constraining properties of features identified in the upstream and downstream regions around TSSs, as well as statistical analyses of these surrounding regions. CONCLUSIONS: Our analysis of human chromosomes 4, 21 and 22 estimates ~46%, ~41% and ~27% of these chromosomes, respectively, as being NTLs. This suggests that on average more than 40% of the human genome can be expected to be highly unlikely to initiate transcription. Our method is the first to use high-sensitivity TSS prediction to identify, with high accuracy, large portions of mammalian genomes as NTLs.
The server with our algorithm implemented is available at http://cbrc.kaust.edu.sa/ddm/.
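The three per-chromosome NTL estimates average to 38% when each chromosome counts equally, so the "more than 40%" figure is easier to follow if the average is weighted by chromosome size. The sketch below illustrates that arithmetic only. The chromosome lengths are rounded values supplied here as an assumption, and the abstract does not say how the authors computed their average.

```python
# NTL fractions reported in the abstract for human chromosomes 4, 21 and 22.
ntl_fraction = {"chr4": 0.46, "chr21": 0.41, "chr22": 0.27}

# ASSUMPTION: rounded chromosome lengths in megabases, not taken from the abstract.
length_mb = {"chr4": 191, "chr21": 47, "chr22": 50}


def unweighted_mean(fractions):
    """Average in which every chromosome counts equally."""
    return sum(fractions.values()) / len(fractions)


def length_weighted_mean(fractions, lengths):
    """Average in which each chromosome counts in proportion to its length."""
    total_length = sum(lengths[chrom] for chrom in fractions)
    covered = sum(fractions[chrom] * lengths[chrom] for chrom in fractions)
    return covered / total_length


if __name__ == "__main__":
    print(f"unweighted:      {unweighted_mean(ntl_fraction):.3f}")
    print(f"length-weighted: {length_weighted_mean(ntl_fraction, length_mb):.3f}")
```

Under these assumed lengths the weighted value exceeds 0.40 because chromosome 4, the largest of the three, also has the highest NTL fraction.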
Subjects
Algorithms , Computational Biology/methods , Promoter Regions, Genetic/genetics , Transcription Initiation Site , Animals , Base Sequence , Chromosomes, Human, Pair 21/genetics , Chromosomes, Human, Pair 22/genetics , Chromosomes, Human, Pair 4/genetics , Genome/genetics , Genome, Human/genetics , Humans , Internet , Mice , Molecular Sequence Data , Receptors, Opioid, mu/genetics , Reproducibility of Results , Transcription, Genetic
ABSTRACT
Combinatorial interactions among transcription factors are critical to directing tissue-specific gene expression. To build a global atlas of these combinations, we have screened for physical interactions among the majority of human and mouse DNA-binding transcription factors (TFs). The complete networks contain 762 human and 877 mouse interactions. Analysis of the networks reveals that highly connected TFs are broadly expressed across tissues, and that roughly half of the measured interactions are conserved between mouse and human. The data highlight the importance of TF combinations for determining cell fate, and they lead to the identification of a SMAD3/FLI1 complex expressed during development of immunity. The availability of large TF combinatorial networks in both human and mouse will provide many opportunities to study gene regulation, tissue differentiation, and mammalian evolution.
Subjects
Gene Expression Regulation , Gene Regulatory Networks , Transcription Factors/metabolism , Animals , Cell Differentiation , Evolution, Molecular , Humans , Mice , Monocytes/cytology , Organ Specificity , Smad3 Protein/metabolism , Trans-Activators/metabolism
ABSTRACT
BACKGROUND: Wheat is an allopolyploid plant that harbors a huge, complex genome. Therefore, accumulation of expressed sequence tags (ESTs) for wheat is becoming particularly important for functional genomics and molecular breeding. We prepared a comprehensive collection of ESTs from the various tissues that develop during the wheat life cycle and from tissues subjected to stress. We also examined their expression profiles in silico. As full-length cDNAs are indispensable to certify the collected ESTs and annotate the genes in the wheat genome, we performed a systematic survey and sequencing of the full-length cDNA clones. This sequence information is a valuable genetic resource for functional genomics and will enable comparative genomics in cereals. RESULTS: As part of functional genomics and the development of wheat genomic resources, we have generated a collection of full-length cDNAs from common wheat. By grouping the ESTs of recombinant clones randomly selected from the full-length cDNA library, we were able to sequence 6,162 independent clones with high accuracy. About 10% of the clones were wheat-unique genes, without any counterparts within the DNA database. Wheat clones that showed high homology to those of rice were selected in order to investigate their expression patterns in various tissues throughout the wheat life cycle and in response to abiotic-stress treatments. To assess the variability of genes that have evolved differently in wheat and rice, we calculated the substitution rate (Ka/Ks) of the counterparts in wheat and rice. Genes that were preferentially expressed in certain tissues or treatments had higher Ka/Ks values than those in other tissues and treatments, which suggests that the more variable genes expressed in these tissues are under adaptive selection.
CONCLUSION: We have generated a high-quality full-length cDNA resource for common wheat, which is essential for continuation of the ongoing curation and annotation of the wheat genome. The data for each clone's expression in various tissues and stress treatments and its variability in wheat and rice as a result of their diversification are valuable tools for functional genomics in wheat and for comparative genomics in cereals.
Subjects
Adaptation, Biological/genetics , Evolution, Molecular , Oryza/genetics , Salt-Tolerant Plants/genetics , Triticum/genetics , DNA, Complementary/genetics , DNA, Plant/genetics , Expressed Sequence Tags , Gene Expression Profiling , Gene Expression Regulation, Plant , Gene Library , Genes, Plant , Genomics , Sequence Analysis, DNA , Stress, Physiological
ABSTRACT
BACKGROUND: Small RNA attracts increasing interest based on the discovery of RNA silencing and the rapid progress of our understanding of these phenomena. Although recent studies suggest the possible existence of yet undiscovered types of small RNAs in higher organisms, many studies to profile small RNA have focused on miRNA and/or siRNA rather than on the exploration of additional classes of RNAs. RESULTS: Here, we explored human small RNAs by unbiased sequencing of RNAs with sizes of 19-40 nt. We provide substantial evidence for the existence of independent classes of small RNAs. Our data show that well-characterized non-coding RNAs, such as tRNA, snoRNA, and snRNA, are cleaved at sites specific to the class of ncRNA. In particular, tRNA cleavage is regulated depending on tRNA type and tissue expression. We also found small RNAs mapped to genomic regions that are transcribed in both directions by bidirectional promoters, indicating that the small RNAs are a product of dsRNA formation and its subsequent cleavage. Their partial similarity with ribosomal RNAs (rRNAs) suggests unrevealed functions of ribosomal DNA or interstitial rRNA. Further examination revealed six novel miRNAs. CONCLUSION: Our results underscore the complexity of the small RNA world and the biogenesis of small RNAs.
Subjects
Evolution, Molecular , RNA/genetics , RNA/metabolism , Base Pairing , Base Sequence , Blotting, Northern , Gene Library , Humans , Molecular Sequence Data , Multigene Family/genetics , RNA/classification , Sequence Alignment , Sequence Analysis, RNA
ABSTRACT
BACKGROUND: The nucleus is a complex cellular organelle and accurately defining its protein content is essential before any systematic characterization can be considered. RESULTS: We report direct evidence for 2,568 mammalian proteins within the nuclear proteome: the nuclear subcellular localization of 1,529 proteins based on a high-throughput subcellular localization protocol of full-length proteins and an additional 1,039 proteins for which clear experimental evidence is documented in published literature. This is direct evidence that the nuclear proteome consists of at least 14% of the entire proteome. This dataset was used to evaluate computational approaches designed to identify additional nuclear proteins. CONCLUSION: This represents direct experimental evidence that the nuclear proteome consists of at least 14% of the entire proteome. This high-quality nuclear proteome dataset was used to evaluate computational approaches designed to identify additional nuclear proteins. Based on this analysis, researchers can determine the stringency and types of lines of evidence they consider to infer the size and complement of the nuclear proteome.
Subjects
Cell Nucleus/chemistry , Proteome , Animals , Computational Biology/methods , Humans , Nuclear Proteins
ABSTRACT
Many genes are arranged in complex overlapping and interlaced patterns in eukaryotic genomes. It is unclear whether or how such genes can avoid interference from each other's RNA processing signals and retain distinct identities. This puzzle applies particularly to 3' end formation sites, which inherently terminate the transcript, and thus act as boundaries between adjacent genes. We hypothesise that the transcript processing machinery can bypass 3' end formation sites by splicing out an intron surrounding the site. We confirm a prediction of this hypothesis: the likelihood of transcripts extending beyond 3' end sites depends on the strength of 3' end formation signals located in exons in the mature transcript, but not of those in introns that are spliced out of the transcript. This bypassing mechanism permits nested and interleaved gene architectures, as well as fusion transcripts that combine exons from adjacent genes.
Subjects
3' Untranslated Regions/genetics , Alternative Splicing/genetics , Models, Genetic , Animals , Chromosomes, Mammalian , DNA, Complementary , Exons , Expressed Sequence Tags , Genome , Introns , Mice , Polyadenylation/genetics , RNA, Messenger/metabolism , Transcription, Genetic
ABSTRACT
The survival of motor neuron (SMN) protein, responsible for the neurodegenerative disease spinal muscular atrophy (SMA), oligomerizes and forms a stable complex with seven other major components, the Gemin proteins. Besides the SMN protein, Gemin2 is a core protein that is essential for the formation of the SMN complex, although the mechanism by which it drives formation is unclear. We have found a novel interaction, a Gemin2 self-association, using a mammalian two-hybrid system and in vitro pull-down assays. Using in vitro dissociation assays, we also found that the self-interaction of the amino-terminal SMN protein, which was confirmed in this study, became stable in the presence of Gemin2. In addition, Gemin2 knockdown using small interfering RNA treatment revealed a drastic decrease in SMN oligomer formation and in the assembly activity of spliceosomal small nuclear ribonucleoprotein (snRNP). Taken together, these results indicate that Gemin2 plays an important role in snRNP assembly through the stabilization of the SMN oligomer/complex via novel self-interaction. Applying these results and techniques to amino-terminal SMN missense mutants that were recently identified from SMA patients, we showed that amino-terminal self-association, Gemin2 binding, the stabilization effect of Gemin2, and snRNP assembly activity were all lowered in the mutant SMN(D44V), suggesting that instability of the amino-terminal SMN self-association may cause SMA in patients carrying this allele.
Subjects
Cyclic AMP Response Element-Binding Protein/metabolism , Nerve Tissue Proteins/metabolism , RNA-Binding Proteins/metabolism , Animals , Cyclic AMP Response Element-Binding Protein/genetics , HeLa Cells , Humans , Mice , Mutation/genetics , Nerve Tissue Proteins/genetics , Protein Binding , RNA-Binding Proteins/genetics , Ribonucleoproteins, Small Nuclear/metabolism , SMN Complex Proteins
ABSTRACT
Hair cells express a complement of ion channels, representing shared and distinct channels that confer distinct electrophysiological signatures for each cell. This diversity is generated by the use of alternative splicing in the alpha subunit, formation of heterotetrameric channels, and combinatorial association with beta subunits. These channels are thought to play a role in the tonotopic gradient observed in the mammalian cochlea. Mouse Kcnma1 transcripts, 5' and 3' ESTs, and genomic sequences were examined for the utilization of alternative splicing in the mouse transcriptome. Comparative genomic analyses investigated the conservation of KCNMA1 splice sites. Genomes of mouse, rat, human, opossum, chicken, frog and zebrafish established that the exon-intron structure and mechanism of KCNMA1 alternative splicing were highly conserved with 6-7 splice sites being utilized. The murine Kcnma1 utilized 6 out of 7 potential splice sites. RT-PCR experiments using murine gene-specific oligonucleotide primers analyzed the scope and variety of Kcnma1 and Kcnmb1-4 expression profiles in the cochlea and inner ear hair cells. In the cochlea splice variants were present representing sites 3, 4, 6, and 7, while site 1 was insertionless and site 2 utilized only exon 10. However, site 5 was not present. Detection of KCNMA1 transcripts and protein exhibited a quantitative longitudinal gradient with a reciprocal gradient found between inner and outer hair cells. Differential expression was also observed in the usage of the long form of the carboxy-terminus tail. These results suggest that a diversity of splice variants exist in rodent cochlear hair cells and this diversity is similar to that observed for non-mammalian vertebrate hair cells, such as chicken and turtle.
Subjects
Gene Expression Profiling , Genetic Variation , Hair Cells, Auditory, Inner/metabolism , Large-Conductance Calcium-Activated Potassium Channel alpha Subunits/genetics , Transcription, Genetic , Alternative Splicing/genetics , Animals , Conserved Sequence , Humans , In Situ Hybridization , Large-Conductance Calcium-Activated Potassium Channel alpha Subunits/biosynthesis , Mice , Rats
ABSTRACT
BACKGROUND: Mammalian promoters do not initiate transcription at single, well defined base pairs, but rather at multiple, alternative start sites spread across a region. We previously characterized the static structures of transcription start site usage within promoters at the base pair level, based on large-scale sequencing of transcript 5' ends. RESULTS: In the present study we begin to explore the internal dynamics of mammalian promoters, and demonstrate that start site selection within many mouse core promoters varies among tissues. We also show that this dynamic usage of start sites is associated with CpG islands, broad and multimodal promoter structures, and imprinting. CONCLUSION: Our results reveal a new level of biologic complexity within promoters--fine-scale regulation of transcription starting events at the base pair level. These events are likely to be related to epigenetic transcriptional regulation.
Subjects
Promoter Regions, Genetic , Transcription, Genetic , Animals , CpG Islands , DNA Methylation , Mice , Multigene Family
ABSTRACT
Several recent studies indicate that mammals and other organisms produce large numbers of RNA transcripts that do not correspond to known genes. It has been suggested that these transcripts do not encode proteins, but may instead function as RNAs. However, discrimination of coding and non-coding transcripts is not straightforward, and different laboratories have used different methods, whose ability to perform this discrimination is unclear. In this study, we examine ten bioinformatic methods that assess protein-coding potential and compare their ability and congruency in the discrimination of non-coding from coding sequences, based on four underlying principles: open reading frame size, sequence similarity to known proteins or protein domains, statistical models of protein-coding sequence, and synonymous versus non-synonymous substitution rates. Despite these different approaches, the methods show broad concordance, suggesting that coding and non-coding transcripts can, in general, be reliably discriminated, and that many of the recently discovered extra-genic transcripts are indeed non-coding. Comparison of the methods indicates reasons for unreliable predictions, and approaches to increase confidence further. Conversely and surprisingly, our analyses also provide evidence that as much as approximately 10% of entries in the manually curated protein database Swiss-Prot are erroneous translations of actually non-coding transcripts.
Subjects
Biochemistry/methods , Genetic Techniques , RNA, Messenger/chemistry , RNA, Untranslated/chemistry , Algorithms , Animals , Computational Biology , DNA, Complementary/metabolism , Data Interpretation, Statistical , Databases, Protein , Expressed Sequence Tags , Mice , Open Reading Frames , Protein Structure, Tertiary , Proteins/chemistry , RNA, Messenger/genetics , RNA, Untranslated/genetics
ABSTRACT
BACKGROUND: The TATA box, one of the most well studied core promoter elements, is associated with induced, context-specific expression. The lack of precise transcription start site (TSS) locations linked with expression information has impeded genome-wide characterization of the interaction between TATA and the pre-initiation complex. RESULTS: Using a comprehensive set of 5.66 x 10(6) sequenced 5' cDNA ends from diverse tissues mapped to the mouse genome, we found that the TATA-TSS distance is correlated with the tissue specificity of the downstream transcript. To achieve tissue-specific regulation, the TATA box position relative to the TSS is constrained to a narrow window (-32 to -29), where positions -31 and -30 are the optimal positions for achieving high tissue specificity. Slightly larger spacings can be accommodated only when there is no optimally spaced initiation signal; in contrast, the TATA box like motifs found downstream of position -28 are generally nonfunctional. The strength of the TATA binding protein-DNA interaction plays a subordinate role to spacing in terms of tissue specificity. Furthermore, promoters with different TATA-TSS spacings have distinct features in terms of consensus sequence around the initiation site and distribution of alternative TSSs. Unexpectedly, promoters that have two dominant, consecutive TSSs are TATA depleted and have a novel GGG initiation site consensus. CONCLUSION: In this report we present the most comprehensive characterization of TATA-TSS spacing and functionality to date. The coupling of spacing to tissue specificity at the transcriptome level provides important clues as to the function of core promoters and the choice of TSS by the pre-initiation complex.
Subjects
Gene Expression Regulation/genetics , Promoter Regions, Genetic/genetics , TATA Box/genetics , Transcription Initiation Site , Animals , Computer Simulation , Expressed Sequence Tags/metabolism , Gene Library , Genomics , Mice , Models, Genetic
ABSTRACT
The mammalian transcriptome harbours shadowy entities that resist classification and analysis. In analogy with pseudogenes, we define pseudo-messenger RNA to be RNA molecules that resemble protein-coding mRNA, but cannot encode full-length proteins owing to disruptions of the reading frame. Using a rigorous computational pipeline, which rules out sequencing errors, we identify 10,679 pseudo-messenger RNAs (approximately half of which are transposon-associated) among the 102,801 FANTOM3 mouse cDNAs: just over 10% of the FANTOM3 transcriptome. These comprise not only transcribed pseudogenes, but also disrupted splice variants of otherwise protein-coding genes. Some may encode truncated proteins, only a minority of which appear subject to nonsense-mediated decay. The presence of an excess of transcripts whose only disruptions are opal stop codons suggests that there are more selenoproteins than currently estimated. We also describe compensatory frameshifts, where a segment of the gene has changed frame but remains translatable. In summary, we survey a large class of non-standard but potentially functional transcripts that are likely to encode genetic information and effect biological processes in novel ways. Many of these transcripts do not correspond cleanly to any identifiable object in the genome, implying fundamental limits to the goal of annotating all functional elements at the genome sequence level.
Subjects
RNA, Messenger/genetics , Transcription, Genetic , Animals , DNA Transposable Elements , Evolution, Molecular , Humans , Mice , Promoter Regions, Genetic , Proteins/genetics , Pseudogenes , Reproducibility of Results , Sequence Alignment
ABSTRACT
We have surveyed the evolutionary trends of mammalian promoters and upstream sequences, utilising large sets of experimentally supported transcription start sites (TSSs). With 30,969 well-defined TSSs from mouse and 26,341 from human, there are sufficient numbers to draw statistically meaningful conclusions and to consider differences between promoter types. Unlike previous smaller studies, we have considered the effects of insertions, deletions, and transposable elements as well as nucleotide substitutions. The rate of promoter evolution relative to that of control sequences has not been consistent between lineages nor within lineages over time. The most pronounced manifestation of this heterotachy is the increased rate of evolution in primate promoters. This increase is seen across different classes of mutation, including substitutions and micro-indel events. We investigated the relationship between promoter and coding sequence selective constraint and suggest that they are generally uncorrelated. This analysis also identified a small number of mouse promoters associated with the immune response that are under positive selection in rodents. We demonstrate significant differences in divergence between functional promoter categories and identify a category of promoters, not associated with conventional protein-coding genes, that has the highest rates of divergence across mammals. We find that evolutionary rates vary both on a fine scale within mammalian promoters and also between different functional classes of promoters. The discovery of heterotachy in promoter evolution, in particular the accelerated evolution of primate promoters, has important implications for our understanding of human evolution and for strategies to detect primate-specific regulatory elements.
Subjects
Evolution, Molecular , Primates/genetics , Promoter Regions, Genetic , Transcription, Genetic , Animals , Base Sequence , Chromosome Mapping , DNA Transposable Elements , Genetic Engineering , Genetic Variation , Genome , Humans , Mice , Primates/anatomy & histology , Proteins/genetics , Sequence Analysis, DNA , Sequence Deletion
ABSTRACT
Non-protein-coding RNAs (ncRNAs) are increasingly being recognized as having important regulatory roles. Although much recent attention has focused on tiny 22- to 25-nucleotide microRNAs, several functional ncRNAs are orders of magnitude larger in size. Examples of such macro ncRNAs include Xist and Air, which in mouse are 18 and 108 kilobases (Kb), respectively. We surveyed the 102,801 FANTOM3 mouse cDNA clones and found that Air and Xist were present not as single, full-length transcripts but as a cluster of multiple, shorter cDNAs, which were unspliced, had little coding potential, and were most likely primed from internal adenine-rich regions within longer parental transcripts. We therefore conducted a genome-wide search for regional clusters of such cDNAs to find novel macro ncRNA candidates. Sixty-six regions were identified, each of which mapped outside known protein-coding loci and which had a mean length of 92 Kb. We detected several known long ncRNAs within these regions, supporting the basic rationale of our approach. In silico analysis showed that many regions had evidence of imprinting and/or antisense transcription. These regions were significantly associated with microRNAs and transcripts from the central nervous system. We selected eight novel regions for experimental validation by northern blot and RT-PCR and found that the majority represent previously unrecognized noncoding transcripts that are at least 10 Kb in size and predominantly localized in the nucleus. Taken together, the data not only identify multiple new ncRNAs but also suggest the existence of many more macro ncRNAs like Xist and Air.
Subjects
RNA, Untranslated/genetics , Transcription, Genetic , Animals , Computational Biology , DNA, Complementary/genetics , Expressed Sequence Tags , Gene Expression Regulation , Genome , Genome, Human , Humans , Mice , Multigene Family , RNA, Long Noncoding , Reverse Transcriptase Polymerase Chain Reaction
ABSTRACT
With the advancement of genome research, it is becoming clear that genes are not distributed on the genome in random order. Clusters of genes distributed at localized genome positions have been reported in several eukaryotes. Various correlations have been observed between the expressions of genes in adjacent or nearby positions along the chromosomes depending on tissue type and developmental stage. Moreover, in several cases, their transcripts, which control epigenetic transcription via processes such as transcriptional interference and genomic imprinting, occur in clusters. It is reasonable that genomic regions that have similar mechanisms show similar expression patterns and that the characteristics of expression in the same genomic regions differ depending on tissue type and developmental stage. In this study, we analyzed gene expression patterns using the cap analysis gene expression (CAGE) method for exploring systematic views of the mouse transcriptome. Counting the number of mapped CAGE tags for fixed-length regions allowed us to determine genomic expression levels. These expression levels were normalized, quantified, and converted into four types of descriptors, allowing the expression patterns along the genome to be represented by character strings. We analyzed them using dynamic programming in the same manner as for sequence analysis. We have developed a novel algorithm that provides a novel view of the genome from the perspective of genomic positional expression. In a similarity search of expression patterns across chromosomes and tissues, we found regions that had clusters of genes that showed expression patterns similar to each other depending on tissue type. Our results suggest the possibility that the regions that have sense-antisense transcription show similar expression patterns between forward and reverse strands.
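The pipeline in this abstract (count CAGE tags per fixed-length region, normalize, convert to four descriptors, compare the resulting strings by dynamic programming) can be sketched as follows. This is an illustration only. The bin size, the normalization to the regional maximum, the three thresholds, the letters A to D, and the Smith-Waterman scoring values are all assumptions made here, since the abstract does not give the authors' actual parameters.

```python
def bin_counts(tag_positions, region_length, bin_size):
    """Count mapped tags in consecutive fixed-length bins of one region."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in tag_positions:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts


def encode(counts, thresholds=(0.0, 0.25, 0.5)):
    """Map normalized counts to four descriptors.

    ASSUMPTION: counts are scaled by the regional maximum and cut at
    hypothetical thresholds; 'A' means no expression, 'D' the highest level.
    """
    peak = max(counts) if counts else 0
    letters = []
    for count in counts:
        level = count / peak if peak else 0.0
        if level <= thresholds[0]:
            letters.append("A")
        elif level <= thresholds[1]:
            letters.append("B")
        elif level <= thresholds[2]:
            letters.append("C")
        else:
            letters.append("D")
    return "".join(letters)


def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two strings."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    forward = encode(bin_counts([5, 12, 15, 99], 100, 10))
    reverse = encode(bin_counts([3, 14, 16, 95], 100, 10))
    print(forward, reverse, local_alignment_score(forward, reverse))
```

Encoding expression as strings is what lets standard sequence-alignment machinery be reused for expression patterns; a different descriptor scheme would only change `encode`.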
Subjects
Chromosome Mapping/methods , Genome , Mice/genetics , Transcription, Genetic , Algorithms , Animals , Base Composition , Gene Expression Regulation , Genome, Human , Humans , Macrophages/physiology , MicroRNAs/genetics , Models, Genetic , RNA, Untranslated/genetics
ABSTRACT
Among the most common splice variations are small exon length variations caused by the use of alternative donor or acceptor splice sites that are in very close proximity on the pre-mRNA. Among these, three-nucleotide variations at so-called NAGNAG tandem acceptor sites have recently attracted considerable attention, and it has been suggested that these variations are regulated and serve to fine-tune protein forms by the addition or removal of a single amino acid. In this paper we first show that in-frame exon length variations are generally overrepresented and that this overrepresentation can be quantitatively explained by the effect of nonsense-mediated decay. Our analysis allows us to estimate that about 50% of frame-shifted coding transcripts are targeted by nonsense-mediated decay. Second, we show that a simple physical model that assumes that the splicing machinery stochastically binds to nearby splice sites in proportion to the affinities of the sites correctly predicts the relative abundances of different small length variations at both boundaries. Finally, using the same simple physical model, we show that for NAGNAG sites, the difference in affinities of the neighboring sites for the splicing machinery accurately predicts whether splicing will occur only at the first site, splicing will occur only at the second site, or three-nucleotide splice variants are likely to occur. Our analysis thus suggests that small exon length variations are the result of stochastic binding of the spliceosome at neighboring splice sites. Small exon length variations occur when there are nearby alternative splice sites that have similar affinity for the splicing machinery.
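The physical model described in this abstract, in which the splicing machinery binds nearby sites in proportion to their affinities, reduces to a simple normalization. The sketch below shows that idea and a three-way NAGNAG call. The affinity values and the `min_minor_fraction` cutoff are hypothetical, and the abstract does not say how the authors derive affinities or where they draw the line between single-site and dual-site usage.

```python
def site_usage(affinities):
    """Expected usage of each competing splice site, proportional to affinity."""
    total = sum(affinities)
    if total <= 0:
        raise ValueError("at least one site must have positive affinity")
    return [a / total for a in affinities]


def classify_nagnag(a_first, a_second, min_minor_fraction=0.05):
    """Call the expected outcome at a tandem acceptor from two site affinities.

    ASSUMPTION: a site used in less than `min_minor_fraction` of transcripts
    is treated as effectively unused; the cutoff is illustrative only.
    """
    p_first, p_second = site_usage([a_first, a_second])
    if p_second < min_minor_fraction:
        return "first site only"
    if p_first < min_minor_fraction:
        return "second site only"
    return "both sites (three-nucleotide variants)"


if __name__ == "__main__":
    print(site_usage([3.0, 1.0]))
    print(classify_nagnag(100.0, 1.0))
    print(classify_nagnag(2.0, 1.0))
```

Similar affinities give comparable usage of both sites, which matches the abstract's closing statement that small length variations arise where neighboring sites have similar affinity.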
Subjects
Exons/genetics , Genetic Variation , Models, Genetic , Animals , Chromosome Mapping , Gene Expression Regulation , Male , Mice , Muscle, Skeletal/physiology , Organ Specificity , Prostate/physiology , Transcription, Genetic
ABSTRACT
Membrane organization describes the orientation of a protein with respect to the membrane and can be determined by the presence, or absence, and organization within the protein sequence of two features: endoplasmic reticulum signal peptides and alpha-helical transmembrane domains. These features allow protein sequences to be classified into one of five membrane organization categories: soluble intracellular proteins, soluble secreted proteins, type I membrane proteins, type II membrane proteins, and multi-spanning membrane proteins. Generation of protein isoforms with variable membrane organizations can change a protein's subcellular localization or association with the membrane. Application of MemO, a membrane organization annotation pipeline, to the FANTOM3 Isoform Protein Sequence mouse protein set revealed that within the 8,032 transcriptional units (TUs) with multiple protein isoforms, 573 had variation in their use of signal peptides, 1,527 had variation in their use of transmembrane domains, and 615 generated protein isoforms from distinct membrane organization classes. The mechanisms underlying these transcript variations were analyzed. While TUs were identified encoding all pairwise combinations of membrane organization categories, the most common was conversion of membrane proteins to soluble proteins. Observed within our high-confidence set were 156 TUs predicted to generate both extracellular soluble and membrane proteins, and 217 TUs generating both intracellular soluble and membrane proteins. The differential use of endoplasmic reticulum signal peptides and transmembrane domains is a common occurrence within the variable protein output of TUs. The generation of protein isoforms that are targeted to multiple subcellular locations represents a major functional consequence of transcript variation within the mouse transcriptome.
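The five categories named in this abstract follow from two features: whether a signal peptide is present and how many transmembrane domains are predicted. The sketch below uses the conventional textbook mapping from those features to categories. Whether MemO applies exactly these rules is an assumption, and the function names are hypothetical.

```python
def membrane_organization(has_signal_peptide, n_tm_domains):
    """Assign one of five membrane organization categories.

    ASSUMPTION: conventional rules (a signal peptide plus one transmembrane
    domain gives type I; one transmembrane domain without a signal peptide
    gives type II); not confirmed to match the MemO pipeline.
    """
    if n_tm_domains < 0:
        raise ValueError("n_tm_domains must be non-negative")
    if n_tm_domains == 0:
        return "soluble secreted" if has_signal_peptide else "soluble intracellular"
    if n_tm_domains >= 2:
        return "multi-spanning membrane"
    return "type I membrane" if has_signal_peptide else "type II membrane"


def has_variable_organization(isoforms):
    """True if the isoforms of one transcriptional unit span several categories.

    `isoforms` is an iterable of (has_signal_peptide, n_tm_domains) pairs.
    """
    categories = {membrane_organization(sp, tm) for sp, tm in isoforms}
    return len(categories) > 1


if __name__ == "__main__":
    # A unit with one membrane-anchored and one secreted isoform.
    print(has_variable_organization([(True, 1), (True, 0)]))
```

Losing a single transmembrane domain while keeping the signal peptide turns a type I membrane protein into a soluble secreted one, the kind of membrane-to-soluble conversion the abstract reports as most common.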
Subjects
Membrane Proteins/genetics , Protein Sorting Signals/genetics , Transcription, Genetic , Animals , Genetic Variation , Protein Isoforms/genetics
ABSTRACT
Mammalian genomes harbor a larger than expected number of complex loci, in which multiple genes are coupled by shared transcribed regions in antisense orientation and/or by bidirectional core promoters. To determine the incidence, functional significance, and evolutionary context of mammalian complex loci, we identified and characterized 5,248 cis-antisense pairs, 1,638 bidirectional promoters, and 1,153 chains of multiple cis-antisense and/or bidirectionally promoted pairs from 36,606 mouse transcriptional units (TUs), along with 6,141 cis-antisense pairs, 2,113 bidirectional promoters, and 1,480 chains from 42,887 human TUs. In both human and mouse, 25% of TUs resided in cis-antisense pairs, only 17% of which were conserved between the two organisms, indicating frequent species specificity of antisense gene arrangements. A sampling approach indicated that over 40% of all TUs might actually be in cis-antisense pairs, and that only a minority of these arrangements are likely to be conserved between human and mouse. Bidirectional promoters were characterized by variable transcriptional start sites and an identifiable midpoint at which overall sequence composition changed strand and the direction of transcriptional initiation switched. In microarray data covering a wide range of mouse tissues, genes in cis-antisense and bidirectionally promoted arrangement showed a higher probability of being coordinately expressed than random pairs of genes. In a case study on homeotic loci, we observed extensive transcription of nonconserved sequences on the noncoding strand, implying that the presence rather than the sequence of these transcripts is of functional importance. Complex loci are ubiquitous, host numerous nonconserved gene structures and lineage-specific exonification events, and may have a cis-regulatory impact on the member genes.
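The two arrangements counted in this abstract can be stated as interval tests on transcriptional units. The sketch below is a minimal illustration: cis-antisense pairs are taken to be opposite-strand units whose genomic spans overlap, and bidirectional pairs are divergent, non-overlapping units whose start sites lie within `max_gap` bases. The `TU` record, the half-open coordinates, and the 1,000 bp default are assumptions made here; the abstract does not give the authors' exact criteria.

```python
from itertools import combinations
from typing import NamedTuple


class TU(NamedTuple):
    """Hypothetical transcriptional unit record (0-based, half-open span)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def is_cis_antisense(a, b):
    """Opposite strands on the same chromosome with overlapping spans."""
    return (a.chrom == b.chrom and a.strand != b.strand
            and a.start < b.end and b.start < a.end)


def is_bidirectional(a, b, max_gap=1000):
    """Divergent, non-overlapping pair whose 5' ends are within max_gap bases."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    # The minus-strand unit must lie upstream of the plus-strand unit.
    gap = plus.start - minus.end
    return 0 <= gap <= max_gap


def classify_pairs(tus, max_gap=1000):
    """Return (cis-antisense pairs, bidirectional pairs) as lists of name tuples."""
    antisense, bidirectional = [], []
    for a, b in combinations(tus, 2):
        if is_cis_antisense(a, b):
            antisense.append((a.name, b.name))
        elif is_bidirectional(a, b, max_gap):
            bidirectional.append((a.name, b.name))
    return antisense, bidirectional
```

Chains of the kind the abstract mentions would then be connected components of the graph whose edges are these pairs. The all-pairs loop is quadratic and only suits small inputs; a genome-scale version would sort units by position first.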
Subjects
Chromosome Mapping; Genome; Mice; Animals; Mice/genetics; Base Pairing; DNA Primers; Genome, Human; Promoter Regions, Genetic; Reverse Transcriptase Polymerase Chain Reaction; Humans
ABSTRACT
Short proteins play key roles in cell signalling and other processes, but their abundance in the mammalian proteome is unknown. Current catalogues of mammalian proteins exhibit an artefactual discontinuity at a length of 100 aa, so that protein abundance peaks just above this length and falls off sharply below it. To clarify the abundance of short proteins, we identify proteins in the FANTOM collection of mouse cDNAs by analysing synonymous and non-synonymous substitutions with the computer program CRITICA. This analysis confirms that there is no real discontinuity at length 100. Roughly 10% of mouse proteins are shorter than 100 aa, although the majority of these are variants of proteins longer than 100 aa. We identify many novel short proteins, including a "dark matter" subset containing ones that lack detectable homology to other known proteins. Translation assays confirm that some of these novel proteins can be translated and localised to the secretory pathway.
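The distinction between synonymous and non-synonymous substitutions that this abstract relies on can be made concrete with a small sketch. This is not CRITICA's algorithm, whose scoring is considerably more involved; it is only a raw codon-by-codon count over two pre-aligned, in-frame coding sequences, using the standard genetic code. The function name and the example sequences are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(cds_a: str, cds_b: str):
    """Count differing codons as synonymous or non-synonymous.

    Both inputs are assumed to be aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    Returns (synonymous, non_synonymous).
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = non_synonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        aa_a = CODON_TABLE.get(codon_a)
        aa_b = CODON_TABLE.get(codon_b)
        if aa_a is None or aa_b is None:
            continue  # gap or ambiguous base
        if aa_a == aa_b:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# GCT -> GCC keeps alanine; AAA -> AGA changes lysine to arginine.
example_counts = count_substitutions("ATGGCTAAA", "ATGGCCAGA")
```

A region under protein-coding constraint is expected to show an excess of synonymous over non-synonymous changes, which is the kind of signal a comparative method can use to support a short open reading frame as a real protein.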
Subjects
Mice/genetics; Proteins/genetics; Proteome; Amino Acid Sequence; Animals; Artifacts; DNA, Complementary/genetics; Genetic Variation; Molecular Weight; Open Reading Frames; Protein Biosynthesis; Reproducibility of Results; Sequence Homology, Amino Acid
ABSTRACT
Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them. These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955). Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools.
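As a rough illustration of classifying TSSs by the composition of their immediate surroundings, the sketch below labels a TSS by whether the upstream and downstream flanks are GC-rich or AT-rich, which yields four possible types. The abstract does not state how its four types are defined, so the two-flank GC criterion, the 100 bp `flank` width, the 0.5 `threshold`, and the function names are all assumptions made here, not the authors' definitions.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C among unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    known = sum(seq.count(base) for base in "ACGT")
    if known == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / known


def classify_tss(sequence: str, tss_index: int,
                 flank: int = 100, threshold: float = 0.5) -> str:
    """Label a TSS by the composition of its upstream and downstream flanks.

    sequence is read 5' to 3' on the transcribed strand and tss_index is the
    0-based position of the TSS. Returns "GC-GC", "GC-AT", "AT-GC" or "AT-AT"
    (upstream label first). flank and threshold are assumed values.
    """
    upstream = sequence[max(0, tss_index - flank):tss_index]
    downstream = sequence[tss_index:tss_index + flank]
    up_label = "GC" if gc_fraction(upstream) >= threshold else "AT"
    down_label = "GC" if gc_fraction(downstream) >= threshold else "AT"
    return f"{up_label}-{down_label}"


# Toy sequence: 100 bp of AT repeats followed by 100 bp of GC repeats.
toy_sequence = "AT" * 50 + "GC" * 50
toy_type = classify_tss(toy_sequence, tss_index=100)
```

With labels of this kind assigned to every TSS, associations with initiating dinucleotides, CpG islands, or TATA boxes could then be tabulated per type.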