ABSTRACT
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (~30 tissues × ~15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
Subject(s)
Epigenome, Quantitative Trait Loci, Genome-Wide Association Study, Genomics, Phenotype, Single Nucleotide Polymorphism
ABSTRACT
Missense mutations in the p53 tumor suppressor inactivate its antiproliferative properties but can also promote metastasis through a gain-of-function activity. We show that sustained expression of mutant p53 is required to maintain the prometastatic phenotype of a murine model of pancreatic cancer, a highly metastatic disease that frequently displays p53 mutations. Transcriptional profiling and functional screening identified the platelet-derived growth factor receptor β (PDGFRβ) as both necessary and sufficient to mediate these effects. Mutant p53 induced PDGFRβ through a cell-autonomous mechanism involving inhibition of a p73/NF-Y complex that represses PDGFRβ expression in p53-deficient, noninvasive cells. Blocking PDGFRβ signaling by RNA interference or by small molecule inhibitors prevented pancreatic cancer cell invasion in vitro and metastasis formation in vivo. Finally, high PDGFRβ expression correlates with poor disease-free survival in pancreatic, colon, and ovarian cancer patients, implicating PDGFRβ as a prognostic marker and possible target for attenuating metastasis in p53 mutant tumors.
Subject(s)
Pancreatic Ductal Carcinoma/metabolism, Neoplasm Metastasis, Pancreatic Neoplasms/metabolism, Platelet-Derived Growth Factor Receptor beta/metabolism, Tumor Suppressor Protein p53/metabolism, Animals, Pancreatic Ductal Carcinoma/pathology, Animal Disease Models, Gene Expression Profiling, Humans, Mice, Pancreatic Neoplasms/genetics, Pancreatic Neoplasms/pathology, Tumor Suppressor Protein p53/genetics
ABSTRACT
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
Subject(s)
DNA/genetics, Genetic Databases, Genome/genetics, Genomics, Molecular Sequence Annotation, Registries, Nucleic Acid Regulatory Sequences/genetics, Animals, Chromatin/genetics, Chromatin/metabolism, DNA/chemistry, DNA Footprinting, DNA Methylation/genetics, DNA Replication Timing, Deoxyribonuclease I/metabolism, Human Genome, Histones/metabolism, Humans, Mice, Transgenic Mice, RNA-Binding Proteins/genetics, Genetic Transcription/genetics, Transposases/metabolism
ABSTRACT
We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.
Subject(s)
Genetic Transcription, Cell Line, Endothelial Cells/metabolism, Epithelial Cells/metabolism, Female, Gene Expression Profiling, Gynecomastia/genetics, Gynecomastia/metabolism, Humans, Male, Mesoderm/cytology, Mesoderm/metabolism, Neoplasms/genetics, Organ Specificity, RNA Sequence Analysis
ABSTRACT
MicroRNAs (miRNAs) play a critical role as posttranscriptional regulators of gene expression. The ENCODE Project profiled the expression of miRNAs in an extensive set of organs during a time-course of mouse embryonic development and captured the expression dynamics of 785 miRNAs. We found distinct organ-specific and developmental stage-specific miRNA expression clusters, with an overall pattern of increasing organ-specific expression as embryonic development proceeds. Comparative analysis of conserved miRNAs in mouse and human revealed stronger clustering of expression patterns by organ type rather than by species. An analysis of messenger RNA expression clusters compared with miRNA expression clusters identifies the potential role of specific miRNA expression clusters in suppressing the expression of mRNAs specific to other developmental programs in the organ in which these miRNAs are expressed during embryonic development. Our results provide the most comprehensive time-course of miRNA expression as part of an integrated ENCODE reference data set for mouse embryonic development.
Subject(s)
Embryonic Development/genetics, MicroRNAs/genetics, Animals, Female, Developmental Gene Expression Regulation, Mice, Pregnancy, Messenger RNA/genetics
ABSTRACT
Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.
Subject(s)
Drosophila melanogaster/genetics, Gene Expression Profiling, Transcriptome/genetics, Alternative Splicing/genetics, Animals, Drosophila melanogaster/anatomy & histology, Drosophila melanogaster/cytology, Female, Male, Molecular Sequence Annotation, Nerve Tissue/metabolism, Organ Specificity, Poly A/genetics, Polyadenylation, Genetic Promoter Regions/genetics, Long Noncoding RNA/genetics, Messenger RNA/genetics, Messenger RNA/metabolism, Sex Characteristics, Physiological Stress/genetics
ABSTRACT
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
Subject(s)
Caenorhabditis elegans/genetics, Drosophila melanogaster/genetics, Gene Expression Profiling, Transcriptome/genetics, Animals, Caenorhabditis elegans/embryology, Caenorhabditis elegans/growth & development, Chromatin/genetics, Cluster Analysis, Drosophila melanogaster/growth & development, Developmental Gene Expression Regulation/genetics, Histones/metabolism, Humans, Larva/genetics, Larva/growth & development, Genetic Models, Molecular Sequence Annotation, Genetic Promoter Regions/genetics, Pupa/genetics, Pupa/growth & development, Untranslated RNA/genetics, RNA Sequence Analysis
ABSTRACT
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
Subject(s)
DNA/genetics, Genetic Databases, Gene Components, Genomics, High-Throughput Nucleotide Sequencing, Metadata, Animals, Caenorhabditis elegans/genetics, Data Display, Datasets as Topic, Drosophila melanogaster/genetics, Forecasting, Human Genome, Humans, Mice/genetics, User-Computer Interface
ABSTRACT
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
Subject(s)
DNA/genetics, Encyclopedias as Topic, Human Genome/genetics, Molecular Sequence Annotation, Nucleic Acid Regulatory Sequences/genetics, Genetic Transcription/genetics, Transcriptome/genetics, Alleles, Cell Line, Intergenic DNA/genetics, Genetic Enhancer Elements, Exons/genetics, Gene Expression Profiling, Genes/genetics, Genomics, Humans, Polyadenylation/genetics, Protein Isoforms/genetics, RNA/biosynthesis, RNA/genetics, RNA Editing/genetics, RNA Splicing/genetics, Nucleic Acid Repetitive Sequences/genetics, RNA Sequence Analysis
ABSTRACT
Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.
Subject(s)
Drosophila melanogaster/growth & development, Drosophila melanogaster/genetics, Gene Expression Profiling, Developmental Gene Expression Regulation/genetics, Genetic Transcription/genetics, Alternative Splicing/genetics, Animals, Base Sequence, Drosophila Proteins/genetics, Drosophila melanogaster/embryology, Exons/genetics, Female, Insect Genes/genetics, Insect Genome/genetics, Male, MicroRNAs/genetics, Oligonucleotide Array Sequence Analysis, Protein Isoforms/genetics, RNA Editing/genetics, Messenger RNA/analysis, Messenger RNA/genetics, Small Untranslated RNA/analysis, Small Untranslated RNA/genetics, Sequence Analysis, Sex Characteristics
ABSTRACT
Although the similarities between humans and mice are typically highlighted, morphologically and genetically, there are many differences. To better understand these two species on a molecular level, we performed a comparison of the expression profiles of 15 tissues by deep RNA sequencing and examined the similarities and differences in the transcriptome for both protein-coding and -noncoding transcripts. Although commonalities are evident in the expression of tissue-specific genes between the two species, the expression for many sets of genes was found to be more similar in different tissues within the same species than between species. These findings were further corroborated by associated epigenetic histone mark analyses. We also find that many noncoding transcripts are expressed at a low level and are not detectable at appreciable levels across individuals. Moreover, the majority lack obvious sequence homologs between species, even when we restrict our attention to those which are most highly reproducible across biological replicates. Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms.
Subject(s)
Intergenic DNA/genetics, Gene Expression Profiling/methods, Organ Specificity/genetics, Proteins/genetics, Animals, Epigenomics/methods, Molecular Evolution, High-Throughput Nucleotide Sequencing, Humans, Inbred C57BL Mice, RNA Sequence Analysis, Species Specificity, Transcriptome/genetics
ABSTRACT
OBJECTIVE: To determine the effect of fentanyl on the induction dose of propofol and minimum infusion rate required to prevent movement in response to noxious stimulation (MIRNM) in dogs. STUDY DESIGN: Crossover experimental design. ANIMALS: Six healthy, adult intact male Beagle dogs, mean±standard deviation 12.6±0.4 kg. METHODS: Dogs were administered 0.9% saline (treatment P), fentanyl (5 µg kg⁻¹) (treatment PLDF) or fentanyl (10 µg kg⁻¹) (treatment PHDF) intravenously over 5 minutes. Five minutes later, anesthesia was induced with propofol (2 mg kg⁻¹, followed by 1 mg kg⁻¹ every 15 seconds to achieve intubation) and maintained for 90 minutes by constant rate infusions (CRIs) of propofol alone or with fentanyl: P, propofol (0.5 mg kg⁻¹ minute⁻¹); PLDF, propofol (0.35 mg kg⁻¹ minute⁻¹) and fentanyl (0.1 µg kg⁻¹ minute⁻¹); PHDF, propofol (0.3 mg kg⁻¹ minute⁻¹) and fentanyl (0.2 µg kg⁻¹ minute⁻¹). Propofol CRI was increased or decreased based on the response to stimulation (50 V, 50 Hz, 10 mA), with 20 minutes between adjustments. Data were analyzed using a mixed-model ANOVA and presented as mean±standard error. RESULTS: Propofol induction doses were 6.16±0.31, 3.67±0.21 and 3.33±0.42 mg kg⁻¹ for P, PLDF and PHDF, respectively. Doses for PLDF and PHDF were significantly decreased from P (p<0.05) but not different between treatments. Propofol MIRNM was 0.60±0.04, 0.29±0.02 and 0.22±0.02 mg kg⁻¹ minute⁻¹ for P, PLDF and PHDF, respectively. MIRNM in PLDF and PHDF was significantly decreased from P. MIRNM in PLDF and PHDF were not different, but their respective percent decreases of 51±3 and 63±2% differed (p=0.035). CONCLUSIONS AND CLINICAL RELEVANCE: Fentanyl, at the doses studied, caused statistically significant and clinically important decreases in the propofol induction dose and MIRNM.
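As a rough arithmetic check on the reported reductions, the Python sketch below computes percent decreases from the group means quoted in the abstract. The function name `percent_decrease` is illustrative only. The study's 51±3 and 63±2% figures were presumably averaged per dog, so a ratio of group means can only approximate them.

```python
def percent_decrease(baseline, treated):
    """Percent reduction of `treated` relative to `baseline`."""
    return 100.0 * (baseline - treated) / baseline


# Group means taken from the abstract (P = saline control).
mirnm = {"P": 0.60, "PLDF": 0.29, "PHDF": 0.22}      # mg/kg/min
induction = {"P": 6.16, "PLDF": 3.67, "PHDF": 3.33}  # mg/kg

for treatment in ("PLDF", "PHDF"):
    # Ratio of group means: an approximation, not the per-dog statistic.
    m = percent_decrease(mirnm["P"], mirnm[treatment])
    i = percent_decrease(induction["P"], induction[treatment])
    print(f"{treatment}: MIRNM -{m:.1f}%, induction dose -{i:.1f}%")
```

From the group means, MIRNM falls by about 51.7% (PLDF) and 63.3% (PHDF), in line with the reported per-treatment percent decreases.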
Subject(s)
Intravenous Anesthesia/veterinary, Intravenous Anesthetics, Fentanyl/pharmacology, Propofol, Intravenous Anesthesia/methods, Combined Anesthetics/administration & dosage, Combined Anesthetics/pharmacology, Intravenous Anesthetics/administration & dosage, Animals, Dogs, Intravenous Infusions/veterinary, Male, Movement/drug effects, Propofol/administration & dosage
ABSTRACT
Splicing remains an incompletely understood process. Recent findings suggest that chromatin structure participates in its regulation. Here, we analyze the RNA from subcellular fractions obtained through RNA-seq in the cell line K562. We show that in the human genome, splicing occurs predominantly during transcription. We introduce the coSI measure, based on RNA-seq reads mapping to exon junctions and borders, to assess the degree of splicing completion around internal exons. We show that, as expected, splicing is almost fully completed in cytosolic polyA+ RNA. In chromatin-associated RNA (which includes the RNA that is being transcribed), for 5.6% of exons, the removal of the surrounding introns is fully completed, compared with 0.3% of exons for which no intron-removal has occurred. The remaining exons exist as a mixture of spliced and fewer unspliced molecules, with a median coSI of 0.75. Thus, most RNAs undergo splicing while being transcribed: "co-transcriptional splicing." Consistent with co-transcriptional spliceosome assembly and splicing, we have found significant enrichment of spliceosomal snRNAs in chromatin-associated RNA compared with other cellular RNA fractions and other nonspliceosomal snRNAs. CoSI scores decrease along the gene, pointing to a "first transcribed, first spliced" rule, yet more downstream exons carry other characteristics, favoring rapid, co-transcriptional intron removal. Exons with low coSI values, that is, in the process of being spliced, are enriched with chromatin marks, consistent with a role for chromatin in splicing during transcription. For alternative exons and long noncoding RNAs, splicing tends to occur later, and the latter might remain unspliced in some cases.
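To make the idea behind a splicing-completion score concrete, here is a minimal Python sketch of a coSI-like ratio. It is a simplification under stated assumptions: the abstract says only that coSI is based on reads mapping to exon junctions and borders, so the function `cosi_like`, its arguments, and the unweighted ratio are hypothetical, and the published definition may weight the read classes differently.

```python
def cosi_like(junction_reads, border_reads):
    """Fraction of reads around an internal exon that show completed splicing.

    junction_reads: reads spanning an exon-exon junction (intron removed).
    border_reads:   reads spanning an exon-intron border (intron still present).
    Returns None when there is no coverage to score.
    """
    total = junction_reads + border_reads
    if total == 0:
        return None
    return junction_reads / total


# Made-up counts for illustration: 75 spliced vs 25 unspliced reads.
print(cosi_like(75, 25))
```

Under this simplified reading, a score near 1 means intron removal is essentially complete and a score near 0 means it has not started.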
Subject(s)
Human Genome, RNA Splicing, Long Noncoding RNA/metabolism, Genetic Transcription, Chromatin/metabolism, Cluster Analysis, Computational Biology/methods, Exons, High-Throughput Nucleotide Sequencing, Humans, RNA/genetics, RNA/metabolism, RNA Sequence Analysis, Spliceosomes/genetics, Spliceosomes/metabolism, Subcellular Fractions/chemistry
ABSTRACT
Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.
Subject(s)
Gene Expression Regulation, Genomics, Transcription Factors/metabolism, Genetic Transcription, Base Composition, Binding Sites/genetics, Cell Line, Chromatin/genetics, Chromatin/metabolism, Computational Biology/methods, Histones/genetics, Humans, Biological Models, Genetic Promoter Regions, Protein Binding/genetics, Transcription Initiation Site
ABSTRACT
The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally lower expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.
Subject(s)
Genetic Databases, Long Noncoding RNA/genetics, Alternative Splicing, Animals, Cell Nucleus/genetics, Cell Nucleus/metabolism, Cluster Analysis, Molecular Evolution, Exons, Gene Expression Profiling, Gene Expression Regulation, Histones/metabolism, Humans, Molecular Sequence Annotation, Open Reading Frames, Organ Specificity/genetics, Primates/genetics, Post-Transcriptional RNA Processing, RNA Splice Sites, Messenger RNA/genetics, Genetic Selection, Genetic Transcription
ABSTRACT
High-throughput sequencing of cDNA (RNA-seq) is a widely deployed transcriptome profiling and annotation technique, but questions about the performance of different protocols and platforms remain. We used a newly developed pool of 96 synthetic RNAs with various lengths, and GC content covering a 2^20 concentration range as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. We observed linearity between read density and RNA input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors. We use the control RNAs to directly measure reproducible protocol-dependent biases due to GC content and transcript length as well as stereotypic heterogeneity in coverage across transcripts correlated with position relative to RNA termini and priming sequence bias. These effects lead to biased quantification for short transcripts and individual exons, which is a serious problem for measurements of isoform abundances, but that can partially be corrected using appropriate models of bias. By using the control RNAs, we derive limits for the discovery and detection of rare transcripts in RNA-seq experiments. By using data collected as part of the model organism and human Encyclopedia of DNA Elements projects (ENCODE and modENCODE), we demonstrate that external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms.
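The standard-curve idea can be sketched as an ordinary least-squares fit in log-log space between known spike-in concentration and observed read density, which is then inverted to estimate the abundance of an unknown transcript. This is a minimal Python sketch, not the authors' procedure: the function names and the example numbers are made up for illustration, and the base-2 log transform is an assumption suggested by the 2^20 concentration range.

```python
import math


def fit_standard_curve(concentrations, read_densities):
    """Least-squares fit of log2(density) = slope * log2(concentration) + intercept."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(d) for d in read_densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(read_density, slope, intercept):
    """Invert the fitted curve to estimate input concentration from read density."""
    return 2 ** ((math.log2(read_density) - intercept) / slope)


# Hypothetical spike-in data (arbitrary units), perfectly linear by construction.
known_conc = [1, 2, 4, 8, 16]
observed = [3, 6, 12, 24, 48]
slope, intercept = fit_standard_curve(known_conc, observed)
print(slope, estimate_concentration(96, slope, intercept))
```

A slope close to 1 in log-log space corresponds to the linearity between read density and RNA input described in the abstract. Real data would also need the bias corrections the abstract mentions.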
Subject(s)
RNA/chemistry, RNA Sequence Analysis/standards, Animals, Bias, Gene Expression Profiling, Gene Library, High-Throughput Nucleotide Sequencing/standards, Humans, Quality Control, Reproducibility of Results, Sensitivity and Specificity
ABSTRACT
Drosophila melanogaster cell lines are important resources for cell biologists. Here, we catalog the expression of exons, genes, and unannotated transcriptional signals for 25 lines. Unannotated transcription is substantial (typically 19% of euchromatic signal). Conservatively, we identify 1405 novel transcribed regions; 684 of these appear to be new exons of neighboring, often distant, genes. Sixty-four percent of genes are expressed detectably in at least one line, but only 21% are detected in all lines. Each cell line expresses, on average, 5885 genes, including a common set of 3109. Expression levels vary over several orders of magnitude. Major signaling pathways are well represented: most differentiation pathways are "off" and survival/growth pathways "on." Roughly 50% of the genes expressed by each line are not part of the common set, and these show considerable individuality. Thirty-one percent are expressed at a higher level in at least one cell line than in any single developmental stage, suggesting that each line is enriched for genes characteristic of small sets of cells. Most remarkable is that imaginal disc-derived lines can generally be assigned, on the basis of expression, to small territories within developing discs. These mappings reveal unexpected stability of even fine-grained spatial determination. No two cell lines show identical transcription factor expression. We conclude that each line has retained features of an individual founder cell superimposed on a common "cell line" gene expression pattern.
Subject(s)
Drosophila melanogaster/genetics, Genetic Variation, Genetic Transcription, Animals, Cell Line, Cluster Analysis, Exons, Female, Gene Expression Profiling, Male, Molecular Sequence Data, Signal Transduction/genetics, Transcription Factors/genetics
ABSTRACT
MOTIVATION: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. RESULTS: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. AVAILABILITY AND IMPLEMENTATION: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
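The seed-search step can be illustrated with a toy Python sketch: build a suffix array of the reference, find the longest prefix of the read that occurs in it, then repeat from the first unmapped base. This is not STAR's implementation. The suffix array is built by naive sorting, the function names and the 20-base "genome" are made up, and the clustering and stitching stages are omitted.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Toy construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def maximal_mappable_prefix(read, text, sa):
    """Return (length, positions) of the longest prefix of `read` found in `text`."""
    lo, hi = 0, len(sa)
    best_len, best_pos = 0, []
    for length in range(1, len(read) + 1):
        prefix = read[:length]
        # Truncated suffixes stay sorted, so the matching interval can be narrowed.
        keys = [text[i:i + length] for i in sa[lo:hi]]
        new_lo = lo + bisect_left(keys, prefix)
        new_hi = lo + bisect_right(keys, prefix)
        if new_lo == new_hi:
            break  # the prefix cannot be extended any further
        lo, hi = new_lo, new_hi
        best_len, best_pos = length, sorted(sa[lo:hi])
    return best_len, best_pos


def seed_search(read, text, sa):
    """Sequentially map the read as a chain of maximal mappable seeds."""
    seeds, start = [], 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(read[start:], text, sa)
        if length == 0:
            start += 1  # toy handling: skip a base that maps nowhere
            continue
        seeds.append((start, length, positions))
        start += length
    return seeds


# Hypothetical reference: exon "ACGTAC", intron "GTTTTTAG", exon "CATGGA".
genome = "ACGTACGTTTTTAGCATGGA"
sa = build_suffix_array(genome)
print(seed_search("GTACCATG", genome, sa))
```

In this example the read splits into two seeds, one ending at reference position 6 and the next starting at position 14, so the gap between them corresponds to the 8-base intron.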
Subject(s)
Sequence Alignment/methods, Software, Algorithms, Cluster Analysis, Gene Expression Profiling, Human Genome, Humans, RNA Splicing, RNA Sequence Analysis/methods
ABSTRACT
Large-scale sequencing projects have revealed an unexpected complexity in the origins, structures and functions of mammalian transcripts. Many loci are known to produce overlapping coding and noncoding RNAs with capped 5' ends that vary in size. Methods to identify the 5' ends of transcripts will facilitate the discovery of new promoters and 5' ends derived from secondary capping events. Such methods often require high input amounts of RNA not obtainable from highly refined samples such as tissue microdissections and subcellular fractions. Therefore, we developed nano-cap analysis of gene expression (nanoCAGE), a method that captures the 5' ends of transcripts from as little as 10 ng of total RNA, and CAGEscan, a mate-pair adaptation of nanoCAGE that captures the transcript 5' ends linked to a downstream region. Both of these methods allow further annotation-agnostic studies of the complex human transcriptome.