Results 1 - 20 of 31
1.
Cell ; 186(7): 1493-1511.e40, 2023 03 30.
Article in English | MEDLINE | ID: mdl-37001506

ABSTRACT

Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.


Subject(s)
Epigenome , Quantitative Trait Loci , Genome-Wide Association Study , Genomics , Phenotype , Polymorphism, Single Nucleotide
2.
Cell ; 157(2): 382-394, 2014 Apr 10.
Article in English | MEDLINE | ID: mdl-24725405

ABSTRACT

Missense mutations in the p53 tumor suppressor inactivate its antiproliferative properties but can also promote metastasis through a gain-of-function activity. We show that sustained expression of mutant p53 is required to maintain the prometastatic phenotype of a murine model of pancreatic cancer, a highly metastatic disease that frequently displays p53 mutations. Transcriptional profiling and functional screening identified the platelet-derived growth factor receptor b (PDGFRb) as both necessary and sufficient to mediate these effects. Mutant p53 induced PDGFRb through a cell-autonomous mechanism involving inhibition of a p73/NF-Y complex that represses PDGFRb expression in p53-deficient, noninvasive cells. Blocking PDGFRb signaling by RNA interference or by small molecule inhibitors prevented pancreatic cancer cell invasion in vitro and metastasis formation in vivo. Finally, high PDGFRb expression correlates with poor disease-free survival in pancreatic, colon, and ovarian cancer patients, implicating PDGFRb as a prognostic marker and possible target for attenuating metastasis in p53 mutant tumors.


Subject(s)
Carcinoma, Pancreatic Ductal/metabolism , Neoplasm Metastasis , Pancreatic Neoplasms/metabolism , Receptor, Platelet-Derived Growth Factor beta/metabolism , Tumor Suppressor Protein p53/metabolism , Animals , Carcinoma, Pancreatic Ductal/pathology , Disease Models, Animal , Gene Expression Profiling , Humans , Mice , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology , Tumor Suppressor Protein p53/genetics
3.
Nature ; 583(7818): 699-710, 2020 07.
Article in English | MEDLINE | ID: mdl-32728249

ABSTRACT

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.


Subject(s)
DNA/genetics , Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Registries , Regulatory Sequences, Nucleic Acid/genetics , Animals , Chromatin/genetics , Chromatin/metabolism , DNA/chemistry , DNA Footprinting , DNA Methylation/genetics , DNA Replication Timing , Deoxyribonuclease I/metabolism , Genome, Human , Histones/metabolism , Humans , Mice , Mice, Transgenic , RNA-Binding Proteins/genetics , Transcription, Genetic/genetics , Transposases/metabolism
4.
Genome Res ; 30(7): 1047-1059, 2020 07.
Article in English | MEDLINE | ID: mdl-32759341

ABSTRACT

We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.


Subject(s)
Transcription, Genetic , Cell Line , Endothelial Cells/metabolism , Epithelial Cells/metabolism , Female , Gene Expression Profiling , Gynecomastia/genetics , Gynecomastia/metabolism , Humans , Male , Mesoderm/cytology , Mesoderm/metabolism , Neoplasms/genetics , Organ Specificity , Sequence Analysis, RNA
6.
Genome Res ; 29(11): 1900-1909, 2019 11.
Article in English | MEDLINE | ID: mdl-31645363

ABSTRACT

MicroRNAs (miRNAs) play a critical role as posttranscriptional regulators of gene expression. The ENCODE Project profiled the expression of miRNAs in an extensive set of organs during a time-course of mouse embryonic development and captured the expression dynamics of 785 miRNAs. We found distinct organ-specific and developmental stage-specific miRNA expression clusters, with an overall pattern of increasing organ-specific expression as embryonic development proceeds. Comparative analysis of conserved miRNAs in mouse and human revealed stronger clustering of expression patterns by organ type rather than by species. An analysis of messenger RNA expression clusters compared with miRNA expression clusters identifies the potential role of specific miRNA expression clusters in suppressing the expression of mRNAs specific to other developmental programs in the organ in which these miRNAs are expressed during embryonic development. Our results provide the most comprehensive time-course of miRNA expression as part of an integrated ENCODE reference data set for mouse embryonic development.


Subject(s)
Embryonic Development/genetics , MicroRNAs/genetics , Animals , Female , Gene Expression Regulation, Developmental , Mice , Pregnancy , RNA, Messenger/genetics
7.
Nature ; 512(7515): 393-9, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-24670639

ABSTRACT

Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.


Subject(s)
Drosophila melanogaster/genetics , Gene Expression Profiling , Transcriptome/genetics , Alternative Splicing/genetics , Animals , Drosophila melanogaster/anatomy & histology , Drosophila melanogaster/cytology , Female , Male , Molecular Sequence Annotation , Nerve Tissue/metabolism , Organ Specificity , Poly A/genetics , Polyadenylation , Promoter Regions, Genetic/genetics , RNA, Long Noncoding/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sex Characteristics , Stress, Physiological/genetics
8.
Nature ; 512(7515): 445-8, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164755

ABSTRACT

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.


Subject(s)
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Transcriptome/genetics , Animals , Caenorhabditis elegans/embryology , Caenorhabditis elegans/growth & development , Chromatin/genetics , Cluster Analysis , Drosophila melanogaster/growth & development , Gene Expression Regulation, Developmental/genetics , Histones/metabolism , Humans , Larva/genetics , Larva/growth & development , Models, Genetic , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Pupa/genetics , Pupa/growth & development , RNA, Untranslated/genetics , Sequence Analysis, RNA
9.
Nucleic Acids Res ; 46(D1): D794-D801, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126249

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.


Subject(s)
DNA/genetics , Databases, Genetic , Gene Components , Genomics , High-Throughput Nucleotide Sequencing , Metadata , Animals , Caenorhabditis elegans/genetics , Data Display , Datasets as Topic , Drosophila melanogaster/genetics , Forecasting , Genome, Human , Humans , Mice/genetics , User-Computer Interface
10.
Nature ; 489(7414): 101-8, 2012 Sep 06.
Article in English | MEDLINE | ID: mdl-22955620

ABSTRACT

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.


Subject(s)
DNA/genetics , Encyclopedias as Topic , Genome, Human/genetics , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid/genetics , Transcription, Genetic/genetics , Transcriptome/genetics , Alleles , Cell Line , DNA, Intergenic/genetics , Enhancer Elements, Genetic , Exons/genetics , Gene Expression Profiling , Genes/genetics , Genomics , Humans , Polyadenylation/genetics , Protein Isoforms/genetics , RNA/biosynthesis , RNA/genetics , RNA Editing/genetics , RNA Splicing/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA
11.
Nature ; 471(7339): 473-9, 2011 Mar 24.
Article in English | MEDLINE | ID: mdl-21179090

ABSTRACT

Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.


Subject(s)
Drosophila melanogaster/growth & development , Drosophila melanogaster/genetics , Gene Expression Profiling , Gene Expression Regulation, Developmental/genetics , Transcription, Genetic/genetics , Alternative Splicing/genetics , Animals , Base Sequence , Drosophila Proteins/genetics , Drosophila melanogaster/embryology , Exons/genetics , Female , Genes, Insect/genetics , Genome, Insect/genetics , Male , MicroRNAs/genetics , Oligonucleotide Array Sequence Analysis , Protein Isoforms/genetics , RNA Editing/genetics , RNA, Messenger/analysis , RNA, Messenger/genetics , RNA, Small Untranslated/analysis , RNA, Small Untranslated/genetics , Sequence Analysis , Sex Characteristics
12.
Proc Natl Acad Sci U S A ; 111(48): 17224-9, 2014 Dec 02.
Article in English | MEDLINE | ID: mdl-25413365

ABSTRACT

Although the similarities between humans and mice are typically highlighted, morphologically and genetically, there are many differences. To better understand these two species on a molecular level, we performed a comparison of the expression profiles of 15 tissues by deep RNA sequencing and examined the similarities and differences in the transcriptome for both protein-coding and -noncoding transcripts. Although commonalities are evident in the expression of tissue-specific genes between the two species, the expression for many sets of genes was found to be more similar in different tissues within the same species than between species. These findings were further corroborated by associated epigenetic histone mark analyses. We also find that many noncoding transcripts are expressed at a low level and are not detectable at appreciable levels across individuals. Moreover, the majority lack obvious sequence homologs between species, even when we restrict our attention to those which are most highly reproducible across biological replicates. Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms.


Subject(s)
DNA, Intergenic/genetics , Gene Expression Profiling/methods , Organ Specificity/genetics , Proteins/genetics , Animals , Epigenomics/methods , Evolution, Molecular , High-Throughput Nucleotide Sequencing , Humans , Mice, Inbred C57BL , Sequence Analysis, RNA , Species Specificity , Transcriptome/genetics
13.
Vet Anaesth Analg ; 44(4): 727-737, 2017 Jul.
Article in English | MEDLINE | ID: mdl-28624496

ABSTRACT

OBJECTIVE: To determine the effect of fentanyl on the induction dose of propofol and minimum infusion rate required to prevent movement in response to noxious stimulation (MIRNM) in dogs. STUDY DESIGN: Crossover experimental design. ANIMALS: Six healthy, adult intact male Beagle dogs, mean±standard deviation 12.6±0.4 kg. METHODS: Dogs were administered 0.9% saline (treatment P), fentanyl (5 µg kg⁻¹) (treatment PLDF) or fentanyl (10 µg kg⁻¹) (treatment PHDF) intravenously over 5 minutes. Five minutes later, anesthesia was induced with propofol (2 mg kg⁻¹, followed by 1 mg kg⁻¹ every 15 seconds to achieve intubation) and maintained for 90 minutes by constant rate infusions (CRIs) of propofol alone or with fentanyl: P, propofol (0.5 mg kg⁻¹ minute⁻¹); PLDF, propofol (0.35 mg kg⁻¹ minute⁻¹) and fentanyl (0.1 µg kg⁻¹ minute⁻¹); PHDF, propofol (0.3 mg kg⁻¹ minute⁻¹) and fentanyl (0.2 µg kg⁻¹ minute⁻¹). Propofol CRI was increased or decreased based on the response to stimulation (50 V, 50 Hz, 10 mA), with 20 minutes between adjustments. Data were analyzed using a mixed-model ANOVA and presented as mean±standard error. RESULTS: Propofol induction doses were 6.16±0.31, 3.67±0.21 and 3.33±0.42 mg kg⁻¹ for P, PLDF and PHDF, respectively. Doses for PLDF and PHDF were significantly decreased from P (p<0.05) but not different between treatments. Propofol MIRNM was 0.60±0.04, 0.29±0.02 and 0.22±0.02 mg kg⁻¹ minute⁻¹ for P, PLDF and PHDF, respectively. MIRNM in PLDF and PHDF was significantly decreased from P. MIRNM in PLDF and PHDF were not different, but their respective percent decreases of 51±3 and 63±2% differed (p=0.035). CONCLUSIONS AND CLINICAL RELEVANCE: Fentanyl, at the doses studied, caused statistically significant and clinically important decreases in the propofol induction dose and MIRNM.


Subject(s)
Anesthesia, Intravenous/veterinary , Anesthetics, Intravenous , Fentanyl/pharmacology , Propofol , Anesthesia, Intravenous/methods , Anesthetics, Combined/administration & dosage , Anesthetics, Combined/pharmacology , Anesthetics, Intravenous/administration & dosage , Animals , Dogs , Infusions, Intravenous/veterinary , Male , Movement/drug effects , Propofol/administration & dosage
14.
Genome Res ; 22(9): 1616-25, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955974

ABSTRACT

Splicing remains an incompletely understood process. Recent findings suggest that chromatin structure participates in its regulation. Here, we analyze the RNA from subcellular fractions obtained through RNA-seq in the cell line K562. We show that in the human genome, splicing occurs predominantly during transcription. We introduce the coSI measure, based on RNA-seq reads mapping to exon junctions and borders, to assess the degree of splicing completion around internal exons. We show that, as expected, splicing is almost fully completed in cytosolic polyA+ RNA. In chromatin-associated RNA (which includes the RNA that is being transcribed), for 5.6% of exons, the removal of the surrounding introns is fully completed, compared with 0.3% of exons for which no intron-removal has occurred. The remaining exons exist as a mixture of spliced and fewer unspliced molecules, with a median coSI of 0.75. Thus, most RNAs undergo splicing while being transcribed: "co-transcriptional splicing." Consistent with co-transcriptional spliceosome assembly and splicing, we have found significant enrichment of spliceosomal snRNAs in chromatin-associated RNA compared with other cellular RNA fractions and other nonspliceosomal snRNAs. CoSI scores decrease along the gene, pointing to a "first transcribed, first spliced" rule, yet more downstream exons carry other characteristics, favoring rapid, co-transcriptional intron removal. Exons with low coSI values, that is, in the process of being spliced, are enriched with chromatin marks, consistent with a role for chromatin in splicing during transcription. For alternative exons and long noncoding RNAs, splicing tends to occur later, and the latter might remain unspliced in some cases.


Subject(s)
Genome, Human , RNA Splicing , RNA, Long Noncoding/metabolism , Transcription, Genetic , Chromatin/metabolism , Cluster Analysis , Computational Biology/methods , Exons , High-Throughput Nucleotide Sequencing , Humans , RNA/genetics , RNA/metabolism , Sequence Analysis, RNA , Spliceosomes/genetics , Spliceosomes/metabolism , Subcellular Fractions/chemistry
15.
Genome Res ; 22(9): 1658-67, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955978

ABSTRACT

Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.


Subject(s)
Gene Expression Regulation , Genomics , Transcription Factors/metabolism , Transcription, Genetic , Base Composition , Binding Sites/genetics , Cell Line , Chromatin/genetics , Chromatin/metabolism , Computational Biology/methods , Histones/genetics , Humans , Models, Biological , Promoter Regions, Genetic , Protein Binding/genetics , Transcription Initiation Site
16.
Genome Res ; 22(9): 1775-89, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955988

ABSTRACT

The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally lower expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.


Subject(s)
Databases, Genetic , RNA, Long Noncoding/genetics , Alternative Splicing , Animals , Cell Nucleus/genetics , Cell Nucleus/metabolism , Cluster Analysis , Evolution, Molecular , Exons , Gene Expression Profiling , Gene Expression Regulation , Histones/metabolism , Humans , Molecular Sequence Annotation , Open Reading Frames , Organ Specificity/genetics , Primates/genetics , RNA Processing, Post-Transcriptional , RNA Splice Sites , RNA, Messenger/genetics , Selection, Genetic , Transcription, Genetic
17.
Genome Res ; 21(9): 1543-51, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21816910

ABSTRACT

High-throughput sequencing of cDNA (RNA-seq) is a widely deployed transcriptome profiling and annotation technique, but questions about the performance of different protocols and platforms remain. We used a newly developed pool of 96 synthetic RNAs with various lengths and GC content, covering a 2^20 concentration range, as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. We observed linearity between read density and RNA input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors. We use the control RNAs to directly measure reproducible protocol-dependent biases due to GC content and transcript length as well as stereotypic heterogeneity in coverage across transcripts correlated with position relative to RNA termini and priming sequence bias. These effects lead to biased quantification for short transcripts and individual exons, which is a serious problem for measurements of isoform abundances, but that can partially be corrected using appropriate models of bias. By using the control RNAs, we derive limits for the discovery and detection of rare transcripts in RNA-seq experiments. By using data collected as part of the model organism and human Encyclopedia of DNA Elements projects (ENCODE and modENCODE), we demonstrate that external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms.


Subject(s)
RNA/chemistry , Sequence Analysis, RNA/standards , Animals , Bias , Gene Expression Profiling , Gene Library , High-Throughput Nucleotide Sequencing/standards , Humans , Quality Control , Reproducibility of Results , Sensitivity and Specificity
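Illustrative note on record 17: the abstract above reports a linear relationship between read density and spike-in RNA input, used to derive standard curves for transcript abundance. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes a least-squares fit in log2-log2 space; the function names, the example numbers, and the choice of log2 are all hypothetical.

```python
import math


def fit_standard_curve(concentrations, read_densities):
    """Least-squares line through (log2 concentration, log2 read density).

    Returns (slope, intercept). Inputs are hypothetical spike-in values;
    all must be strictly positive.
    """
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(r) for r in read_densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(read_density, slope, intercept):
    """Invert the fitted curve to estimate input concentration."""
    return 2 ** ((math.log2(read_density) - intercept) / slope)


# Made-up spike-ins spanning a 2^20 range, with density exactly 8x the input.
spike_conc = [1, 4, 16, 1024, 2 ** 20]
spike_density = [8 * c for c in spike_conc]
slope, intercept = fit_standard_curve(spike_conc, spike_density)
unknown = estimate_concentration(8 * 64, slope, intercept)
```

With real data the points scatter around the line, and the abstract notes that this scatter exceeds what pure Poisson sampling predicts, so a fitted curve of this kind gives an estimate rather than an exact abundance.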
18.
Genome Res ; 21(2): 301-14, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21177962

ABSTRACT

Drosophila melanogaster cell lines are important resources for cell biologists. Here, we catalog the expression of exons, genes, and unannotated transcriptional signals for 25 lines. Unannotated transcription is substantial (typically 19% of euchromatic signal). Conservatively, we identify 1405 novel transcribed regions; 684 of these appear to be new exons of neighboring, often distant, genes. Sixty-four percent of genes are expressed detectably in at least one line, but only 21% are detected in all lines. Each cell line expresses, on average, 5885 genes, including a common set of 3109. Expression levels vary over several orders of magnitude. Major signaling pathways are well represented: most differentiation pathways are "off" and survival/growth pathways "on." Roughly 50% of the genes expressed by each line are not part of the common set, and these show considerable individuality. Thirty-one percent are expressed at a higher level in at least one cell line than in any single developmental stage, suggesting that each line is enriched for genes characteristic of small sets of cells. Most remarkable is that imaginal disc-derived lines can generally be assigned, on the basis of expression, to small territories within developing discs. These mappings reveal unexpected stability of even fine-grained spatial determination. No two cell lines show identical transcription factor expression. We conclude that each line has retained features of an individual founder cell superimposed on a common "cell line" gene expression pattern.


Subject(s)
Drosophila melanogaster/genetics , Genetic Variation , Transcription, Genetic , Animals , Cell Line , Cluster Analysis , Exons , Female , Gene Expression Profiling , Male , Molecular Sequence Data , Signal Transduction/genetics , Transcription Factors/genetics
19.
Bioinformatics ; 29(1): 15-21, 2013 Jan 01.
Article in English | MEDLINE | ID: mdl-23104886

ABSTRACT

MOTIVATION: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. RESULTS: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. AVAILABILITY AND IMPLEMENTATION: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.


Subject(s)
Sequence Alignment/methods , Software , Algorithms , Cluster Analysis , Gene Expression Profiling , Genome, Human , Humans , RNA Splicing , Sequence Analysis, RNA/methods
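Illustrative note on record 19: the abstract above describes a sequential maximum mappable seed search in uncompressed suffix arrays. The toy sketch below shows the general idea only and is not STAR's implementation. It assumes exact matching on the forward strand, builds the suffix array naively, and omits the clustering and stitching steps; all function names and the example sequences are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _char_at(genome, sa, i, k):
    """k-th character of the i-th sorted suffix, or '' if the suffix is shorter."""
    pos = sa[i] + k
    return genome[pos] if pos < len(genome) else ""


def maximal_mappable_prefix(read, genome, sa):
    """Longest prefix of `read` found in `genome`: (length, start positions)."""
    lo, hi = 0, len(sa)  # suffixes in sa[lo:hi] all begin with read[:k]
    k = 0
    while k < len(read):
        c = read[k]
        a, b = lo, hi  # lower bound: first suffix whose k-th char is >= c
        while a < b:
            m = (a + b) // 2
            if _char_at(genome, sa, m, k) < c:
                a = m + 1
            else:
                b = m
        new_lo = a
        b = hi  # upper bound: first suffix whose k-th char is > c
        while a < b:
            m = (a + b) // 2
            if _char_at(genome, sa, m, k) <= c:
                a = m + 1
            else:
                b = m
        new_hi = a
        if new_lo == new_hi:  # no suffix extends the match by c
            break
        lo, hi = new_lo, new_hi
        k += 1
    positions = sorted(sa[i] for i in range(lo, hi)) if k > 0 else []
    return k, positions


def sequential_seed_search(read, genome):
    """Repeat the prefix search on the unmapped remainder of the read."""
    sa = build_suffix_array(genome)
    seeds, start = [], 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(read[start:], genome, sa)
        if length == 0:  # base absent from the genome: skip it
            start += 1
            continue
        seeds.append((start, length, positions))
        start += length
    return seeds


# Made-up genome: exon, intron, exon. The read spans the exon-exon junction.
genome = "TTACGGATCC" + "GTAAGTCCCAG" + "CATTGCAAGT"
read = "GGATCC" + "CATTGC"
seeds = sequential_seed_search(read, genome)
```

In this example the first seed stops where the read leaves the first exon and the second seed begins in the downstream exon, which is how a splice junction can surface without prior annotation.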
20.
Nat Methods ; 7(7): 528-34, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20543846

ABSTRACT

Large-scale sequencing projects have revealed an unexpected complexity in the origins, structures and functions of mammalian transcripts. Many loci are known to produce overlapping coding and noncoding RNAs with capped 5' ends that vary in size. Methods to identify the 5' ends of transcripts will facilitate the discovery of new promoters and 5' ends derived from secondary capping events. Such methods often require high input amounts of RNA not obtainable from highly refined samples such as tissue microdissections and subcellular fractions. Therefore, we developed nano-cap analysis of gene expression (nanoCAGE), a method that captures the 5' ends of transcripts from as little as 10 ng of total RNA, and CAGEscan, a mate-pair adaptation of nanoCAGE that captures the transcript 5' ends linked to a downstream region. Both of these methods allow further annotation-agnostic studies of the complex human transcriptome.


Subject(s)
Gene Expression Profiling , Gene Expression Regulation/physiology , Nanotechnology/methods , Promoter Regions, Genetic/physiology , RNA/metabolism , Genome, Human , Humans , RNA/genetics