ABSTRACT
CAGE (cap analysis gene expression) and RNA-seq are two major technologies used to identify transcript abundances as well as structures. They measure expression by sequencing from either the 5' end of capped molecules (CAGE) or tags randomly distributed along the length of a transcript (RNA-seq). Library protocols for clonally amplified, second-generation sequencing platforms (Illumina, SOLiD, 454 Life Sciences [Roche], Ion Torrent) typically employ PCR preamplification prior to clonal amplification, while third-generation, single-molecule sequencers can sequence unamplified libraries. Although these transcriptome profiling platforms have been demonstrated to be individually reproducible, no systematic comparison has been carried out between them. Here we compare CAGE, using both second- and third-generation sequencers, and RNA-seq, using a second-generation sequencer, based on a panel of RNA mixtures from two human cell lines to examine power in the discrimination of biological states, detection of differentially expressed genes, linearity of measurements, and quantification reproducibility. We found that the quantified levels of gene expression are largely comparable across platforms and conclude that CAGE and RNA-seq are complementary technologies that can be used to improve incomplete gene models. We also found systematic bias in the second- and third-generation platforms, which is likely due to steps such as linker ligation, cleavage by restriction enzymes, and PCR amplification. This study provides a perspective on the performance of these platforms, which will be a baseline in the design of further experiments to tackle complex transcriptomes uncovered in a wide range of cell types.
Subjects
High-Throughput Nucleotide Sequencing/methods; RNA/genetics; Transcriptome/genetics; Gene Expression Profiling; Humans; Sequence Analysis, RNA/methods
ABSTRACT
Non-biting midges (Chironomidae) are known to inhabit a wide range of environments, and certain species can tolerate extreme conditions in which other insects cannot survive. In particular, the sleeping chironomid Polypedilum vanderplanki is known for the remarkable ability of its larvae to withstand almost complete desiccation by entering a state called anhydrobiosis. Chromosome numbers in chironomids are higher than in other dipterans and this extra genomic resource might facilitate rapid adaptation to novel environments. We used improved sequencing strategies to assemble a chromosome-level genome sequence for P. vanderplanki for deep comparative analysis of the genomic locations of genes associated with desiccation tolerance. Using whole genome-based cross-species and intra-species analysis, we provide evidence for the unique functional specialization of Chromosome 4 through extensive acquisition of novel genes. In contrast to other insect genomes, in the sleeping chironomid a uniquely high degree of subfunctionalization in paralogous anhydrobiosis genes occurs in this chromosome, as well as pseudogenization in a highly duplicated gene family. Our findings suggest that Chromosome 4 in Polypedilum is a site of high genetic turnover, allowing it to act as a 'sandbox' for evolutionary experiments, thus facilitating the rapid adaptation of midges to harsh environments.
ABSTRACT
Within the scope of the FANTOM6 consortium, we perform a large-scale knockdown of 200 long non-coding RNAs (lncRNAs) in human induced pluripotent stem cells (iPSCs) and systematically characterize their roles in self-renewal and pluripotency. We find 36 lncRNAs (18%) exhibiting cell growth inhibition. From the knockdown of 123 lncRNAs with transcriptome profiling, 36 lncRNAs (29.3%) show molecular phenotypes. Integrating the molecular phenotypes with chromatin-interaction assays further reveals cis- and trans-interacting partners as potential primary targets. Additionally, cell-type enrichment analysis identifies lncRNAs associated with pluripotency, while the knockdown of LINC02595, CATG00000090305.1, and RP11-148B6.2 modulates colony formation of iPSCs. We compare our results with previously published fibroblast phenotyping data and find that 2.9% of the lncRNAs exhibit a consistent cell growth phenotype, whereas we observe 58.3% agreement in molecular phenotypes. This highlights that molecular phenotyping is more comprehensive in revealing affected pathways.
Subjects
Induced Pluripotent Stem Cells; RNA, Long Noncoding; Humans; RNA, Long Noncoding/genetics; RNA, Long Noncoding/metabolism; Induced Pluripotent Stem Cells/metabolism; Oligonucleotides, Antisense; Gene Expression Profiling/methods; Embryonic Stem Cells/metabolism
ABSTRACT
Cap Analysis of Gene Expression (CAGE) is a powerful method to identify Transcription Start Sites (TSSs) of capped RNAs while simultaneously measuring transcript expression levels. CAGE allows mapping at single-nucleotide resolution at all active promoters and enhancers. Large CAGE datasets have been produced over the years by individual laboratories and consortia, including the Encyclopedia of DNA Elements (ENCODE) and Functional Annotation of the Mammalian Genome (FANTOM) consortia. These datasets constitute an open resource for TSS annotation and gene expression analysis. Here, we provide an experimental protocol for the most recent CAGE method, Low Quantity (LQ) single strand (ss) CAGE ("LQ-ssCAGE"), which enables cost-effective profiling of low-quantity RNA samples. LQ-ssCAGE is especially useful for samples derived from cells cultured in small volumes, from cellular compartments (such as nuclear RNAs), or from developmental stages. We demonstrate the reproducibility and effectiveness of the method by constructing 240 LQ-ssCAGE libraries from 50 ng of RNA extracted from THP-1 cells and discover lowly expressed novel enhancer- and promoter-derived lncRNAs.
Subjects
Computational Biology/methods; Enhancer Elements, Genetic; Promoter Regions, Genetic; RNA Caps; Transcription Initiation Site; Databases, Genetic; Gene Expression Regulation; Gene Library; High-Throughput Nucleotide Sequencing/methods; Molecular Sequence Annotation; Regulatory Sequences, Nucleic Acid; Reproducibility of Results; Workflow
ABSTRACT
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
Subjects
Microsatellite Repeats; Neural Networks, Computer; Neurodegenerative Diseases/genetics; Transcription Initiation Site; Transcription Initiation, Genetic; A549 Cells; Animals; Base Sequence; Computational Biology/methods; Deep Learning; Enhancer Elements, Genetic; Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Mice; Neurodegenerative Diseases/diagnosis; Neurodegenerative Diseases/metabolism; Polymorphism, Genetic; Promoter Regions, Genetic
ABSTRACT
Cap analysis of gene expression (CAGE) is an approach to identify and monitor the activity (transcription initiation frequency) of transcription start sites (TSSs) at single base-pair resolution across the genome. It has been effectively used to identify active promoter and enhancer regions in cancer cells, with potential utility in identifying key factors for immunotherapy. Here, we give an overview of a series of CAGE protocols and describe in detail the latest protocol, based on the Illumina sequencing platform; both experimental steps (see Subheadings 3.1-3.11) and computational processing steps (see Subheadings 3.12-3.20) are described.
Subjects
Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Transcription Initiation Site; Transcriptional Activation; Animals; Gene Expression; Humans; Mice; Promoter Regions, Genetic
ABSTRACT
In the FANTOM5 project, transcription initiation events across the human and mouse genomes were mapped at single base-pair resolution and their frequencies were monitored by CAGE (Cap Analysis of Gene Expression) coupled with single-molecule sequencing. Approximately three thousand samples, consisting of a variety of primary cells, tissues, cell lines, and time series samples during cell activation and development, were subjected to a uniform pipeline of CAGE data production. The analysis pipeline started by measuring RNA extracts to assess their quality, and continued to CAGE library production by using a robotic or a manual workflow, single-molecule sequencing, and computational processing to generate frequencies of transcription initiation. The resulting data represent the consequence of transcriptional regulation in each analyzed state of mammalian cells. Non-overlapping peaks over the CAGE profiles, approximately 200,000 and 150,000 peaks for the human and mouse genomes, respectively, were identified and annotated to provide the precise locations of known promoters as well as novel ones, and to quantify their activities.
Subjects
Gene Expression Profiling; Genome; Animals; Gene Expression Regulation; Humans; Mice; Promoter Regions, Genetic; Species Specificity
ABSTRACT
Cap analysis of gene expression (CAGE) provides accurate high-throughput measurement of RNA expression. Through large-scale analysis of the 5' ends of transcripts, CAGE enables not only determination of transcription start sites but also prediction of promoter regions. Here we provide a protocol for the construction of no-amplification non-tagging CAGE libraries for Illumina next-generation sequencers (nAnT-iCAGE). We have excluded the commonly used PCR amplification and restriction enzyme cleavage steps to eliminate potential biases. As a result, we achieved a simpler, less biased preparation process.