The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity.

Reese, Fairlie; Williams, Brian; Balderrama-Gutierrez, Gabriela; Wyman, Dana; Çelik, Muhammed Hasan; Rebboah, Elisabeth; Rezaie, Narges; Trout, Diane; Razavi-Mohseni, Milad; Jiang, Yunzhe; Borsari, Beatrice; Morabito, Samuel; Liang, Heidi Yahan; McGill, Cassandra J; Rahmanian, Sorena; Sakr, Jasmine; Jiang, Shan; Zeng, Weihua; Carvalho, Klebea; Weimer, Annika K; Dionne, Louise A; McShane, Ariel; Bedi, Karan; Elhajjajy, Shaimae I; Upchurch, Sean; Jou, Jennifer; Youngworth, Ingrid; Gabdank, Idan; Sud, Paul; Jolanki, Otto; Strattan, J Seth; Kagda, Meenakshi S; Snyder, Michael P; Hitz, Ben C; Moore, Jill E; Weng, Zhiping; Bennett, David; Reinholdt, Laura; Ljungman, Mats; Beer, Michael A; Gerstein, Mark B; Pachter, Lior; Guigó, Roderic; Wold, Barbara J; Mortazavi, Ali.

bioRxiv ; 2023 May 16.

Article En | MEDLINE | ID: mdl-37292896

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
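The triplet idea in the abstract above can be illustrated with a small sketch. The abstract does not give the formula used to place a gene on the simplex, so the normalization below (each gene's counts of distinct start sites, exon junction chains, and end sites divided by their sum) is an assumption made for illustration only, not the authors' method. The function name `gene_simplex_coords` and the input layout are hypothetical.

```python
from collections import defaultdict


def gene_simplex_coords(transcripts):
    """Summarize each gene by proportions of distinct TSSs, ECs, and TESs.

    `transcripts` is an iterable of (gene, tss_id, ec_id, tes_id) tuples,
    one per transcript, i.e. a gene label plus the transcript's triplet.
    Returns {gene: (tss_frac, ec_frac, tes_frac)}; each tuple sums to 1.

    ASSUMPTION: plain count normalization, chosen for illustration only.
    """
    features = defaultdict(lambda: (set(), set(), set()))
    for gene, tss, ec, tes in transcripts:
        tss_set, ec_set, tes_set = features[gene]
        tss_set.add(tss)
        ec_set.add(ec)
        tes_set.add(tes)

    coords = {}
    for gene, sets in features.items():
        counts = [len(s) for s in sets]
        total = sum(counts)
        coords[gene] = tuple(c / total for c in counts)
    return coords


# Hypothetical gene with three start sites but a single exon junction
# chain and a single end site: its diversity is dominated by TSS choice.
example = [
    ("GENE_A", 1, 1, 1),
    ("GENE_A", 2, 1, 1),
    ("GENE_A", 3, 1, 1),
]
coords = gene_simplex_coords(example)
```

Under this assumed normalization, a gene whose coordinates sit near one corner of the simplex would correspond to a bias toward one of the three diversity mechanisms.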

Author Correction: Expanded encyclopaedias of DNA elements in the human and mouse genomes.

Moore, Jill E; Purcaro, Michael J; Pratt, Henry E; Epstein, Charles B; Shoresh, Noam; Adrian, Jessika; Kawli, Trupti; Davis, Carrie A; Dobin, Alexander; Kaul, Rajinder; Halow, Jessica; Van Nostrand, Eric L; Freese, Peter; Gorkin, David U; Shen, Yin; He, Yupeng; Mackiewicz, Mark; Pauli-Behn, Florencia; Williams, Brian A; Mortazavi, Ali; Keller, Cheryl A; Zhang, Xiao-Ou; Elhajjajy, Shaimae I; Huey, Jack; Dickel, Diane E; Snetkova, Valentina; Wei, Xintao; Wang, Xiaofeng; Rivera-Mulia, Juan Carlos; Rozowsky, Joel; Zhang, Jing; Chhetri, Surya B; Zhang, Jialing; Victorsen, Alec; White, Kevin P; Visel, Axel; Yeo, Gene W; Burge, Christopher B; Lécuyer, Eric; Gilbert, David M; Dekker, Job; Rinn, John; Mendenhall, Eric M; Ecker, Joseph R; Kellis, Manolis; Klein, Robert J; Noble, William S; Kundaje, Anshul; Guigó, Roderic; Farnham, Peggy J.

Nature ; 605(7909): E3, 2022 May.

Article En | MEDLINE | ID: mdl-35474001

Integration of high-resolution promoter profiling assays reveals novel, cell type-specific transcription start sites across 115 human cell and tissue types.

Moore, Jill E; Zhang, Xiao-Ou; Elhajjajy, Shaimae I; Fan, Kaili; Pratt, Henry E; Reese, Fairlie; Mortazavi, Ali; Weng, Zhiping.

Genome Res ; 32(2): 389-402, 2022 02.

Article En | MEDLINE | ID: mdl-34949670

Accurate transcription start site (TSS) annotations are essential for understanding transcriptional regulation and its role in human disease. Gene collections such as GENCODE contain annotations for tens of thousands of TSSs, but not all of these annotations are experimentally validated, nor do they contain information on cell type-specific usage. Therefore, we sought to generate a collection of experimentally validated TSSs by integrating RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression (RAMPAGE) data from 115 cell and tissue types, which resulted in a collection of approximately 50 thousand representative RAMPAGE peaks. These peaks are primarily proximal to GENCODE-annotated TSSs and are concordant with other transcription assays. Because RAMPAGE uses paired-end reads, we were then able to connect peaks to transcripts by analyzing the genomic positions of the 3' ends of read mates. Using this paired-end information, we classified the vast majority (37 thousand) of our RAMPAGE peaks as verified TSSs, updating TSS annotations for 20% of GENCODE genes. We also found that these updated TSS annotations are supported by epigenomic and other transcriptomic data sets. To show the utility of this RAMPAGE rPeak collection, we intersected it with the NHGRI/EBI genome-wide association study (GWAS) catalog and identified new candidate GWAS genes. Overall, our work shows the importance of integrating experimental data to further refine TSS annotations and provides a valuable resource for the biological community.
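The statement that peaks are "primarily proximal" to annotated TSSs implies a distance check between each peak and the nearest annotated TSS. A minimal sketch of such a check follows. The abstract does not state a distance cutoff, so the 500 bp default, the function names, and the single-chromosome, strand-agnostic coordinates are all assumptions for illustration and do not describe the authors' pipeline.

```python
from bisect import bisect_left


def nearest_tss_distance(peak_pos, sorted_tss):
    """Distance in bp from a peak position to the closest annotated TSS.

    `sorted_tss` must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as the peak (strand is ignored in this sketch).
    """
    i = bisect_left(sorted_tss, peak_pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(peak_pos - t) for t in candidates)


def is_proximal(peak_pos, sorted_tss, window=500):
    """True if the peak lies within `window` bp of an annotated TSS.

    ASSUMPTION: the 500 bp default is a placeholder, not a value from the paper.
    """
    return nearest_tss_distance(peak_pos, sorted_tss) <= window


# Hypothetical coordinates on one chromosome.
annotated_tss = [100, 1000, 5000]
peaks = [50, 1200, 3000]
flags = [is_proximal(p, annotated_tss) for p in peaks]
```

The binary search keeps each lookup logarithmic in the number of annotated TSSs, which matters when tens of thousands of peaks are compared against tens of thousands of annotations.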

Gene Expression Regulation , Genome-Wide Association Study , Humans , Promoter Regions, Genetic , Transcription Initiation Site

Expanded encyclopaedias of DNA elements in the human and mouse genomes.

Nature ; 583(7818): 699-710, 2020 07.

Article En | MEDLINE | ID: mdl-32728249

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

DNA/genetics , Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Registries , Regulatory Sequences, Nucleic Acid/genetics , Animals , Chromatin/genetics , Chromatin/metabolism , DNA/chemistry , DNA Footprinting , DNA Methylation/genetics , DNA Replication Timing , Deoxyribonuclease I/metabolism , Genome, Human , Histones/metabolism , Humans , Mice , Mice, Transgenic , RNA-Binding Proteins/genetics , Transcription, Genetic/genetics , Transposases/metabolism