Systematic assessment of long-read RNA-seq methods for transcript identification and quantification.

Pardo-Palacios, Francisco J; Wang, Dingjie; Reese, Fairlie; Diekhans, Mark; Carbonell-Sala, Sílvia; Williams, Brian; Loveland, Jane E; De María, Maite; Adams, Matthew S; Balderrama-Gutierrez, Gabriela; Behera, Amit K; Gonzalez Martinez, Jose M; Hunt, Toby; Lagarde, Julien; Liang, Cindy E; Li, Haoran; Meade, Marcus Jerryd; Moraga Amador, David A; Prjibelski, Andrey D; Birol, Inanc; Bostan, Hamed; Brooks, Ashley M; Çelik, Muhammed Hasan; Chen, Ying; Du, Mei R M; Felton, Colette; Göke, Jonathan; Hafezqorani, Saber; Herwig, Ralf; Kawaji, Hideya; Lee, Joseph; Li, Jian-Liang; Lienhard, Matthias; Mikheenko, Alla; Mulligan, Dennis; Nip, Ka Ming; Pertea, Mihaela; Ritchie, Matthew E; Sim, Andre D; Tang, Alison D; Wan, Yuk Kei; Wang, Changqing; Wong, Brandon Y; Yang, Chen; Barnes, If; Berry, Andrew E; Capella-Gutierrez, Salvador; Cousineau, Alyssa; Dhillon, Namrita; Fernandez-Gonzalez, Jose M.

Nat Methods ; 2024 Jun 07.

Article En | MEDLINE | ID: mdl-38849569

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

Identifying dysregulated regions in amyotrophic lateral sclerosis through chromatin accessibility outliers.

Çelik, Muhammed Hasan; Gagneur, Julien; Lim, Ryan G; Wu, Jie; Thompson, Leslie M; Xie, Xiaohui.

HGG Adv ; : 100318, 2024 Jun 13.

Article En | MEDLINE | ID: mdl-38872308

The high heritability of ALS contrasts with its low molecular diagnosis rate after genetic testing, pointing to potential undiscovered genetic factors. To aid the exploration of these factors, we introduced EpiOut, an algorithm to identify chromatin accessibility outliers, that is, regions whose accessibility diverges from the population baseline in one or a few samples. Annotation of accessible regions with histone ChIP-seq and Hi-C indicates that outliers are concentrated in functional loci, especially among promoters interacting with active enhancers. Outliers replicate robustly across omics levels, and chromatin accessibility outliers are reliable predictors of gene expression outliers and aberrant protein levels. When promoter accessibility does not align with gene expression, our results indicate that molecular aberrations are more likely to be linked to post-transcriptional than to transcriptional regulation. Our findings demonstrate that the outlier detection paradigm can uncover dysregulated regions in rare diseases. EpiOut is available at github.com/uci-cbcl/EpiOut.
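The outlier idea in this abstract (a region whose accessibility in one sample departs from the cohort baseline) can be illustrated with a minimal sketch. This is not the EpiOut algorithm: the leave-one-out z-score, the function name `accessibility_outliers`, the input layout, and the cutoff of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def accessibility_outliers(counts, z_cutoff=3.0):
    """Flag (region, sample) pairs that deviate from the rest of the cohort.

    counts: dict mapping a region name to a list of normalized accessibility
    values, one per sample (hypothetical input layout).
    Returns a list of (region, sample_index, z_score) tuples.
    """
    outliers = []
    for region, values in counts.items():
        if len(values) < 3:
            continue  # need at least two other samples to estimate a spread
        for i, value in enumerate(values):
            # Baseline from all other samples, so an extreme value
            # does not inflate its own mean and standard deviation.
            others = values[:i] + values[i + 1:]
            sd = stdev(others)
            if sd == 0:
                continue  # no variation in the baseline; z-score undefined
            z = (value - mean(others)) / sd
            if abs(z) >= z_cutoff:
                outliers.append((region, i, z))
    return outliers
```

Calling the function on a small table in which one sample has a much higher value in one region returns only that region and sample; regions with ordinary sample-to-sample variation return nothing.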

The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity.

Reese, Fairlie; Williams, Brian; Balderrama-Gutierrez, Gabriela; Wyman, Dana; Çelik, Muhammed Hasan; Rebboah, Elisabeth; Rezaie, Narges; Trout, Diane; Razavi-Mohseni, Milad; Jiang, Yunzhe; Borsari, Beatrice; Morabito, Samuel; Liang, Heidi Yahan; McGill, Cassandra J; Rahmanian, Sorena; Sakr, Jasmine; Jiang, Shan; Zeng, Weihua; Carvalho, Klebea; Weimer, Annika K; Dionne, Louise A; McShane, Ariel; Bedi, Karan; Elhajjajy, Shaimae I; Upchurch, Sean; Jou, Jennifer; Youngworth, Ingrid; Gabdank, Idan; Sud, Paul; Jolanki, Otto; Strattan, J Seth; Kagda, Meenakshi S; Snyder, Michael P; Hitz, Ben C; Moore, Jill E; Weng, Zhiping; Bennett, David; Reinholdt, Laura; Ljungman, Mats; Beer, Michael A; Gerstein, Mark B; Pachter, Lior; Guigó, Roderic; Wold, Barbara J; Mortazavi, Ali.

bioRxiv ; 2023 May 16.

Article En | MEDLINE | ID: mdl-37292896

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
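The triplet and simplex representation described above can be sketched as follows. This is an illustration under stated assumptions, not the framework from the paper: each transcript is given as a (gene, start site, exon junction chain, end site) record, and a gene's simplex coordinates are taken to be the plain proportions of its distinct start sites, junction chains, and end sites. The published normalization may differ, and the function name `gene_simplex` is hypothetical.

```python
from collections import defaultdict


def gene_simplex(transcripts):
    """Map each gene to (tss_share, chain_share, tes_share), summing to 1.

    transcripts: iterable of (gene, tss_id, junction_chain_id, tes_id)
    tuples, one per transcript (hypothetical input layout).
    """
    features = defaultdict(lambda: (set(), set(), set()))
    for gene, tss, chain, tes in transcripts:
        tss_set, chain_set, tes_set = features[gene]
        tss_set.add(tss)
        chain_set.add(chain)
        tes_set.add(tes)

    coords = {}
    for gene, sets in features.items():
        # Count distinct features of each type, then normalize (assumed
        # normalization: simple proportions of the three counts).
        counts = [len(s) for s in sets]
        total = sum(counts)
        coords[gene] = tuple(c / total for c in counts)
    return coords
```

Under this simplification, a gene with three start sites, two junction chains, and one end site lands nearest the start-site corner of the simplex, which is the kind of bias toward one diversity mechanism the abstract describes.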

MTSplice predicts effects of genetic variants on tissue-specific splicing.

Cheng, Jun; Çelik, Muhammed Hasan; Kundaje, Anshul; Gagneur, Julien.

Genome Biol ; 22(1): 94, 2021 Mar 31.

Article En | MEDLINE | ID: mdl-33789710

We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.

Alternative Splicing; Computational Biology/methods; Genetic Variation; Software; Autistic Disorder/genetics; Brain/metabolism; Computational Biology/standards; Exons; Gene Expression Profiling; Gene Expression Regulation; Humans; Introns; Organ Specificity

Publisher Correction: MTSplice predicts effects of genetic variants on tissue-specific splicing.

Cheng, Jun; Çelik, Muhammed Hasan; Kundaje, Anshul; Gagneur, Julien.

Genome Biol ; 22(1): 107, 2021 Apr 15.

Article En | MEDLINE | ID: mdl-33858505

Quantification of Proteins and Histone Marks in Drosophila Embryos Reveals Stoichiometric Relationships Impacting Chromatin Regulation.

Bonnet, Jacques; Lindeboom, Rik G H; Pokrovsky, Daniil; Stricker, Georg; Çelik, Muhammed Hasan; Rupp, Ralph A W; Gagneur, Julien; Vermeulen, Michiel; Imhof, Axel; Müller, Jürg.

Dev Cell ; 51(5): 632-644.e6, 2019 Dec 02.

Article En | MEDLINE | ID: mdl-31630981

Gene transcription in eukaryotes is regulated through dynamic interactions of a variety of different proteins with DNA in the context of chromatin. Here, we used mass spectrometry for absolute quantification of the nuclear proteome and methyl marks on selected lysine residues in histone H3 during two stages of Drosophila embryogenesis. These analyses provide comprehensive information about the absolute copy number of several thousand proteins and reveal unexpected relationships between the abundance of histone-modifying and -binding proteins and the chromatin landscape that they generate and interact with. For some histone modifications, the levels in Drosophila embryos are substantially different from those previously reported in tissue culture cells. Genome-wide profiling of H3K27 methylation during developmental progression and in animals with reduced PRC2 levels illustrates how mass spectrometry can be used for quantitatively describing and comparing chromatin states. Together, these data provide a foundation toward a quantitative understanding of gene regulation in Drosophila.

Chromatin Assembly and Disassembly; Embryo, Nonmammalian/metabolism; Gene Expression Regulation, Developmental; Histone Code; Animals; Chromatin/genetics; Chromatin/metabolism; Drosophila Proteins/genetics; Drosophila Proteins/metabolism; Drosophila melanogaster; Histone-Lysine N-Methyltransferase/genetics; Histone-Lysine N-Methyltransferase/metabolism; Histones/genetics; Histones/metabolism; Proteome/genetics; Proteome/metabolism

Assessing predictions of the impact of variants on splicing in CAGI5.

Mount, Stephen M; Avsec, Ziga; Carmel, Liran; Casadio, Rita; Çelik, Muhammed Hasan; Chen, Ken; Cheng, Jun; Cohen, Noa E; Fairbrother, William G; Fenesh, Tzila; Gagneur, Julien; Gotea, Valer; Holzer, Tamar; Lin, Chiao-Feng; Martelli, Pier Luigi; Naito, Tatsuhiko; Nguyen, Thi Yen Duong; Savojardo, Castrense; Unger, Ron; Wang, Robert; Yang, Yuedong; Zhao, Huiying.

Hum Mutat ; 40(9): 1215-1224, 2019 Sep.

Article En | MEDLINE | ID: mdl-31301154

Precision medicine and sequence-based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex-seq and MaPSy) involved prediction of the effect of variants, primarily single-nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high-throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.

Alternative Splicing; Computational Biology/methods; Mutation; Proteins/genetics; Animals; Congresses as Topic; Genetic Fitness; Humans; Models, Genetic; Sequence Homology, Nucleic Acid

CAGI 5 splicing challenge: Improved exon skipping and intron retention predictions with MMSplice.

Cheng, Jun; Çelik, Muhammed Hasan; Nguyen, Thi Yen Duong; Avsec, Ziga; Gagneur, Julien.

Hum Mutat ; 40(9): 1243-1251, 2019 Sep.

Article En | MEDLINE | ID: mdl-31070280

Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.

Computational Biology/methods; Genetic Variation; RNA Splicing; Congresses as Topic; Exons; Genetic Predisposition to Disease; Humans; Introns; Models, Genetic; Software

MMSplice: modular modeling improves the predictions of genetic variant effects on splicing.

Cheng, Jun; Nguyen, Thi Yen Duong; Cygan, Kamil J; Çelik, Muhammed Hasan; Fairbrother, William G; Avsec, Ziga; Gagneur, Julien.

Genome Biol ; 20(1): 48, 2019 Mar 01.

Article En | MEDLINE | ID: mdl-30823901

Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.

Alternative Splicing; Genetic Variation; Models, Genetic; Neural Networks, Computer; Genetic Diseases, Inborn
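The modular idea in the MMSplice abstract (separate modules score exon, intron, and splice-site sequence, and the scores are combined into a variant effect) can be sketched roughly as below. Everything specific here is an assumption for illustration and is not taken from the abstract or the MMSplice package: the module names, the unit weights, the linear combination of alternative-minus-reference scores, and the use of a logit scale for exon inclusion (percent spliced-in, PSI).

```python
import math

# Hypothetical module names and weights; a real model would learn these.
MODULE_WEIGHTS = {
    "acceptor_intron": 1.0,
    "acceptor": 1.0,
    "exon": 1.0,
    "donor": 1.0,
    "donor_intron": 1.0,
}


def combine_module_scores(ref_scores, alt_scores, weights=MODULE_WEIGHTS):
    """Weighted sum of per-module score changes (alternative minus reference)."""
    return sum(w * (alt_scores[m] - ref_scores[m]) for m, w in weights.items())


def shifted_psi(ref_psi, delta_logit):
    """Apply a change on the logit scale to a reference inclusion level.

    ref_psi must lie strictly between 0 and 1.
    """
    logit = math.log(ref_psi / (1.0 - ref_psi))
    return 1.0 / (1.0 + math.exp(-(logit + delta_logit)))
```

In this sketch, a variant that lowers only the donor score by 2 yields a combined effect of -2, which would move an exon included half of the time to roughly 12% inclusion.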