ABSTRACT
Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. However, while a global change in transcription is recognized as a defining feature of cancer, the contribution of alternative promoters still remains largely unexplored. Here, we infer active promoters using RNA-seq data from 18,468 cancer and normal samples, demonstrating that alternative promoters are a major contributor to context-specific regulation of transcription. We find that promoters are deregulated across tissues, cancer types, and patients, affecting known cancer genes and novel candidates. For genes with independently regulated promoters, we demonstrate that promoter activity provides a more accurate predictor of patient survival than gene expression. Our study suggests that a dynamic landscape of active promoters shapes the cancer transcriptome, opening new diagnostic avenues and opportunities to further explore the interplay of regulatory mechanisms with transcriptional aberrations in cancer.
Subject(s)
Computational Biology/methods , Gene Expression Regulation, Neoplastic/genetics , Neoplasms/genetics , Promoter Regions, Genetic/genetics , Transcriptome/genetics , Databases, Genetic , Humans , RNA-Seq/methods
ABSTRACT
Although KDM5C is one of the most frequently mutated genes in X-linked intellectual disability1, the exact mechanisms that lead to cognitive impairment remain unknown. Here we use human patient-derived induced pluripotent stem cells and Kdm5c knockout mice to conduct cellular, transcriptomic, chromatin and behavioural studies. KDM5C is identified as a safeguard to ensure that neurodevelopment occurs at an appropriate timescale, the disruption of which leads to intellectual disability. Specifically, there is a developmental window during which KDM5C directly controls WNT output to regulate the timely transition of primary to intermediate progenitor cells and consequently neurogenesis. Treatment with WNT signalling modulators at specific times reveals that only a transient alteration of the canonical WNT signalling pathway is sufficient to rescue the transcriptomic and chromatin landscapes in patient-derived cells and to induce these changes in wild-type cells. Notably, WNT inhibition during this developmental period also rescues behavioural changes of Kdm5c knockout mice. Conversely, a single injection of WNT3A into the brains of wild-type embryonic mice causes anxiety and memory alterations. Our work identifies KDM5C as a crucial sentinel for neurodevelopment and sheds new light on KDM5C mutation-associated intellectual disability. The results also increase our general understanding of memory and anxiety formation, with the identification of WNT functioning in a transient nature to affect long-lasting cognitive function.
Subject(s)
Cognition , Embryo, Mammalian , Embryonic Development , Histone Demethylases , Wnt Signaling Pathway , Animals , Humans , Mice , Anxiety , Chromatin/drug effects , Chromatin/genetics , Chromatin/metabolism , Embryo, Mammalian/metabolism , Gene Expression Profiling , Histone Demethylases/genetics , Histone Demethylases/metabolism , Induced Pluripotent Stem Cells/cytology , Induced Pluripotent Stem Cells/metabolism , Intellectual Disability/genetics , Memory , Mice, Knockout , Mutation , Neurogenesis/genetics , Wnt Signaling Pathway/drug effects
ABSTRACT
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Subject(s)
Gene Expression Profiling , RNA-Seq , Humans , Animals , Mice , RNA-Seq/methods , Gene Expression Profiling/methods , Transcriptome , Sequence Analysis, RNA/methods , Molecular Sequence Annotation/methods
ABSTRACT
Most approaches to transcript quantification rely on fixed reference annotations; however, the transcriptome is dynamic and, depending on the context, such static annotations contain inactive isoforms for some genes, whereas they are incomplete for others. Here we present Bambu, a method that performs machine-learning-based transcript discovery to enable quantification specific to the context of interest using long-read RNA-sequencing. To identify novel transcripts, Bambu estimates the novel discovery rate, which replaces arbitrary per-sample thresholds with a single, interpretable, precision-calibrated parameter. Bambu retains the full-length and unique read counts, enabling accurate quantification in the presence of inactive isoforms. Compared to existing methods for transcript discovery, Bambu achieves greater precision without sacrificing sensitivity. We show that context-aware annotations improve quantification for both novel and known transcripts. We apply Bambu to quantify isoforms from repetitive HERVH-LTR7 retrotransposons in human embryonic stem cells, demonstrating the ability for context-specific transcript expression analysis.
Subject(s)
Gene Expression Profiling , Transcriptome , Humans , RNA-Seq , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Protein Isoforms/genetics
ABSTRACT
Transcript alterations often result from somatic changes in cancer genomes1. Various forms of RNA alterations have been described in cancer, including overexpression2, altered splicing3 and gene fusions4; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)5. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.
Subject(s)
Gene Expression Regulation, Neoplastic , Neoplasms/genetics , RNA/genetics , DNA Copy Number Variations , DNA, Neoplasm , Genome, Human , Genomics , Humans , Transcriptome
ABSTRACT
Nanopore sequencing provides signal data corresponding to the nucleotide motifs sequenced. Through machine learning-based methods, these signals are translated into long-read sequences that overcome the read size limit of short-read sequencing. However, analyzing the raw nanopore signal data provides many more opportunities beyond just sequencing genomes and transcriptomes: algorithms that use machine learning approaches to extract biological information from these signals allow the detection of DNA and RNA modifications, the estimation of poly(A) tail length, and the prediction of RNA secondary structures. In this review, we discuss how developments in machine learning methodologies contributed to more accurate basecalling and lower error rates, and how these methods enable new biological discoveries. We argue that direct nanopore sequencing of DNA and RNA provides a new dimensionality for genomics experiments and highlight challenges and future directions for computational approaches to extract the additional information provided by nanopore signal data.
Subject(s)
Nanopore Sequencing , Nanopores , Algorithms , Genomics , High-Throughput Nucleotide Sequencing/methods , Machine Learning , Sequence Analysis, DNA/methods
ABSTRACT
RNA modifications such as m6A methylation form an additional layer of complexity in the transcriptome. Nanopore direct RNA sequencing can capture this information in the raw current signal for each RNA molecule, enabling the detection of RNA modifications using supervised machine learning. However, experimental approaches provide only site-level training data, whereas the modification status for each single RNA molecule is missing. Here we present m6Anet, a neural-network-based method that leverages the multiple instance learning framework to specifically handle missing read-level modification labels in site-level training data. m6Anet outperforms existing computational methods, shows similar accuracy as experimental approaches, and generalizes with high accuracy to different cell lines and species without retraining model parameters. In addition, we demonstrate that m6Anet captures the underlying read-level stoichiometry, which can be used to approximate differences in modification rates. Overall, m6Anet offers a tool to capture the transcriptome-wide identification and quantification of m6A from a single run of direct RNA sequencing.
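The multiple instance learning idea can be illustrated with a toy pooling step: each read at a site receives a modification probability, and the site is called modified if at least one read is modified. The sketch below uses noisy-OR pooling as an assumed aggregation rule; the abstract does not state which pooling function m6Anet uses, and the function name `site_probability` is hypothetical.

```python
from math import prod


def site_probability(read_probs):
    """Noisy-OR pooling (an assumed aggregation rule, for illustration only).

    Treats each read as an independent instance and returns the probability
    that at least one read at the site carries the modification.
    """
    return 1.0 - prod(1.0 - p for p in read_probs)


# Two reads, each with a 50% chance of being modified.
print(site_probability([0.5, 0.5]))
```

Under this rule only the site-level label is needed for training, while the per-read probabilities remain available afterwards, which is one way read-level stoichiometry could be approximated.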
Subject(s)
Nanopore Sequencing , RNA , RNA/genetics , RNA/metabolism , Sequence Analysis, RNA/methods , Methylation , Transcriptome
ABSTRACT
MOTIVATION: The process of analyzing high throughput sequencing data often requires the identification and extraction of specific target sequences. This could include tasks such as identifying cellular barcodes and UMIs in single-cell data, and specific genetic variants for genotyping. However, existing tools that perform these functions are often task-specific, such as only demultiplexing barcodes for a dedicated type of experiment, or are not tolerant to noise in the sequencing data. RESULTS: To overcome these limitations, we developed Flexiplex, a versatile and fast sequence searching and demultiplexing tool for omics data, which is based on the Levenshtein distance and thus allows imperfect matches. We demonstrate Flexiplex's application on three use cases, identifying cell-line-specific sequences in Illumina short-read single-cell data, and discovering and demultiplexing cellular barcodes from noisy long-read single-cell RNA-seq data. We show that Flexiplex achieves an excellent balance of accuracy and computational efficiency compared to leading task-specific tools. AVAILABILITY AND IMPLEMENTATION: Flexiplex is available at https://davidsongroup.github.io/flexiplex/.
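Searching with the Levenshtein distance so that imperfect matches are tolerated can be sketched as a semi-global dynamic programme: the pattern must be matched in full, but it may start anywhere in the read. This is a generic illustration of that technique, not Flexiplex's implementation; the function name `best_fuzzy_match` and its parameters are hypothetical.

```python
def best_fuzzy_match(pattern, text, max_dist):
    """Find the lowest-edit-distance occurrence of `pattern` inside `text`.

    Returns (end, dist), where `end` is the 1-based position in `text` at
    which the best match ends, or None if no match has dist <= max_dist.
    Ties are resolved in favour of the leftmost end position.
    """
    m = len(pattern)
    # prev[i] = cost of aligning pattern[:i] against an empty stretch of text.
    prev = list(range(m + 1))
    best = (0, prev[m]) if prev[m] <= max_dist else None
    for j, ch in enumerate(text, start=1):
        # curr[0] = 0 lets a match begin at any position in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character missing from the text
            )
        if curr[m] <= max_dist and (best is None or curr[m] < best[1]):
            best = (j, curr[m])
        prev = curr
    return best


# A barcode with one substituted base is still found within distance 1.
print(best_fuzzy_match("ACGT", "TTACCTTT", max_dist=1))
```

The run time is proportional to the pattern length times the read length, and only two columns of the table are kept in memory at a time.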
Subject(s)
Search Engine , Software , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing , Electronic Data Processing
ABSTRACT
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Subject(s)
Biomedical Research/methods , Biomedical Research/standards , Computational Biology/methods , Workflow , Reproducibility of Results
ABSTRACT
Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated with genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in superenhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches, suggesting that a large part of the regulome might be uncharted water.
Subject(s)
Chromatin/metabolism , Data Analysis , Enhancer Elements, Genetic , Epigenesis, Genetic , Gene Expression Regulation , Human Umbilical Vein Endothelial Cells , Humans
ABSTRACT
OBJECTIVES: Epigenomic alterations in cancer interact with the immune microenvironment to dictate tumour evolution and therapeutic response. We aimed to study the regulation of the tumour immune microenvironment through epigenetic alternate promoter use in gastric cancer and to expand our findings to other gastrointestinal tumours. DESIGN: Alternate promoter burden (APB) was quantified using a novel bioinformatic algorithm (proActiv) to infer promoter activity from short-read RNA sequencing and samples categorised into APBhigh, APBint and APBlow. Single-cell RNA sequencing was performed to analyse the intratumour immune microenvironment. A humanised mouse cancer in vivo model was used to explore dynamic temporal interactions between tumour kinetics, alternate promoter usage and the human immune system. Multiple cohorts of gastrointestinal tumours treated with immunotherapy were assessed for correlation between APB and treatment outcomes. RESULTS: APBhigh gastric cancer tumours expressed decreased levels of T-cell cytolytic activity and exhibited signatures of immune depletion. Single-cell RNA sequencing analysis confirmed distinct immunological populations and lower T-cell proportions in APBhigh tumours. Functional in vivo studies using 'humanised mice' harbouring an active human immune system revealed distinct temporal relationships between APB and tumour growth, with APBhigh tumours having almost no human T-cell infiltration. Analysis of immunotherapy-treated patients with GI cancer confirmed resistance of APBhigh tumours to immune checkpoint inhibition. APBhigh gastric cancer exhibited significantly poorer progression-free survival compared with APBlow (median 55 days vs 121 days, HR 0.40, 95% CI 0.18 to 0.93, p=0.032). CONCLUSION: These findings demonstrate an association between alternate promoter use and the tumour microenvironment, leading to immune evasion and immunotherapy resistance.
Subject(s)
Gastrointestinal Neoplasms , Stomach Neoplasms , Animals , Epigenesis, Genetic , Epigenomics , Gastrointestinal Neoplasms/genetics , Gastrointestinal Neoplasms/therapy , Humans , Immune Checkpoint Inhibitors/pharmacology , Immune Checkpoint Inhibitors/therapeutic use , Immunotherapy , Mice , Stomach Neoplasms/drug therapy , Stomach Neoplasms/therapy , Tumor Microenvironment
ABSTRACT
The extracellular signal-regulated kinase (ERK)/mitogen-activated protein kinase signal-transduction cascade is one of the key pathways regulating proliferation and differentiation in development and disease. ERK signaling is required for human embryonic stem cells' (hESCs') self-renewing property. Here, we studied the convergence of the ERK signaling cascade at the DNA by mapping genome-wide kinase-chromatin interactions for ERK2 in hESCs. We observed that ERK2 binding occurs near noncoding genes and histone, cell-cycle, metabolism, and pluripotency-associated genes. We find that the transcription factor ELK1 is essential in hESCs and that ERK2 co-occupies promoters bound by ELK1. Strikingly, promoters bound by ELK1 without ERK2 are occupied by Polycomb group proteins that repress genes involved in lineage commitment. In summary, we propose a model wherein extracellular-signaling-stimulated proliferation and intrinsic repression of differentiation are integrated to maintain the identity of hESCs.
Subject(s)
Chromatin/enzymology , Embryonic Stem Cells/enzymology , MAP Kinase Signaling System , Mitogen-Activated Protein Kinase 1/metabolism , ets-Domain Protein Elk-1/metabolism , Base Sequence , Cell Differentiation , Cell Lineage , Cells, Cultured , Consensus Sequence , Embryonic Stem Cells/physiology , Gene Expression Regulation , Gene Knockdown Techniques , Genome, Human , Humans , Mitogen-Activated Protein Kinase 1/genetics , Polycomb-Group Proteins/genetics , Polycomb-Group Proteins/metabolism , Promoter Regions, Genetic , Protein Binding , RNA, Small Interfering/genetics , RNA, Small Nuclear/genetics , RNA, Small Nuclear/metabolism , Transcription, Genetic , Transcriptome , ets-Domain Protein Elk-1/genetics
ABSTRACT
Endogenous retroviruses (ERVs) contribute to ~10 percent of the mouse genome. They are often silenced in differentiated somatic cells but differentially expressed at various embryonic developmental stages. A minority of mouse embryonic stem cells (ESCs), like 2-cell cleavage embryos, highly express ERV MERVL. However, the role of ERVs and mechanism of their activation in these cells are still poorly understood. In this study, we investigated the regulation and function of the stage-specific expressed ERVs, with a particular focus on the totipotency marker MT2/MERVL. We show that the transcription factor Zscan4c functions as an activator of MT2/MERVL and 2-cell/4-cell embryo genes. Zinc finger domains of Zscan4c play an important role in this process. In addition, Zscan4c interacts with MT2 and regulates MT2-nearby 2-cell/4-cell genes through promoting enhancer activity of MT2. Furthermore, MT2 activation is accompanied by enhanced H3K4me1, H3K27ac, and H3K14ac deposition on MT2. Zscan4c also interacts with GBAF chromatin remodelling complex through SCAN domain to further activate MT2 enhancer activity. Taken together, we delineate a previously unrecognized regulatory axis in which Zscan4c interacts with and activates MT2/MERVL loci and their nearby genes through epigenetic regulation.
Subject(s)
Endogenous Retroviruses/genetics , Gene Expression Regulation, Developmental , Genome , Histones/metabolism , Retroelements , Transcription Factors/genetics , Animals , Chromatin/chemistry , Chromatin/metabolism , Embryo, Mammalian , Endogenous Retroviruses/metabolism , Enhancer Elements, Genetic , Epigenesis, Genetic , Gene Expression Profiling , Gene Ontology , Histones/genetics , Mice , Molecular Sequence Annotation , Mouse Embryonic Stem Cells/cytology , Mouse Embryonic Stem Cells/metabolism , Transcription Factors/metabolism
ABSTRACT
Motivation: International consortia such as the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) or the International Human Epigenetics Consortium (IHEC) have produced a wealth of genomic datasets with the goal of advancing our understanding of cell differentiation and disease mechanisms. However, utilizing all of these data effectively through integrative analysis is hampered by batch effects, large cell type heterogeneity and low replicate numbers. To study whether batch effects across datasets can be observed and adjusted for, we analyze RNA-seq data of 215 samples from ENCODE, Roadmap, BLUEPRINT and DEEP as well as 1336 samples from GTEx and TCGA. While batch effects are a considerable issue, it is non-trivial to determine whether batch adjustment leads to an improvement in data quality, especially in cases of low replicate numbers. Results: We present a novel method for assessing the performance of batch effect adjustment methods on heterogeneous data. Our method borrows information from the Cell Ontology to establish whether batch adjustment leads to a better agreement between observed pairwise similarity and similarity of cell types inferred from the ontology. A comparison of state-of-the-art batch effect adjustment methods suggests that batch effects in heterogeneous datasets with low replicate numbers cannot be adequately adjusted. Better methods need to be developed, which can be assessed objectively in the framework presented here. Availability and implementation: Our method is available online at https://github.com/SchulzLab/OntologyEval. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Datasets as Topic , Sequence Analysis, RNA/methods , Data Accuracy , Gene Ontology , Genome, Human , Genomics , Humans , RNA/genetics
ABSTRACT
The 2i-media, composed of two small molecule inhibitors (PD0325901 and CHIR99021) against MEK and GSK3-kinases, respectively, is known to establish naïve ground state pluripotency in mouse embryonic stem cells (mESCs). These inhibitors block MEK-mediated differentiation, while driving β-catenin dependent de-repression of pluripotency promoting targets. However, accumulating evidence suggests that β-catenin's association with activating TCFs (TCF7 and TCF7L2) can induce expression of several lineage-specific prodifferentiation genes. We posited that CHIR-induced upregulation of β-catenin levels could therefore compromise the stability of the naïve state in long-term cultures. Here, we investigated whether replacing CHIR with iCRT3, a small molecule that abrogates β-catenin-TCF interaction, can still retain ground state pluripotency in mESCs. Our data suggest that iCRT3 + PD mediated coinhibition of MEK and β-catenin/TCF-dependent transcriptional activity over multiple passages significantly reduces expression of differentiation markers, as compared to 2i. Furthermore, the ability to efficiently contribute toward chimera generation and germline transmission suggests that the inhibition of β-catenin's TCF-dependent transcriptional activity, independent of its protein expression level, retains the naïve ground state pluripotency in mESCs. Additionally, growth medium containing iCRT3 + PD can provide an alternative to 2i as a stable culture method. Stem Cells 2017;35:1924-1933.
Subject(s)
Hepatocyte Nuclear Factor 1-alpha/metabolism , Mouse Embryonic Stem Cells/metabolism , Pluripotent Stem Cells/metabolism , Transcription Factor 7-Like 2 Protein/metabolism , beta Catenin/metabolism , Animals , Benzamides/pharmacology , Biomarkers/metabolism , Cell Differentiation/drug effects , Cell Differentiation/genetics , Cells, Cultured , Diphenylamine/analogs & derivatives , Diphenylamine/pharmacology , Female , Mice , Mice, Inbred C57BL , Mouse Embryonic Stem Cells/cytology , Mouse Embryonic Stem Cells/drug effects , Oxazoles/pharmacology , Pluripotent Stem Cells/drug effects , Protein Binding/drug effects , Pyridines/pharmacology , Pyrimidines/pharmacology , Transcriptome/drug effects , Transcriptome/genetics , Up-Regulation/drug effects , Up-Regulation/genetics
ABSTRACT
The human genome contains millions of fragments from retrotransposons, highly repetitive DNA sequences that were once able to "copy and paste" themselves to other regions in the genome. However, the majority of retrotransposons have lost this capacity through acquisition of mutations or through endogenous silencing mechanisms. Without this imminent threat of transposition, retrotransposons have the potential to act as a major source of genomic innovation. Indeed, large numbers of retrotransposons have been found to be active in specific contexts: as gene regulatory elements and promoters for protein-coding genes or long noncoding RNAs, among others. In this review, we summarise recent findings about retrotransposons, with implications in gene expression regulation, the expansion of gene isoform diversity and the generation of long noncoding RNAs. We highlight key examples that demonstrate their role in cellular identity and their versatility as markers of cell states, and we discuss how their dysregulation may contribute to the formation of and possibly therapeutic response in human cancers.
Subject(s)
Gene Expression Regulation , Retroelements , Transcriptome , Alternative Splicing , Animals , Endogenous Retroviruses/genetics , Genetic Markers , Genome, Human , Genomic Instability , Humans , Long Interspersed Nucleotide Elements , Neoplasms/genetics , Neoplasms/immunology , Neoplasms/therapy , Oncogenes , Organ Specificity/genetics , RNA, Long Noncoding/genetics , Regulatory Sequences, Nucleic Acid
ABSTRACT
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
Subject(s)
Computational Biology/methods , Gene Expression , Gene Expression Regulation , Genetic Predisposition to Disease , Humans , Quantitative Trait Loci
ABSTRACT
PRDM14 is an important determinant of the human embryonic stem cell (ESC) identity and works in concert with the core ESC regulators to activate pluripotency-associated genes. PRDM14 has been previously reported to exhibit repressive activity in mouse ESCs and primordial germ cells; and while PRDM14 has been implicated to suppress differentiation genes in human ESCs, the exact mechanism of this repressive activity remains unknown. In this study, we provide evidence that PRDM14 is a direct repressor of developmental genes in human ESCs. PRDM14 binds to silenced genes in human ESCs and its global binding profile is enriched for the repressive trimethylation of histone H3 lysine 27 (H3K27me3) modification. Further investigation reveals that PRDM14 interacts directly with the chromatin regulator polycomb repressive complex 2 (PRC2) and PRC2 binding is detected at PRDM14-bound loci in human ESCs. Depletion of PRDM14 reduces PRC2 binding at these loci and the concomitant reduction of H3K27me3 modification. Using reporter assays, we demonstrate that gene loci bound by PRDM14 exhibit repressive activity that is dependent on both PRDM14 and PRC2. In reprogramming human fibroblasts into induced pluripotent stem cells (iPSCs), ectopically expressed PRDM14 can repress these developmental genes in fibroblasts. In addition, we show that PRDM14 recruits PRC2 to repress a key mesenchymal gene ZEB1, which enhances mesenchymal-to-epithelial transition in the initiation event of iPSC reprogramming. In summary, our study reveals a repressive role of PRDM14 in the maintenance and induction of pluripotency and identifies PRDM14 as a new regulator of PRC2.
Subject(s)
Cellular Reprogramming/physiology , Embryonic Stem Cells/metabolism , Induced Pluripotent Stem Cells/metabolism , Polycomb Repressive Complex 2/metabolism , Repressor Proteins/metabolism , Cellular Reprogramming/genetics , DNA-Binding Proteins , Embryonic Stem Cells/cytology , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , Humans , Induced Pluripotent Stem Cells/cytology , Polycomb Repressive Complex 2/genetics , Protein Binding , RNA-Binding Proteins , Repressor Proteins/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Zinc Finger E-box-Binding Homeobox 1
ABSTRACT
MOTIVATION: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets. RESULTS: We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2. CONCLUSION: N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences. AVAILABILITY: The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http://www.seqan.de/projects/alf.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
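An alignment-free comparison over word neighbourhoods can be illustrated with a much simplified sketch: count every k-mer together with its reverse complement, centre the counts, and take the cosine of the two centred vectors. This is not the published N2 definition (its standardization and mismatch neighbourhoods are omitted), and the function names and the default `k=3` are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def neighbourhood_counts(seq, k):
    """Count each k-mer and its reverse complement (a two-word neighbourhood)."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] += 1
        counts[reverse_complement(word)] += 1
    return counts


def kmer_similarity(a, b, k=3):
    """Cosine of mean-centred neighbourhood count vectors (simplified measure)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    ca, cb = neighbourhood_counts(a, k), neighbourhood_counts(b, k)
    va = [ca[w] for w in words]
    vb = [cb[w] for w in words]
    mean_a, mean_b = sum(va) / len(va), sum(vb) / len(vb)
    da = [x - mean_a for x in va]
    db = [y - mean_b for y in vb]
    norm = sqrt(sum(x * x for x in da)) * sqrt(sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / norm if norm else 0.0


seq = "ACGTTGCAAGGCT"
# With the reverse-complement neighbourhood, strand orientation does not matter.
print(kmer_similarity(seq, reverse_complement(seq)))
```

Because no alignment is computed, the cost grows linearly with sequence length for a fixed k, which is the property that makes measures of this kind attractive for large-scale pairwise comparison.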