ABSTRACT
Non-coding genetic variation is a major driver of phenotypic diversity and allows the investigation of mechanisms that control gene expression. Here, we systematically investigated the effects of >50 million variations from five strains of mice on mRNA, nascent transcription, transcription start sites, and transcription factor binding in resting and activated macrophages. We observed substantial differences associated with distinct molecular pathways. Evaluating genetic variation provided evidence for roles of ~100 TFs in shaping lineage-determining factor binding. Unexpectedly, a substantial fraction of strain-specific factor binding could not be explained by local mutations. Integration of genomic features with chromatin interaction data provided evidence for hundreds of connected cis-regulatory domains associated with differences in transcription factor binding and gene expression. This system and the >250 datasets establish a substantial new resource for investigation of how genetic variation affects cellular phenotypes.
Subjects
Genetic Variation , Macrophages/metabolism , Transcription Factors/metabolism , Animals , Binding Sites , Bone Marrow Cells/cytology , CCAAT-Enhancer-Binding Protein-beta/genetics , CCAAT-Enhancer-Binding Protein-beta/metabolism , Cluster Analysis , Enhancer Elements, Genetic/genetics , Female , Gene Expression Regulation/drug effects , Lipopolysaccharides/pharmacology , Macrophages/cytology , Macrophages/drug effects , Male , Mice , Mice, Inbred BALB C , Mice, Inbred C57BL , Mice, Inbred NOD , Promoter Regions, Genetic , Protein Binding , Proto-Oncogene Proteins/genetics , Proto-Oncogene Proteins/metabolism , Trans-Activators/genetics , Trans-Activators/metabolism , Transcription Factors/genetics
ABSTRACT
Eukaryotic transcription factors (TFs) form complexes with various partner proteins to recognize their genomic target sites. Yet, how the DNA sequence determines which TF complex forms at any given site is poorly understood. Here, we demonstrate that high-throughput in vitro DNA binding assays coupled with unbiased computational analysis provide unprecedented insight into how different DNA sequences select distinct compositions and configurations of homeodomain TF complexes. Using inferred knowledge about minor groove width readout, we design targeted protein mutations that destabilize homeodomain binding both in vitro and in vivo in a complex-specific manner. By performing parallel systematic evolution of ligands by exponential enrichment sequencing (SELEX-seq), chromatin immunoprecipitation sequencing (ChIP-seq), RNA sequencing (RNA-seq), and Hi-C assays, we not only classify the majority of in vivo binding events in terms of complex composition but also infer complex-specific functions by perturbing the gene regulatory network controlled by a single complex.
Subjects
DNA/chemistry , Drosophila Proteins/metabolism , Gene Expression Regulation , Homeodomain Proteins/metabolism , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites , DNA/metabolism , Drosophila Proteins/chemistry , Drosophila Proteins/genetics , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Homeodomain Proteins/chemistry , Homeodomain Proteins/genetics , Mutation , Nucleic Acid Conformation , Protein Binding , Transcription Factors/chemistry , Transcription Factors/genetics
ABSTRACT
Transcription factors (TFs) are proteins essential for regulating gene transcription by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate predictions of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, the prediction performance needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely based on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model utilizes the pre-trained DNABERT-2 module to acquire the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
Subjects
Neural Networks, Computer , Transcription Factors , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Algorithms , Computational Biology/methods , Humans , Deep Learning , Protein Binding
ABSTRACT
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
Subjects
Deep Learning , Transcription Factors , Transcription Factors/metabolism , Binding Sites , Algorithms , Computational Biology/methods , Humans , Chromatin Immunoprecipitation Sequencing/methods , DNA/metabolism , DNA/chemistry
ABSTRACT
Ultraviolet (UV) light induces different classes of mutagenic photoproducts in DNA, namely cyclobutane pyrimidine dimers (CPDs), 6-4 photoproducts (6-4PPs), and atypical thymine-adenine photoproducts (TA-PPs). CPD formation is modulated by nucleosomes and transcription factors (TFs), which has important ramifications for UV mutagenesis. How chromatin affects the formation of 6-4PPs and TA-PPs is unclear. Here, we use UV damage endonuclease-sequencing (UVDE-seq) to map these UV photoproducts across the yeast genome. Our results indicate that nucleosomes, the fundamental building block of chromatin, have opposing effects on photoproduct formation. Nucleosomes induce CPDs and 6-4PPs at outward rotational settings in nucleosomal DNA but suppress TA-PPs at these settings. Our data also indicate that DNA binding by different classes of yeast TFs causes lesion-specific hotspots of 6-4PPs or TA-PPs. For example, DNA binding by the TF Rap1 generally suppresses CPD and 6-4PP formation but induces a TA-PP hotspot. Finally, we show that 6-4PP formation is strongly induced at the binding sites of TATA-binding protein (TBP), which is correlated with higher mutation rates in UV-exposed yeast. These results indicate that the formation of 6-4PPs and TA-PPs is modulated by chromatin differently than CPDs and that this may have important implications for UV mutagenesis.
Subjects
Chromatin , Saccharomyces cerevisiae , Chromatin/genetics , Saccharomyces cerevisiae/genetics , Nucleosomes/genetics , Mutagenesis , Mutagens , Adenine , Pyrimidine Dimers/genetics
ABSTRACT
The K50 (lysine at amino acid position 50) homeodomain (HD) protein Orthodenticle (Otd) is critical for anterior patterning and brain and eye development in most metazoans. In Drosophila melanogaster, another K50HD protein, Bicoid (Bcd), has evolved to replace Otd's ancestral function in embryo patterning. Bcd is distributed as a long-range maternal gradient and activates transcription of a large number of target genes, including otd. Otd and Bcd bind similar DNA sequences in vitro, but how their transcriptional activities are integrated to pattern anterior regions of the embryo is unknown. Here we define three major classes of enhancers that are differentially sensitive to binding and transcriptional activation by Bcd and Otd. Class 1 enhancers are initially activated by Bcd, and activation is transferred to Otd via a feed-forward relay (FFR) that involves sequential binding of the two proteins to the same DNA motif. Class 2 enhancers are activated by Bcd and maintained by an Otd-independent mechanism. Class 3 enhancers are never bound by Bcd, but Otd binds and activates them in a second wave of zygotic transcription. The specific activities of enhancers in each class are mediated by DNA motif variants preferentially bound by Bcd or Otd and the presence or absence of sites for cofactors that interact with these proteins. Our results define specific patterning roles for Bcd and Otd and provide mechanisms for coordinating the precise timing of gene expression patterns during embryonic development.
Subjects
Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Drosophila melanogaster/embryology , Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , Trans-Activators/genetics , Trans-Activators/metabolism , Amino Acid Motifs , Animals , Body Patterning/genetics , Drosophila melanogaster/metabolism , Embryonic Development/drug effects , Embryonic Development/genetics , Enhancer Elements, Genetic/genetics , Protein Binding
ABSTRACT
Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF-binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of 3 TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here, we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.
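The null-distribution test described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_score`, `CONSENSUS`, and the example sequences are hypothetical stand-ins for a machine-learned affinity predictor and real TFBS data, and the sketch does not include any correction for the ascertainment bias the abstract discusses.

```python
import random

BASES = "ACGT"
CONSENSUS = "TGACTCA"  # hypothetical motif, used only for this toy scorer


def toy_score(seq):
    """Toy affinity: +1 per match to CONSENSUS, -1 per mismatch (assumption)."""
    return sum(1 if b == c else -1 for b, c in zip(seq, CONSENSUS))


def n_step_neighbor(ancestral, n, rng):
    """Return a copy of `ancestral` mutated at n distinct random positions."""
    seq = list(ancestral)
    for pos in rng.sample(range(len(seq)), n):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def selection_test(ancestral, derived, score, n_samples=2000, seed=0):
    """One-sided empirical test for elevated affinity of the derived sequence."""
    rng = random.Random(seed)
    # n = number of sites at which the derived sequence differs from the ancestor
    n = sum(a != d for a, d in zip(ancestral, derived))
    observed = score(derived)
    # Null distribution: affinities of random n-step neighbors of the ancestor
    null = [score(n_step_neighbor(ancestral, n, rng)) for _ in range(n_samples)]
    # Empirical p-value with a +1 pseudocount so it is never exactly zero
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_samples)
    return n, observed, p_value


ANCESTRAL = "TGAGTGA"  # hypothetical inferred ancestral site
DERIVED = "TGACTCA"    # hypothetical site in the focal species
n, observed, p_value = selection_test(ANCESTRAL, DERIVED, toy_score)
print(n, observed, p_value)
```

A small p-value here only says the derived affinity is high relative to random n-step neighbors; as the abstract argues, sites ascertained because they are bound today are expected to show this pattern even without positive selection.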
Subjects
Genomics , Transcription Factors , Humans , Transcription Factors/metabolism , Computer Simulation , Binding Sites/genetics , Protein Binding
ABSTRACT
A long-standing biological question is how DNA cis-regulatory elements shape transcriptional patterns during metazoan development. Reporter constructs, cell culture assays and computational modeling have made major contributions to answering this question, but analysis of elements in their natural context is an important complement. Here, we mutate Notch-dependent LAG-1 binding sites (LBSs) in the endogenous Caenorhabditis elegans sygl-1 gene, which encodes a key stem cell regulator, and analyze the consequences on sygl-1 expression (nascent transcripts, mRNA, protein) and stem cell maintenance. Mutation of one LBS in a three-element cluster approximately halved both expression and stem cell pool size, whereas mutation of two LBSs essentially abolished them. Heterozygous LBS mutant clusters provided intermediate values. Our results lead to two major conclusions. First, both LBS number and configuration impact cluster activity: LBSs act additively in trans and synergistically in cis. Second, the SYGL-1 gradient promotes self-renewal above its functional threshold and triggers differentiation below the threshold. Our approach of coupling CRISPR/Cas9 LBS mutations with effects on both molecular and biological readouts establishes a powerful model for in vivo analyses of DNA cis-regulatory elements.
Subjects
Caenorhabditis elegans , Regulatory Elements, Transcriptional , Stem Cells , Animals , Caenorhabditis elegans/cytology , Caenorhabditis elegans/metabolism , Caenorhabditis elegans Proteins/genetics , Cell Self Renewal , DNA/metabolism , DNA-Binding Proteins/genetics , Receptors, Notch , Stem Cells/cytology
ABSTRACT
Interactions between DNA and transcription factors (TFs) are essential to understanding transcriptional regulation mechanisms and gene expression. Owing to the large accumulation of training data and their low cost, deep learning methods have shown great potential in determining the specificity of TF-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding site (TFBS) prediction. Convolutional operations extract local features efficiently but tend to ignore global information, while self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To discover features for a given sequence as comprehensively as possible, we propose a dual-branch model combining self-attention and convolution, dubbed DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows the convolution and the self-attention mechanism to be fully utilized to improve the predictive ability of our model. Experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning-based methods and demonstrate that our model can effectively predict TFBSs based on sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Subjects
DNA , Neural Networks, Computer , Protein Binding , Binding Sites , Transcription Factors/genetics
ABSTRACT
Cyclic AMP receptor proteins (CRPs) are important transcription regulators in many species. The prediction of CRP-binding sites was mainly based on position weight matrices (PWMs). Traditional prediction methods only considered known binding motifs, and their ability to discover inflexible binding patterns was limited. Thus, a novel CRP-binding site prediction model called CRPBSFinder was developed in this research, which combined the hidden Markov model, knowledge-based PWMs and structure-based binding affinity matrices. We trained this model using validated CRP-binding data from Escherichia coli and evaluated it with computational and experimental methods. The results show that the model not only provides higher prediction performance than a classic method but also quantitatively indicates the binding affinity of transcription factor binding sites through its prediction scores. The prediction results included not only most of the known regulated genes but also 1089 novel CRP-regulated genes. The major regulatory roles of CRPs were divided into four classes: carbohydrate metabolism, organic acid metabolism, nitrogen compound metabolism and cellular transport. Several novel functions were also discovered, including heterocycle metabolism and response to stimulus. Based on the functional similarity of homologous CRPs, we applied the model to 35 other species. The prediction tool and the prediction results are available online at: https://awi.cuhk.edu.cn/~CRPBSFinder.
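As background for the PWM component mentioned in this abstract, the following is a minimal Python sketch of building a log-odds position weight matrix from aligned sites and scanning a sequence with it. It does not reproduce CRPBSFinder, which also combines a hidden Markov model and structure-based affinity matrices; the aligned sites, the pseudocount, the uniform background, and the function names `build_pwm` and `scan` are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from equal-length sites."""
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical aligned half-sites, for illustration only
SITES = ["TGTGA", "TGTGA", "TGCGA", "TGTGA"]
PWM = build_pwm(SITES)
best_start, best_score = max(scan("AAATGTGACCC", PWM), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Higher window scores indicate closer agreement with the aligned sites; a real tool would calibrate a score threshold against validated binding data rather than simply taking the maximum.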
Subjects
Cyclic AMP Receptor Protein , Escherichia coli Proteins , Cyclic AMP Receptor Protein/genetics , Cyclic AMP Receptor Protein/chemistry , Cyclic AMP Receptor Protein/metabolism , Escherichia coli Proteins/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Binding Sites/genetics , Protein Binding/genetics
ABSTRACT
Precise targeting of transcription factor binding sites (TFBSs) is essential to comprehending transcriptional regulatory processes and investigating cellular function. Although several deep learning algorithms have been created to predict TFBSs, the models' intrinsic mechanisms and prediction results are difficult to explain, and there is still room for improvement in prediction performance. We present DeepSTF, a unique deep-learning architecture for predicting TFBSs by integrating DNA sequence and shape profiles. We use the improved transformer encoder structure for the first time in TFBS prediction. DeepSTF extracts DNA higher-order sequence features using stacked convolutional neural networks (CNNs), whereas rich DNA shape profiles are extracted by combining the improved transformer encoder structure and bidirectional long short-term memory (Bi-LSTM). Finally, the derived higher-order sequence features and representative shape profiles are integrated in the channel dimension to achieve accurate TFBS prediction. Experiments on 165 ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) datasets show that DeepSTF considerably outperforms several state-of-the-art algorithms in predicting TFBSs, and we explain the usefulness of the transformer encoder structure and the combined strategy using sequence features and shape profiles in capturing multiple dependencies and learning essential features. In addition, this paper examines the significance of DNA shape features in predicting TFBSs. The source code of DeepSTF is available at https://github.com/YuBinLab-QUST/DeepSTF/.
Subjects
DNA , Neural Networks, Computer , Binding Sites , Protein Binding , DNA/genetics , DNA/chemistry , Transcription Factors/genetics , Transcription Factors/chemistry
ABSTRACT
Genome-wide association studies (GWAS) are a powerful tool for detecting variants associated with complex traits and can help risk stratification and prevention strategies against pancreatic ductal adenocarcinoma (PDAC). However, the strict significance threshold commonly used makes it likely that many true risk loci are missed. Functional annotation of GWAS polymorphisms is a proven strategy to identify additional risk loci. We aimed to investigate single-nucleotide polymorphisms (SNPs) in regulatory regions [transcription factor binding sites (TFBSs) and enhancers] that could change the expression profile of multiple genes they act upon and thereby modify PDAC risk. We analyzed a total of 12,636 PDAC cases and 43,443 controls from PanScan/PanC4 and the East Asian GWAS (discovery populations), and the PANDoRA consortium (replication population). We identified four associations that reached study-wide statistical significance in the overall meta-analysis: rs2472632(A) (enhancer variant, OR 1.10, 95%CI 1.06,1.13, p = 5.5 × 10^-8), rs17358295(G) (enhancer variant, OR 1.16, 95%CI 1.10,1.22, p = 6.1 × 10^-7), rs2232079(T) (TFBS variant, OR 0.88, 95%CI 0.83,0.93, p = 6.4 × 10^-6) and rs10025845(A) (TFBS variant, OR 1.88, 95%CI 1.50,1.12, p = 1.32 × 10^-5). The SNP with the most significant association, rs2472632, is located in an enhancer predicted to target the coiled-coil domain containing 34 oncogene. Our results provide new insights into genetic risk factors for PDAC through a focused analysis of polymorphisms in regulatory regions and demonstrate the usefulness of functional prioritization to identify loci associated with PDAC risk.
Subjects
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Humans , Genome-Wide Association Study , Genetic Predisposition to Disease , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/epidemiology , Pancreatic Neoplasms/pathology , Carcinoma, Pancreatic Ductal/genetics , Carcinoma, Pancreatic Ductal/pathology , Regulatory Sequences, Nucleic Acid , Polymorphism, Single Nucleotide/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Binding Sites/genetics
ABSTRACT
BACKGROUND: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity as such sites are (1) highly under-represented in the genome, and (2) offer additional, tissue-specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon. METHOD: We developed MethylSeqLogo, an extension of sequence logos which includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of binding sites. Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) appropriate to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, while the absolute height of each column represents information (relative entropy) and the height of all columns added together represents total information. RESULTS: We present figures illustrating the utility of using MethylSeqLogo to summarize data from several CpG binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral. CONCLUSIONS: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.
Subjects
DNA Methylation , DNA Methylation/genetics , Binding Sites , Sequence Analysis, DNA/methods , CpG Islands , Software , Humans , Transcription Factors/genetics , Transcription Factors/metabolism , Promoter Regions, Genetic
ABSTRACT
Enhancers are critical cis-regulatory elements controlling gene expression during cell development and differentiation. However, genome-wide enhancer characterization has been challenging due to the lack of a well-defined relationship between enhancers and genes. Function-based methods are the gold standard for determining the biological function of cis-regulatory elements; however, these methods have not been widely applied to plants. Here, we applied a massively parallel reporter assay on Arabidopsis to measure enhancer activities across the genome. We identified 4327 enhancers with various combinations of epigenetic modifications distinctively different from animal enhancers. Furthermore, we showed that enhancers differ from promoters in their preference for transcription factors. Although some enhancers are not conserved and overlap with transposable elements forming clusters, enhancers are generally conserved across a thousand Arabidopsis accessions, suggesting they are under evolutionary selection pressure and could play critical roles in the regulation of important genes. Moreover, comparative analysis reveals that enhancers identified by different strategies do not overlap, suggesting these methods are complementary in nature. In sum, we systematically investigated the features of enhancers identified by functional assay in A. thaliana, which lays the foundation for further investigation into enhancers' functional mechanisms in plants.
Subjects
Arabidopsis , Animals , Arabidopsis/genetics , Enhancer Elements, Genetic/genetics , Promoter Regions, Genetic/genetics , Transcription Factors/genetics , Epigenesis, Genetic
ABSTRACT
BACKGROUND: Transcription factors (TFs) bind to different parts of the genome in different types of cells, but it is usually assumed that the inherent DNA-binding preferences of a TF are invariant to cell type. Yet, there are several known examples of TFs that switch their DNA-binding preferences in different cell types, and yet more examples of other mechanisms, such as steric hindrance or cooperative binding, that may result in a "DNA signature" of differential binding. RESULTS: To survey this phenomenon systematically, we developed a deep learning method we call SigTFB (Signatures of TF Binding) to detect and quantify cell-type specificity in a TF's known genomic binding sites. We used ENCODE ChIP-seq data to conduct a wide-scale investigation of 169 distinct TFs in up to 14 distinct cell types. SigTFB detected statistically significant DNA binding signatures in approximately two-thirds of TFs, far more than might have been expected from the relatively sparse evidence in prior literature. We found that the presence or absence of a cell-type specific DNA binding signature is distinct from, and indeed largely uncorrelated with, the degree of overlap between ChIP-seq peaks in different cell types, and that signatures tended to arise by two mechanisms: use of established motifs at different frequencies, and selective inclusion of motifs for distinct TFs. CONCLUSIONS: While recent results have highlighted cell state features such as chromatin accessibility and gene expression in predicting TF binding, our results emphasize that, for some TFs, the DNA sequences of the binding sites contain substantial cell-type specific motifs.
Subjects
Chromatin Immunoprecipitation Sequencing , DNA , Transcription Factors , Transcription Factors/metabolism , Humans , Binding Sites , DNA/metabolism , Protein Binding , Nucleotide Motifs , Deep Learning , Organ Specificity , Computational Biology/methods
ABSTRACT
BACKGROUND: Identifying the DNA-binding specificities of transcription factors (TFs) is central to understanding gene networks that regulate growth and development. Such knowledge is lacking in oomycetes, a microbial eukaryotic lineage within the stramenopile group. Oomycetes include many important plant and animal pathogens such as the potato and tomato blight agent Phytophthora infestans, which is a tractable model for studying life-stage differentiation within the group. RESULTS: Mining of the P. infestans genome identified 197 genes encoding proteins belonging to 22 TF families. Their chromosomal distribution was consistent with family expansions through unequal crossing-over, which were likely ancient since each family had similar sizes in most oomycetes. Most TFs exhibited dynamic changes in RNA levels through the P. infestans life cycle. The DNA-binding preferences of 123 proteins were assayed using protein-binding oligonucleotide microarrays, which succeeded with 73 proteins from 14 families. Binding sites predicted for representatives of the families were validated by electrophoretic mobility shift or chromatin immunoprecipitation assays. Consistent with the substantial evolutionary distance of oomycetes from traditional model organisms, only a subset of the DNA-binding preferences resembled those of human or plant orthologs. Phylogenetic analyses of the TF families within P. infestans often discriminated clades with canonical and novel DNA targets. Paralogs with similar binding preferences frequently had distinct patterns of expression suggestive of functional divergence. TFs were predicted to either drive life stage-specific expression or serve as general activators based on the representation of their binding sites within total or developmentally-regulated promoters. This projection was confirmed for one TF using synthetic and mutated promoters fused to reporter genes in vivo. CONCLUSIONS: We established a large dataset of binding specificities for P. infestans TFs, representing the first in the stramenopile group. This resource provides a basis for understanding transcriptional regulation by linking TFs with their targets, which should help delineate the molecular components of processes such as sporulation and host infection. Our work also yielded insight into TF evolution during the eukaryotic radiation, revealing both functional conservation and diversification across kingdoms.
Subjects
Evolution, Molecular , Phylogeny , Phytophthora infestans , Transcription Factors , Phytophthora infestans/genetics , Phytophthora infestans/metabolism , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Protein Binding
ABSTRACT
Changes in transcription factor binding sites (TFBSs) can alter the spatiotemporal expression pattern and transcript abundance of genes. Loss and gain of TFBSs were shown to cause shifts in expression patterns in numerous cases. However, we know little about the evolution of extended regulatory sequences incorporating many TFBSs. We compare, across the crucifers (Brassicaceae, cabbage family), the sequences between the translated regions of Arabidopsis Bsister (ABS)-like MADS-box genes (including paralogous GOA-like genes) and the next gene upstream, as an example of family-wide evolution of putative upstream regulatory regions (PURRs). ABS-like genes are essential for integument development of ovules and endothelium formation in seeds of Arabidopsis thaliana. A combination of motif-based gene ontology enrichment and reporter gene analysis using A. thaliana as a common trans-regulatory environment allows analysis of selected Brassicaceae Bsister gene PURRs. Comparison of the TFBSs of transcriptionally active ABS-like genes with those of transcriptionally largely inactive GOA-like genes shows that the number of in silico predicted TFBSs is similar between paralogs, emphasizing the importance of experimental verification for in silico characterization of TFBS activity and analysis of their evolution. Further, our data show highly conserved expression of Brassicaceae ABS-like genes almost exclusively in the chalazal region of ovules. The Arabidopsis-specific insertion of a transposable element (TE) into the ABS PURRs is required for stabilizing this spatially restricted expression, while other Brassicaceae achieve chalaza-specific expression without TE insertion. We hypothesize that the chalaza-specific expression of ABS is regulated by cis-regulatory elements provided by the TE.
Subjects
Arabidopsis Proteins , Arabidopsis , Brassica , Brassicaceae , Arabidopsis/metabolism , Brassicaceae/genetics , Brassicaceae/metabolism , DNA Transposable Elements , Arabidopsis Proteins/genetics , Seeds/genetics , Brassica/genetics , Gene Expression Regulation, Plant
ABSTRACT
Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.
Subjects
Gene Expression Regulation , Regulatory Sequences, Nucleic Acid , Humans , Gene Expression Regulation/genetics , Genome, Human/genetics , Chromosome Mapping , DNA/genetics
ABSTRACT
The Evf2 long non-coding RNA directs Dlx5/6 ultraconserved enhancer (UCE)-intrachromosomal interactions, regulating genes across a 27 Mb region on chromosome 6 in the developing mouse forebrain. Here, we show that Evf2 long-range gene repression occurs through multi-step mechanisms involving the transcription factor Sox2. Evf2 directly interacts with Sox2, antagonizing Sox2 activation of Dlx5/6UCE, and recruits Sox2 to the Dlx5/6eii shadow enhancer and key Dlx5/6UCE interaction sites. Sox2 directly interacts with Dlx1 and Smarca4, as part of the Evf2 ribonucleoprotein complex, forming spherical subnuclear domains (protein pools, PPs). Evf2 targets Sox2 PPs to one long-range repressed target gene (Rbm28), at the expense of another (Akr1b8). Evf2 and Sox2 shift Dlx5/6UCE interactions towards Rbm28, linking Evf2/Sox2 co-regulated topological control and gene repression. We propose a model that distinguishes Evf2 gene repression mechanisms at Rbm28 (Dlx5/6UCE position) and Akr1b8 (limited Sox2 availability). Genome-wide control of RNPs (Sox2, Dlx and Smarca4) shows that co-recruitment influences Sox2 DNA binding. Together, these data suggest that Evf2 organizes a Sox2 PP subnuclear domain and, through Sox2-RNP sequestration and recruitment, regulates chromosome 6 long-range UCE targeting and activity with genome-wide consequences.
Subjects
Chromosomes, Mammalian/genetics , Gene Expression Regulation, Developmental , Prosencephalon/metabolism , RNA, Long Noncoding/genetics , SOXB1 Transcription Factors/genetics , Animals , DNA Helicases/genetics , DNA Helicases/metabolism , Enhancer Elements, Genetic/genetics , Fluorescent Antibody Technique/methods , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , In Situ Hybridization, Fluorescence/methods , Mice, Knockout , Mice, Transgenic , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Prosencephalon/embryology , Protein Binding , RNA, Long Noncoding/metabolism , Ribonucleoproteins/genetics , Ribonucleoproteins/metabolism , SOXB1 Transcription Factors/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
The discovery of putative transcription factor binding sites (TFBSs) is important for understanding the underlying binding mechanism and cellular functions. Recently, many computational methods have been proposed to jointly account for DNA sequence and shape properties in TFBS prediction. However, these methods fail to fully utilize the latent features derived from both sequence and shape profiles and are limited in interpretability and knowledge discovery. To this end, we present a novel Deep Convolution Attention network combining Sequence and Shape, dubbed D-SSCA, for precisely predicting putative TFBSs. Experiments conducted on 165 ENCODE ChIP-seq datasets reveal that D-SSCA significantly outperforms several state-of-the-art methods in predicting TFBSs, and support the utility of the channel attention module for feature refinement. In addition, a thorough analysis of the contribution of five shape features to TFBS prediction demonstrates that shape features can improve the predictive power for transcription factor-DNA binding. Furthermore, D-SSCA can perform cross-cell-line prediction of TFBSs, indicating that common interplay patterns between sequence and shape occur across various cell lines. The source code of D-SSCA can be found at https://github.com/MoonLord0525/.
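The general idea of combining sequence and shape channels and reweighting them with channel attention can be illustrated with a minimal NumPy sketch. This is not the D-SSCA implementation: the abstract does not specify the architecture, so the squeeze-and-excitation style gating, the function names (`one_hot`, `channel_attention`), the hidden size, and the random placeholder values standing in for the five shape features are all assumptions made for illustration only.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, L) array; unknown bases (e.g. N) give all-zero columns."""
    out = np.zeros((4, len(seq)), dtype=np.float64)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[j, i] = 1.0
    return out

def channel_attention(features, w1, w2):
    """Hypothetical squeeze-and-excitation style gating of a (C, L) feature map.

    features: (C, L) array; w1: (H, C) and w2: (C, H) weight matrices.
    Returns the reweighted features and the per-channel gates in (0, 1).
    """
    squeezed = features.mean(axis=1)               # average each channel over positions -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # ReLU bottleneck -> (H,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate per channel -> (C,)
    return features * gates[:, None], gates

rng = np.random.default_rng(0)
seq_features = one_hot("ACGTN")                    # 4 sequence channels x 5 positions
shape_features = rng.normal(size=(5, 5))           # placeholder for five shape channels (assumed values)
combined = np.vstack([seq_features, shape_features])  # 9 channels x 5 positions

w1 = rng.normal(size=(3, 9))                       # untrained illustrative weights
w2 = rng.normal(size=(9, 3))
refined, gates = channel_attention(combined, w1, w2)
```

In a trained model the weights would be learned and the gates would indicate how much each sequence or shape channel contributes; here they are random and carry no biological meaning.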