ABSTRACT
Transcription factors (TFs) are proteins essential for regulating gene transcription by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate prediction of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, their prediction performance still needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs based solely on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model uses the pre-trained DNABERT-2 module to capture the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
Subject(s)
Neural Networks, Computer , Transcription Factors , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Algorithms , Computational Biology/methods , Humans , Deep Learning , Protein Binding
ABSTRACT
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
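The three averages reported above (ACC, AUROC, AUPRC) can each be computed from per-sequence labels and predicted scores. The following is a minimal pure-Python sketch of those metrics, not MLSNet's evaluation code. The function names, the 0.5 decision threshold, and the use of average precision as the AUPRC estimator are assumptions made for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of sequences whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative (ties count one half). Assumes both classes are present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, one common estimator of the area under the
    precision-recall curve. Assumes at least one positive label."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits


if __name__ == "__main__":
    y = [1, 1, 0, 1, 0, 0]
    s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(accuracy(y, s), auroc(y, s), average_precision(y, s))
```

Averaging these values over the individual ChIP-seq datasets would give summary figures of the kind quoted in the abstract.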
Subject(s)
Deep Learning , Transcription Factors , Transcription Factors/metabolism , Binding Sites , Algorithms , Computational Biology/methods , Humans , Chromatin Immunoprecipitation Sequencing/methods , DNA/metabolism , DNA/chemistry
ABSTRACT
Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF-binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of 3 TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here, we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.
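The null-distribution test described above can be sketched in a few lines. This is a hedged illustration only: `toy_affinity` is a hypothetical stand-in for a trained binding-affinity predictor, the motif string and sample count are arbitrary, and sequences are assumed to be aligned and of equal length. It shows the basic test, not the ascertainment-bias corrections proposed in the study.

```python
import random

BASES = "ACGT"


def toy_affinity(seq, motif="TGACTCA"):
    """Hypothetical affinity score: fraction of positions matching a motif.
    A real analysis would use a trained machine-learning predictor."""
    return sum(a == b for a, b in zip(seq, motif)) / len(motif)


def random_n_step_neighbor(ancestral, n, rng):
    """Return a copy of the ancestral sequence mutated at n distinct sites."""
    seq = list(ancestral)
    for pos in rng.sample(range(len(seq)), n):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def selection_test(ancestral, derived, affinity=toy_affinity,
                   samples=2000, seed=0):
    """One-sided empirical test for increased affinity in the derived site."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and of equal length")
    rng = random.Random(seed)
    n = sum(a != d for a, d in zip(ancestral, derived))
    base = affinity(ancestral)
    observed = affinity(derived) - base
    # Null distribution: affinity change of random n-step neighbors.
    null = [affinity(random_n_step_neighbor(ancestral, n, rng)) - base
            for _ in range(samples)]
    # Add-one correction keeps the empirical p-value strictly positive.
    p_value = (1 + sum(x >= observed for x in null)) / (samples + 1)
    return n, observed, p_value


if __name__ == "__main__":
    print(selection_test("TGTCTGA", "TGACTCA"))
```

Because the test conditions only on the ancestral sequence and n, any bias in which sites are included (for example, only sites that bind detectably today) carries through to the p-values, which is the concern the abstract raises.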
Subject(s)
Genomics , Transcription Factors , Humans , Transcription Factors/metabolism , Computer Simulation , Binding Sites/genetics , Protein Binding
ABSTRACT
Interactions between DNA and transcription factors (TFs) are essential to understanding transcriptional regulation mechanisms and gene expression. Owing to the large accumulation of training data and their low cost, deep learning methods have shown great potential in determining the specificity of TF-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for predicting transcription factor binding sites (TFBSs). Convolutional operations extract local features efficiently but tend to ignore global information, while self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To capture the features of a given sequence as comprehensively as possible, we propose a Dual-branch model combining Self-Attention and Convolution, dubbed DSAC, which fuses local features and global representations interactively. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows both convolution and self-attention to be fully utilized to improve the predictive ability of the model. Experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning-based methods and demonstrate that the model can effectively predict TFBSs based on sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Subject(s)
DNA , Neural Networks, Computer , Protein Binding , Binding Sites , Transcription Factors/genetics
ABSTRACT
Precise targeting of transcription factor binding sites (TFBSs) is essential to comprehending transcriptional regulatory processes and investigating cellular function. Although several deep learning algorithms have been created to predict TFBSs, the models' intrinsic mechanisms and prediction results are difficult to explain, and there is still room for improvement in prediction performance. We present DeepSTF, a unique deep-learning architecture for predicting TFBSs by integrating DNA sequence and shape profiles. This is the first use of an improved transformer encoder structure in TFBS prediction. DeepSTF extracts higher-order DNA sequence features using stacked convolutional neural networks (CNNs), whereas rich DNA shape profiles are extracted by combining the improved transformer encoder structure with bidirectional long short-term memory (Bi-LSTM). Finally, the derived higher-order sequence features and representative shape profiles are integrated along the channel dimension to achieve accurate TFBS prediction. Experiments on 165 ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) datasets show that DeepSTF considerably outperforms several state-of-the-art algorithms in predicting TFBSs, and we explain the usefulness of the transformer encoder structure and of the combined strategy using sequence features and shape profiles in capturing multiple dependencies and learning essential features. In addition, this paper examines the significance of DNA shape features in predicting TFBSs. The source code of DeepSTF is available at https://github.com/YuBinLab-QUST/DeepSTF/.
Subject(s)
DNA , Neural Networks, Computer , Binding Sites , Protein Binding , DNA/genetics , DNA/chemistry , Transcription Factors/genetics , Transcription Factors/chemistry
ABSTRACT
BACKGROUND: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity as such sites are (1) highly under-represented in the genome, and (2) offer additional, tissue-specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon. METHOD: We developed MethylSeqLogo, an extension of sequence logos which includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of binding sites. Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) appropriate to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, while the absolute height of each column represents information (relative entropy) and the height of all columns added together represents total information. RESULTS: We present figures illustrating the utility of using MethylSeqLogo to summarize data from several CpG binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral. CONCLUSIONS: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication quality figures.
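The sequence logo semantics described above (letter height proportional to frequency, column height equal to relative entropy, total information equal to the sum over columns) can be illustrated with a short sketch. It is not MethylSeqLogo code: it ignores methylation, dimers and strand information, the function names are hypothetical, and a uniform nucleotide background is assumed unless one is supplied.

```python
import math
from collections import Counter


def column_heights(column, background=None):
    """Return (column information in bits, per-letter heights) for one
    aligned column of binding-site nucleotides."""
    background = background or {b: 0.25 for b in "ACGT"}  # assumed uniform
    counts = Counter(column)
    freqs = {b: counts.get(b, 0) / len(column) for b in "ACGT"}
    # Relative entropy of observed frequencies versus the background.
    info = sum(p * math.log2(p / background[b])
               for b, p in freqs.items() if p > 0)
    # Each letter takes a share of the column height equal to its frequency.
    return info, {b: p * info for b, p in freqs.items()}


def logo(sites, background=None):
    """Compute heights for every column of equal-length aligned sites."""
    return [column_heights("".join(col), background) for col in zip(*sites)]


if __name__ == "__main__":
    sites = ["CACGTG", "CACGTG", "CACGTG", "CATGTG"]
    columns = logo(sites)
    print("total information:", sum(info for info, _ in columns))
```

With a non-uniform background, such as promoter-specific nucleotide frequencies, the same formula gives more weight to letters that are rare in the background, which is the motivation for choosing the background by genome region.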
Subject(s)
DNA Methylation , DNA Methylation/genetics , Binding Sites , Sequence Analysis, DNA/methods , CpG Islands , Software , Humans , Transcription Factors/genetics , Transcription Factors/metabolism , Promoter Regions, Genetic
ABSTRACT
Enhancers are critical cis-regulatory elements controlling gene expression during cell development and differentiation. However, genome-wide enhancer characterization has been challenging due to the lack of a well-defined relationship between enhancers and genes. Function-based methods are the gold standard for determining the biological function of cis-regulatory elements; however, these methods have not been widely applied to plants. Here, we applied a massively parallel reporter assay in Arabidopsis to measure enhancer activities across the genome. We identified 4327 enhancers with various combinations of epigenetic modifications distinctively different from those of animal enhancers. Furthermore, we showed that enhancers differ from promoters in their preference for transcription factors. Although some enhancers are not conserved and overlap with transposable elements forming clusters, enhancers are generally conserved across a thousand Arabidopsis accessions, suggesting they are selected under evolutionary pressure and could play critical roles in the regulation of important genes. Moreover, comparative analysis reveals that enhancers identified by different strategies do not overlap, suggesting these methods are complementary in nature. In sum, we systematically investigated the features of enhancers identified by functional assay in A. thaliana, which lays the foundation for further investigation into enhancers' functional mechanisms in plants.
Subject(s)
Arabidopsis , Animals , Arabidopsis/genetics , Enhancer Elements, Genetic/genetics , Promoter Regions, Genetic/genetics , Transcription Factors/genetics , Epigenesis, Genetic
ABSTRACT
Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.
Subject(s)
Gene Expression Regulation , Regulatory Sequences, Nucleic Acid , Humans , Gene Expression Regulation/genetics , Genome, Human/genetics , Chromosome Mapping , DNA/genetics
ABSTRACT
The discovery of putative transcription factor binding sites (TFBSs) is important for understanding the underlying binding mechanism and cellular functions. Recently, many computational methods have been proposed to jointly account for DNA sequence and shape properties in TFBS prediction. However, these methods fail to fully utilize the latent features derived from both sequence and shape profiles and are limited in interpretability and knowledge discovery. To this end, we present a novel Deep Convolution Attention network combining Sequence and Shape, dubbed D-SSCA, for precisely predicting putative TFBSs. Experiments conducted on 165 ENCODE ChIP-seq datasets reveal that D-SSCA significantly outperforms several state-of-the-art methods in predicting TFBSs, and justify the utility of the channel attention module for feature refinement. In addition, a thorough analysis of the contribution of five shapes to TFBS prediction demonstrates that shape features can improve the predictive power for transcription factor-DNA binding. Furthermore, D-SSCA can perform cross-cell-line prediction of TFBSs, indicating that common interplay patterns involving both sequence and shape occur across various cell lines. The source code of D-SSCA can be found at https://github.com/MoonLord0525/.
Subject(s)
Binding Sites , Computational Biology/methods , DNA-Binding Proteins/chemistry , Transcription Factors/chemistry , Algorithms , Chromatin Immunoprecipitation Sequencing , DNA/chemistry , Humans , Neural Networks, Computer , Protein Binding , Software , Transcription Factors/metabolism
ABSTRACT
Prediction of binding sites for transcription factors is important for understanding how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable body of literature addresses this issue with different approaches, machine learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method can predict TFBSs within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction performance across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.
Subject(s)
Deep Learning , Transcription Factors , Humans , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Computational Biology/methods , HeLa Cells , Protein Binding , Chromatin Immunoprecipitation Sequencing/methods , Cell Line
ABSTRACT
Endometritis is an inflammatory disease that negatively influences fertility and is common in milk-producing cows. An in vitro model for bovine endometrial inflammation was used to identify enrichment of cis-acting regulatory elements in differentially methylated regions (DMRs) in the genome of in vitro-cultured primary bovine endometrial epithelial cells (bEECs) before and after treatment with lipopolysaccharide (LPS) from E. coli, a key player in the development of endometritis. The enriched regulatory elements contain binding sites for transcription factors with established roles in inflammation and hypoxia, including NFKB and Hif-1α. We further showed co-localization of certain enriched cis-acting regulatory motifs, including ARNT, Hif-1α, and NRF1. Our results show an intriguing interplay: LPS-treated bEECs had increased levels of the mRNAs encoding key transcription factors such as AHR, EGR2, and STAT1, whose binding sites were enriched in the DMRs. Our results demonstrate an extraordinary cis-regulatory complexity in these DMRs, which have binding sites for both inflammatory and hypoxia-dependent transcription factors. Data obtained using this in vitro model for bacterial-induced endometrial inflammation have provided valuable information regarding key transcription factors relevant for clinical endometritis in both cattle and humans.
Subject(s)
DNA Methylation , Endometrium , Epithelial Cells , Lipopolysaccharides , Cattle , Animals , Female , Epithelial Cells/metabolism , Epithelial Cells/drug effects , Endometrium/metabolism , Endometritis/metabolism , Endometritis/genetics , Binding Sites , Cells, Cultured , Transcription Factors/metabolism , Transcription Factors/genetics , Regulatory Sequences, Nucleic Acid
ABSTRACT
Chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) is a central genome-wide method for in vivo analyses of DNA-protein interactions in various cellular conditions. Numerous studies have demonstrated the complex contextual organization of ChIP-seq peak sequences and the presence of binding sites for transcription factors in them. We assessed the dependence of the ChIP-seq peak score on the presence of different contextual signals in the peak sequences by analyzing these sequences from several ChIP-seq experiments using our fully enumerative GPU-based de novo motif discovery method, Argo_CUDA. Analysis revealed sets of significant IUPAC motifs corresponding to the binding sites of the target and partner transcription factors. For these ChIP-seq experiments, multiple regression models were constructed, demonstrating a significant dependence of the peak scores on the presence in the peak sequences of not only highly significant target motifs but also less significant motifs corresponding to the binding sites of the partner transcription factors. A significant correlation was shown between the presence of the target motifs FOXA2 and the partner motifs HNF4G, which found experimental confirmation in the scientific literature, demonstrating the important contribution of the partner transcription factors to the binding of the target transcription factor to DNA and, consequently, their important contribution to the peak score.
Subject(s)
Chromatin Immunoprecipitation Sequencing , Transcription Factors , Chromatin Immunoprecipitation , Sequence Analysis, DNA , Transcription Factors/genetics , DNA/genetics
ABSTRACT
Hibernation consists of alternating torpor-arousal phases, during which animals cope with repetitive hypothermia and ischaemia-reperfusion. Due to limited transcriptomic and methylomic information for facultative hibernators, we here conducted RNA and whole-genome bisulfite sequencing in the liver of hibernating Syrian hamsters (Mesocricetus auratus). Gene ontology analysis was performed on 844 differentially expressed genes and confirmed the shift in metabolic fuel utilization, inhibition of RNA transcription and cell cycle regulation as found in seasonal hibernators. Additionally, we showed a so far unreported suppression of mitogen-activated protein kinase (MAPK) and protein phosphatase 1 pathways during torpor. Notably, hibernating hamsters showed upregulation of MAPK inhibitors (dual-specificity phosphatases and sproutys) and reduced levels of MAPK-induced transcription factors (TFs). Promoter methylation was found to modulate the expression of genes targeted by these TFs. In conclusion, we document gene regulation between hibernation phases, which may aid the identification of pathways and targets to prevent organ damage in transplantation or ischaemia-reperfusion.
Subject(s)
Hibernation , Transcriptome , Animals , Cricetinae , Mesocricetus , Liver , Gene Expression Profiling
ABSTRACT
Transcription factors (TFs) are essential proteins in regulating the spatiotemporal expression of genes. It is crucial to infer potential transcription factor binding sites (TFBSs) with high resolution to advance biology and realize precision medicine. Recently, deep learning-based models have shown exemplary performance in the prediction of TFBSs at the base-pair level. However, previous models fail to integrate nucleotide position information and semantic information without noisy responses. Thus, there is still room for improvement. Moreover, both the inner mechanism and prediction results of these models are challenging to interpret. To this end, the Deep Attentive Encoder-Decoder Neural Network (D-AEDNet) is developed to identify the location of TF-DNA binding sites in DNA sequences. In particular, our model adopts Skip Architecture to leverage the nucleotide position information in the encoder and removes noisy responses in the information fusion process with Attention Gate. In addition, the Transcription Factor Motif Discovery based on Sliding Window (TF-MoDSW), an approach to discover TF-DNA binding motifs by utilizing the output of neural networks, is proposed to understand the biological meaning of the predicted result. On ChIP-exo datasets, experimental results show that D-AEDNet performs better than competing methods. We also verify, by means of visualization analysis, that Attention Gate can improve the interpretability of our model. Furthermore, by conducting experiments on ChIP-seq datasets, we confirm that the ability of D-AEDNet to learn TF-DNA binding motifs surpasses that of state-of-the-art methods and that TF-MoDSW can discover biological sequence motifs in TF-DNA interactions.
Subject(s)
Deep Learning , Transcription Factors/metabolism , Binding Sites , Chromatin Immunoprecipitation , Computational Biology/methods , Protein Binding
ABSTRACT
With the accumulation of ChIP-seq data, convolutional neural network (CNN)-based methods have been proposed for predicting transcription factor binding sites (TFBSs). However, biological experimental data are noisy, yet are often treated as ground truth for both training and testing. In particular, existing classification methods ignore the false positives and false negatives caused by errors in the peak calling stage, and therefore they can easily overfit to biased training data. This leads to inaccurate identification and an inability to reveal the rules governing protein-DNA binding. To address this issue, we propose a meta learning-based CNN method (namely TFBS_MLCNN, or MLCNN for short) for suppressing the influence of noisy labels and accurately recognizing TFBSs from ChIP-seq data. Guided by a small amount of unbiased meta-data, MLCNN can adaptively learn an explicit weighting function from ChIP-seq data and update the classifier parameters simultaneously. The weighting function overcomes the influence of biased training data on the classifier by assigning a weight to each sample according to its training loss. The experimental results on 424 ChIP-seq datasets show that MLCNN not only outperforms other existing state-of-the-art CNN methods, but can also detect noisy samples, which are assigned small weights to suppress them. The ability to suppress noisy samples can be revealed through the visualization of the samples' weights. Several case studies demonstrate that MLCNN has superior performance to other methods.
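The idea of assigning each training sample a weight according to its loss can be illustrated with a simplified sketch. Note the substitution: MLCNN learns its weighting function from unbiased meta-data, whereas the code below uses a fixed, hand-chosen sigmoid purely to show how loss-dependent weights suppress suspected noisy labels. The function names and the `midpoint` and `steepness` parameters are hypothetical.

```python
import math


def bce(y, p, eps=1e-7):
    """Binary cross-entropy of one sample with label y and predicted prob p."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def loss_to_weight(loss, midpoint=1.0, steepness=4.0):
    """Fixed decreasing map from loss to a weight in (0, 1).
    Stand-in for a learned weighting function (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(steepness * (loss - midpoint)))


def weighted_loss(labels, probs):
    """Return the weight-normalized batch loss and the per-sample weights."""
    losses = [bce(y, p) for y, p in zip(labels, probs)]
    weights = [loss_to_weight(l) for l in losses]
    total = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
    return total, weights


if __name__ == "__main__":
    # The third sample is confidently contradicted by the model, as a
    # mislabeled peak might be, so it receives a very small weight.
    loss, weights = weighted_loss([1, 0, 1], [0.9, 0.1, 0.05])
    print(loss, weights)
```

A fixed rule like this cannot distinguish noisy samples from merely hard ones, which is one reason to learn the weighting function from a small trusted set instead.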
Subject(s)
Chromatin Immunoprecipitation Sequencing , Neural Networks, Computer , Binding Sites , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Lung cancer is a severe disease and the leading cause of cancer-related deaths worldwide. Recurrent h-TERT promoter mutations have been implicated in various cancer types. The present study therefore analyzes h-TERT promoter mutations in North Indian lung carcinoma patients. A total of 20 histopathologically and clinically confirmed cases of lung cancer were enrolled in this study. Genomic DNA was extracted from venous blood and amplified using appropriate h-TERT promoter primers. The amplified PCR products were subjected to Sanger sequencing to identify novel h-TERT mutations. The identified h-TERT promoter mutations were then analysed to predict their pathophysiological consequences using bioinformatics tools such as Tfsitescan and CIIDER. The average age of the patients was 45 ± 8 years, which is categorized as early-onset lung cancer, and male patients predominated by 5.6-fold. Interestingly, h-TERT promoter mutations were highly frequent in lung cancer. The identified mutations include c. G272A, c. T122A, c. C150A, c. 123 del C, c. C123T, c. G105A, c. 107 Ins A, and c. 276 del C, corresponding to -168 G>A, -18 T>A, -46 C>A, -19 del C, -19 C>T, -1 G>A, -3 Ins A, and -172 del C, respectively, from the translation start site in the promoter of the telomerase reverse transcriptase gene; these are reported here for the first time in the germline genome of lung cancer patients. Strikingly, c. -18 T>A [c. T122A] was the most prevalent variant, with 75% frequency. Other mutations, viz. c. -168 G>A [c. G272A] and c. -1 G>A [c. G105A], were found at 35% and 15% frequency, respectively, whilst the rest of the mutations were present at 10% and 5% frequency. Additionally, bioinformatics analysis revealed that these mutations can lead to either loss or gain of various transcription factor binding sites in the h-TERT promoter region. Hence, these mutations may play a pivotal role in h-TERT gene expression.
Taken together, these novel promoter mutations may alter the epigenetics and, subsequently, various transcription factor binding sites of great functional significance. It is therefore plausible that these germline mutations act either as predisposing factors or as direct participants in the pathophysiology of lung cancer through entangled molecular mechanisms.
ABSTRACT
BACKGROUND: The mouse is probably the most important model organism for studying mammalian biology and human diseases. A better understanding of the mouse genome will help in understanding the human genome, biology and diseases. However, despite recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use for understanding the regulatory sequences in the human genome. RESULTS: Here, by integrating binding peaks in ~ 9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding site (TFBS) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory, while the non-CRMCs are largely selectively neutral, indicating that they are unlikely to be cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. CONCLUSIONS: Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Subject(s)
Genome, Human , Regulatory Elements, Transcriptional , Humans , Mice , Animals , Regulatory Elements, Transcriptional/genetics , Binding Sites/genetics , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism , Mammals/genetics
ABSTRACT
BACKGROUND: The identification of open chromatin regions and transcription factor binding sites (TFBs) is an important step in understanding the regulation of gene expression in diverse species. ATAC-seq is a technique used for this purpose; it provides high-resolution measurements of chromatin accessibility revealed through integration of Tn5 transposase. However, the cell walls of filamentous fungi and the associated difficulty in purifying nuclei have precluded the routine application of this technique, leading to a lack of experimentally determined and computationally inferred data on the identity of genome-wide cis-regulatory elements (CREs) and TFBs. In this study, we constructed an ATAC-seq platform suitable for filamentous fungi and generated ATAC-seq libraries of Aspergillus niger and Aspergillus oryzae grown under a variety of conditions. RESULTS: We applied the ATAC-seq assay for filamentous fungi to delineate syntenic orthologous regions and regions with differentially changed chromatin accessibility among different Aspergillus species, under different culture conditions, and among specific TF-deleted strains. The syntenic orthologues of accessible regions were responsible for conserved functions across Aspergillus species, while regions that changed differentially between culture conditions and TF mutants drove differential gene expression programs. Importantly, we propose a criterion for determining TFBs through the analysis of unbalanced cleavage of distinct TF-bound DNA strands by Tn5 transposase. Based on this criterion, we constructed data libraries of the in vivo genomic footprint of A. niger under distinct conditions, and generated a database of novel transcription factor binding motifs through comparison of footprints in TF-deleted strains. Furthermore, we validated the novel TFBs in vivo through an artificial synthetic minimal promoter system.
CONCLUSIONS: We characterized the chromatin accessibility regions of filamentous fungi species, and identified a complete TFBs map by ATAC-seq, which provides valuable data for future analyses of transcriptional regulation in filamentous fungi.
Subject(s)
Chromatin , High-Throughput Nucleotide Sequencing , Aspergillus/genetics , Binding Sites , Chromatin/genetics , Genome, Fungal , Sequence Analysis, DNA , Transcription Factors/genetics
ABSTRACT
(1) Background: The widespread application of ChIP-seq technology requires annotation of cis-regulatory modules through the search for co-occurring motifs. (2) Methods: We present the web server Motifs Co-Occurrence Tool (Web-MCOT), which, for a single ChIP-seq dataset, detects composite elements (CEs), that is, overrepresented homo- and heterotypic pairs of motifs with spacers and overlaps and with any mutual orientations, uncovering various similarities to recognition models within pairs of motifs. The first (Anchor) motif in a CE corresponds to the target transcription factor of the ChIP-seq experiment, while the second one (Partner) can be defined either by the user or by a public library of Partner motifs being processed. (3) Results: Web-MCOT computes the significance of CEs without reference to motif conservation and of those with more conserved Partner and Anchor motifs. Graphic results show histograms of CE abundance depending on the orientations of motifs and on overlap and spacer lengths; logos of the most common CE structural types with an overlap of motifs; and heatmaps depicting the abundance of CEs with one motif possessing higher conservation than the other. (4) Conclusions: The novel capacities of Web-MCOT allow maximal information on the co-occurrence of motifs to be retrieved from a single ChIP-seq dataset and facilitate the planning of subsequent ChIP-seq experiments.
Subject(s)
Chromatin Immunoprecipitation Sequencing , Transcription Factors , Binding Sites , Chromatin Immunoprecipitation/methods , Transcription Factors/genetics
ABSTRACT
BACKGROUND: Transcription factors (TFs) bind specifically to TF binding sites (TFBSs) at cis-regulatory regions to control transcription. It is critical to locate these TF-DNA interactions to understand transcriptional regulation. Efforts to predict bona fide TFBSs benefit from the availability of experimental data mapping DNA binding regions of TFs (chromatin immunoprecipitation followed by sequencing - ChIP-seq). RESULTS: In this study, we processed ~ 10,000 public ChIP-seq datasets from nine species to provide high-quality TFBS predictions. After quality control, this culminated in the prediction of ~ 56 million TFBSs with experimental and computational support for direct TF-DNA interactions for 644 TFs in > 1000 cell lines and tissues. These TFBSs were used to predict > 197,000 cis-regulatory modules representing clusters of binding events in the corresponding genomes. The high quality of the TFBSs was reinforced by their evolutionary conservation, enrichment at active cis-regulatory regions, and capacity to predict combinatorial binding of TFs. Further, we confirmed that the cell type and tissue specificity of enhancer activity was correlated with the number of TFs with binding sites predicted in these regions. All the data are provided to the community through the UniBind database, which can be accessed through its web interface ( https://unibind.uio.no/ ), a dedicated RESTful API, and as genomic tracks. Finally, we provide an enrichment tool, available as a web service and an R package, for users to find TFs with enriched TFBSs in a set of provided genomic regions. CONCLUSIONS: UniBind is the first resource of its kind, providing the largest collection of high-confidence direct TF-DNA interactions in nine species.