ABSTRACT
Non-coding variants associated with complex traits can alter the motifs of transcription factor (TF)-DNA binding. Although many computational models have been developed to predict the effects of non-coding variants on TF binding, their predictive power has not been systematically evaluated. Here we have evaluated 14 different models built on position weight matrices (PWMs), support vector machines, ordinary least squares and deep neural networks (DNNs), using large-scale in vitro (i.e. SNP-SELEX) and in vivo (i.e. allele-specific binding, ASB) TF binding data. Our results show that the accuracy of each model in predicting SNP effects in vitro significantly exceeds that achieved in vivo. For in vitro variant impact prediction, kmer/gkm-based machine learning methods (deltaSVM_HT-SELEX, QBiC-Pred) trained on in vitro datasets exhibit the best performance. For in vivo ASB variant prediction, DNN-based multitask models (DeepSEA, Sei, Enformer) trained on the ChIP-seq dataset exhibit relatively superior performance. Among the PWM-based methods, tRap demonstrates better performance in both in vitro and in vivo evaluations. In addition, we find that TF classes such as basic leucine zipper factors are predicted more accurately, whereas those such as C2H2 zinc finger factors are predicted less accurately, in line with the evolutionary conservation of these TF classes. We also underscore the significance of non-sequence factors such as cis-regulatory element type, TF expression, interactions and post-translational modifications in influencing the in vivo predictive performance of TFs. Our research provides valuable insights into selecting prioritization methods for non-coding variants and further optimizing such models.
Subjects
Single Nucleotide Polymorphism, Transcription Factors, Binding Sites/genetics, Protein Binding/genetics, Transcription Factors/genetics, Transcription Factors/metabolism, DNA/genetics
ABSTRACT
Chromatin features can reveal tissue-specific TF-DNA binding, which leads to a better understanding of many critical physiological processes. Accurately identifying TF-DNA binding events and relating them to chromatin features is a long-standing goal in bioinformatics. However, this goal has remained elusive because of complex binding mechanisms and heterogeneity among inputs. Here, we have developed GHTNet (General Hybrid Transformer Network), a transformer-based model to predict TF-DNA binding specificity. GHTNet decodes the relationship between tissue-specific TF-DNA binding and chromatin features via a specific input scheme of alternative inputs and reveals important gene regions and tissue-specific motifs. Our experiments show that GHTNet achieves about a 5% absolute improvement over existing methods. The TF-DNA binding mechanism analysis shows that the importance of TF-DNA binding features varies across tissues. The best predictor is based on the DNA sequence, followed by epigenomics and shape. In addition, cross-species studies help address the limited availability of data, suggesting a way forward when data are scarce. Moreover, GHTNet is applied to interpret the relationship among TFs, chromatin features, and diseases associated with AD46 tissue. This paper demonstrates that GHTNet is an accurate and robust framework for deciphering tissue-specific TF-DNA binding and interpreting non-coding regions.
Subjects
Chromatin, Transcription Factors, Chromatin/genetics, Binding Sites/genetics, Transcription Factors/genetics, Protein Binding, DNA/genetics, DNA/metabolism
ABSTRACT
Transcription factors (TFs) can regulate gene expression by recognizing specific cis-regulatory elements in DNA sequences. TF-DNA binding prediction has become a fundamental step in comprehending the underlying cis-regulation mechanism. Because whether a particular genome region is bound depends on multiple features, such as the arrangement of nucleotides, DNA shape, and epigenetic mechanisms, many researchers have attempted to develop computational methods to predict TF binding sites (TFBSs) based on various genomic features. This paper provides a comprehensive compendium to better understand TF-DNA binding from genomic features. We first summarize the commonly used datasets and data processing methods. Subsequently, we classify current deep learning methods for TFBS prediction according to the genomic features they use and analyze the strengths and weaknesses of each technique. Furthermore, we illustrate how the functional consequences of TF-DNA binding can be characterized by prioritizing noncoding variants in identified motif instances. Finally, the challenges and opportunities of deep learning in TF-DNA binding prediction are discussed. This survey can offer valuable insights for researchers studying the modeling of TF-DNA binding.
Subjects
Computational Biology, Genomics, Binding Sites, Computational Biology/methods, DNA/chemistry, DNA/genetics, Nucleotides/metabolism, Protein Binding, Transcription Factors/chemistry, Transcription Factors/genetics, Transcription Factors/metabolism
ABSTRACT
The position weight matrix, also called the position-specific scoring matrix, is the commonly accepted model to quantify the specificity of transcription factor binding to DNA. Position weight matrices are used in thousands of projects and software tools in regulatory genomics, including computational prediction of the regulatory impact of single-nucleotide variants. Yet, recently Yan et al. reported that "the position weight matrices of most transcription factors lack sufficient predictive power" if applied to the analysis of regulatory variants studied with a newly developed experimental method, SNP-SELEX. Here, we re-analyze the rich experimental dataset obtained by Yan et al. and show that appropriately selected position weight matrices in fact can adequately quantify transcription factor binding to alternative alleles.
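To make the idea of scoring alternative alleles with a position weight matrix concrete, the following is a minimal sketch only. The count matrix, the pseudocount, the uniform background, and the function names (`build_pwm`, `best_score`, `allele_delta`) are all hypothetical and chosen for illustration; they are not taken from the re-analysis described above, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# Not from any real TF; the consensus is ACGT.
COUNTS = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 18, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 18, "T": 1},
    {"A": 0, "C": 1, "G": 1, "T": 18},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def best_score(pwm, seq):
    """Best log-odds score over all forward-strand windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def allele_delta(pwm, ref_seq, alt_seq):
    """Score difference (alt minus ref); negative means predicted weaker binding."""
    return best_score(pwm, alt_seq) - best_score(pwm, ref_seq)


PWM = build_pwm(COUNTS)
# Toy SNP: G>A inside the motif instance, which disrupts the consensus.
DELTA = allele_delta(PWM, "TTACGTTT", "TTACATTT")
print(f"delta score (alt - ref): {DELTA:.2f}")
```

Taking the best window per allele, rather than a fixed offset, lets the score pick up a shifted site created by the variant; whether that is the right choice depends on the assay and is an assumption of this sketch.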
Subjects
Software, Transcription Factors, Binding Sites/genetics, Position-Specific Scoring Matrices, Protein Binding, Transcription Factors/genetics, Transcription Factors/metabolism
ABSTRACT
Transcription factors (TFs) bind DNA in a sequence-specific manner and thereby regulate target gene expression. TF binding and its regulatory activity are highly context dependent: they are determined not only by specific cell types or differentiation stages but also by other regulatory mechanisms, such as DNA and chromatin modifications. Interactions between TFs and their DNA binding sites are critical mediators of phenotypic variation and play important roles in the onset of disease. A continuously growing number of studies therefore attempt to elucidate TF:DNA interactions to gain knowledge about regulatory mechanisms and disease-causing variants. Here we summarize how TF-binding characteristics and the impact of variants can be investigated, how bioinformatic tools can be used to analyze and predict TF:DNA binding, and what additional information can be obtained from the TF protein structure.
ABSTRACT
Many recent studies have emphasized the importance of genetic variants and mutations in cancer and other complex human diseases. The overwhelming majority of these variants occur in non-coding portions of the genome, where they can have a functional impact by disrupting regulatory interactions between transcription factors (TFs) and DNA. Here, we present a method for assessing the impact of non-coding mutations on TF-DNA interactions, based on regression models of DNA-binding specificity trained on high-throughput in vitro data. We use ordinary least squares (OLS) to estimate the parameters of the binding model for each TF, and we show that our predictions of TF-binding changes due to DNA mutations correlate well with measured changes in gene expression. In addition, by leveraging distributional results associated with OLS estimation, for each predicted change in TF binding we also compute a normalized score (z-score) and a significance value (p-value) reflecting our confidence that the mutation affects TF binding. We use this approach to analyze a large set of pathogenic non-coding variants, and we show that these variants lead to significant differences in TF binding between alleles, compared to a control set of common variants. Thus, our results indicate that there is a strong regulatory component to the pathogenic non-coding variants identified thus far.
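The OLS-based scoring described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the 2-mer count features, the synthetic training data, the normal approximation used for the p-value, and the names `featurize`, `fit_ols`, and `binding_change` are all assumptions. The predicted change is the feature difference between alleles multiplied by the fitted coefficients, and its standard error comes from the usual OLS covariance estimate.

```python
import itertools
import math

import numpy as np

# Hypothetical feature set: counts of each of the 16 possible 2-mers.
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=2)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def featurize(seq):
    """Count every overlapping 2-mer in the sequence."""
    x = np.zeros(len(KMERS))
    for i in range(len(seq) - 1):
        x[INDEX[seq[i:i + 2]]] += 1
    return x


def fit_ols(seqs, y):
    """Fit y = X beta by OLS; return beta, (X'X)^-1 and the residual variance."""
    X = np.array([featurize(s) for s in seqs])
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.pinv(X.T @ X)  # pseudo-inverse guards against collinearity
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = float(resid @ resid) / dof
    return beta, xtx_inv, sigma2


def binding_change(model, ref_seq, alt_seq):
    """Predicted change (alt minus ref), z-score, and two-sided p-value."""
    beta, xtx_inv, sigma2 = model
    d = featurize(alt_seq) - featurize(ref_seq)
    delta = float(d @ beta)
    se = math.sqrt(sigma2 * float(d @ xtx_inv @ d))
    if se == 0.0:  # identical alleles: no change, no evidence
        return 0.0, 0.0, 1.0
    z = delta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # normal approximation (assumption)
    return delta, z, p


# Synthetic toy data (not real measurements): signal rises with "CG" count.
rng = np.random.default_rng(0)
TRAIN = ["".join(rng.choice(list("ACGT"), size=12)) for _ in range(300)]
SIGNAL = [2.0 * featurize(s)[INDEX["CG"]] + rng.normal(0, 0.5) for s in TRAIN]
MODEL = fit_ols(TRAIN, SIGNAL)

# Toy mutation that removes the single CG dinucleotide.
DELTA, Z, P = binding_change(MODEL, "AAAACGAAAAAA", "AAAACAAAAAAA")
print(f"delta={DELTA:.2f}  z={Z:.1f}  p={P:.2e}")
```

Because the score is a linear contrast of the coefficients, its variance follows directly from the OLS covariance matrix, which is what makes a per-variant z-score and p-value available without resampling.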