ABSTRACT
Predicting protein-DNA binding specificity is a challenging yet essential task for understanding gene regulation. Protein-DNA complexes usually exhibit binding to a selected DNA target site, whereas a protein binds, with varying degrees of binding specificity, to a wide range of DNA sequences. This information is not directly accessible in a single structure. Here, to access this information, we present Deep Predictor of Binding Specificity (DeepPBS), a geometric deep-learning model designed to predict binding specificity from protein-DNA structure. DeepPBS can be applied to experimental or predicted structures. Interpretable protein heavy atom importance scores for interface residues can be extracted. When aggregated at the protein residue level, these scores are validated through mutagenesis experiments. Applied to designed proteins targeting specific DNA sequences, DeepPBS was demonstrated to predict experimentally measured binding specificity. DeepPBS offers a foundation for machine-aided studies that advance our understanding of molecular interactions and guide experimental designs and synthetic biology.
Subjects
DNA-Binding Proteins , DNA , Deep Learning , Protein Binding , DNA/metabolism , DNA/chemistry , DNA-Binding Proteins/metabolism , DNA-Binding Proteins/chemistry , Binding Sites , Computational Biology/methods , Molecular Models
ABSTRACT
Development of the malaria parasite, Plasmodium falciparum, is regulated by a limited number of sequence-specific transcription factors (TFs). However, the mechanisms by which these TFs recognize genome-wide binding sites are largely unknown. To address TF specificity, we investigated the binding of two TF subsets that either bind CACACA or GTGCAC DNA sequence motifs and further characterized two additional ApiAP2 TFs, PfAP2-G and PfAP2-EXP, which bind unique DNA motifs (GTAC and TGCATGCA). We also interrogated the impact of DNA sequence and chromatin context on P. falciparum TF binding by integrating high-throughput in vitro and in vivo binding assays, DNA shape predictions, epigenetic post-translational modifications, and chromatin accessibility. We found that DNA sequence context minimally impacts binding site selection for paralogous CACACA-binding TFs, while chromatin accessibility, epigenetic patterns, co-factor recruitment, and dimerization correlate with differential binding. In contrast, GTGCAC-binding TFs prefer different DNA sequence context in addition to chromatin dynamics. Finally, we determined that TFs that preferentially bind divergent DNA motifs may bind overlapping genomic regions due to low-affinity binding to other sequence motifs. Our results demonstrate that TF binding site selection relies on a combination of DNA sequence and chromatin features, thereby contributing to the complexity of P. falciparum gene regulatory mechanisms.
Subjects
Chromatin , Nucleotide Motifs , Plasmodium falciparum , Protein Binding , Protozoan Proteins , Transcription Factors , Plasmodium falciparum/genetics , Plasmodium falciparum/metabolism , Chromatin/metabolism , Chromatin/genetics , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Humans , Protozoan Proteins/metabolism , Protozoan Proteins/genetics , Protozoan Proteins/chemistry , Falciparum Malaria/parasitology , Base Sequence , DNA/metabolism , DNA/chemistry , Genetic Epigenesis , Protozoan DNA/metabolism , Protozoan DNA/genetics
ABSTRACT
DNA-binding proteins play important roles in various cellular processes, but the mechanisms by which proteins recognize genomic target sites remain incompletely understood. Functional groups at the edges of the base pairs (bp) exposed in the DNA grooves represent physicochemical signatures. As these signatures enable proteins to form specific contacts between protein residues and bp, their study can provide mechanistic insights into protein-DNA binding. Existing experimental methods, such as X-ray crystallography, can reveal such mechanisms based on physicochemical interactions between proteins and their DNA target sites. However, the low throughput of structural biology methods limits mechanistic insights for selection of many genomic sites. High-throughput binding assays enable prediction of potential target sites by determining relative binding affinities of a protein to massive numbers of DNA sequences. Many currently available computational methods are based on the sequence of standard Watson-Crick bp. They assume that the contribution to overall binding affinity is independent for each base pair, or alternatively include dinucleotides or short k-mers. These methods cannot directly expand to physicochemical contacts, and they are not suitable for DNA modifications or non-Watson-Crick bp. These variations include DNA methylation and synthetic or mismatched bp. The proposed method, DeepRec, can predict relative binding affinities as a function of physicochemical signatures and the effect of DNA methylation or other chemical modifications on binding. Sequence-based modeling methods are, in comparison, a coarse-grained description and cannot achieve such insights. Our chemistry-based modeling framework provides a path towards understanding genome function at a mechanistic level.
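The independence assumption mentioned in this abstract (each base pair contributes separately to the overall binding affinity) can be illustrated with a minimal Python sketch. This is not DeepRec and not any published model: the names `ENERGY`, `additive_energy`, and `relative_affinity` are hypothetical, and the per-position values are made up for illustration.

```python
import math

# Hypothetical per-position energy penalties (arbitrary units) for a 4-bp site.
# The numbers are illustrative only and do not come from any experiment.
ENERGY = [
    {"T": 0.0, "A": 1.0, "C": 2.0, "G": 1.5},
    {"G": 0.0, "A": 0.5, "C": 1.5, "T": 2.0},
    {"A": 0.0, "C": 1.0, "G": 2.0, "T": 0.5},
    {"T": 0.0, "A": 0.25, "C": 1.0, "G": 2.0},
]


def additive_energy(site):
    """Sum independent per-position penalties (the additivity assumption)."""
    if len(site) != len(ENERGY):
        raise ValueError("site length must match the model length")
    return sum(ENERGY[i][base] for i, base in enumerate(site))


def relative_affinity(site):
    """Boltzmann-like weight relative to the zero-penalty site."""
    return math.exp(-additive_energy(site))


# Under additivity, a double mutant costs exactly the sum of the single mutants.
double = additive_energy("AGAA")
singles = additive_energy("AGAT") + additive_energy("TGAA")
print(double, singles)
```

Because every position is scored on its own, such a model has no term for a contact that depends on two neighboring base pairs or on a chemical modification, which is the limitation the abstract points to.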
Subjects
DNA-Binding Proteins , DNA , Base Pairing , DNA/metabolism , Protein Binding , DNA-Binding Proteins/metabolism , Binding Sites
ABSTRACT
DNA recognition and targeting by transcription factors (TFs) through specific binding are fundamental in biological processes. Furthermore, the histidine protonation state at the TF-DNA binding interface can significantly influence the binding mechanism of TF-DNA complexes. Nevertheless, the role of histidine in TF-DNA complexes remains underexplored. Here, we employed all-atom molecular dynamics simulations using AlphaFold2-modeled complexes based on previously solved co-crystal structures to probe the role of the His-12 residue in the Extradenticle (Exd)-Sex combs reduced (Scr)-DNA complex when binding to Scr and Ultrabithorax (Ubx) target sites. Our results demonstrate that the protonation state of histidine notably affected the DNA minor-groove width profile and binding free energy. Examining flanking sequences of various binding affinities derived from SELEX-seq experiments, we analyzed the relationship between binding affinity and specificity. We uncovered how histidine protonation leads to increased binding affinity but can lower specificity. Our findings provide new mechanistic insights into the role of histidine in modulating TF-DNA binding.
Subjects
Drosophila Proteins , Homeodomain Proteins , Animals , Homeodomain Proteins/genetics , Histidine , Drosophila Proteins/metabolism , Drosophila melanogaster/metabolism , DNA/chemistry , Binding Sites , Transcription Factors/metabolism
ABSTRACT
SUMMARY: Several high-throughput protein-DNA binding methods currently available produce highly reproducible measurements of binding affinity at the level of the k-mer. However, understanding where a k-mer is positioned along a binding site sequence depends on alignment. Here, we present Top-Down Crawl (TDC), an ultra-rapid tool designed for the alignment of k-mer level data in a rank-dependent and position weight matrix (PWM)-independent manner. As the framework only depends on the rank of the input, the method can accept input from many types of experiments (protein binding microarray, SELEX-seq, SMiLE-seq, etc.) without the need for specialized parameterization. Measuring the performance of the alignment using multiple linear regression with 5-fold cross-validation, we find TDC to perform as well as or better than computationally expensive PWM-based methods. AVAILABILITY AND IMPLEMENTATION: TDC can be run online at https://topdowncrawl.usc.edu or locally as a Python package available through pip at https://pypi.org/project/TopDownCrawl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
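The statement that the framework depends only on the rank of the input can be made concrete with a short Python sketch. It is not the TopDownCrawl package or its API: `rank_kmers` is a hypothetical helper, and the two score tables are invented values standing in for measurements on different scales.

```python
def rank_kmers(scores):
    """Order k-mers from highest to lowest score; ties are broken alphabetically."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ordered]


# Hypothetical scores for the same k-mers from two assays with different scales.
microarray_like = {"TGATTA": 0.48, "TGATTG": 0.41, "TAATTA": 0.35, "CCCCCC": -0.10}
selex_like = {"TGATTA": 1.00, "TGATTG": 0.62, "TAATTA": 0.30, "CCCCCC": 0.01}

# The raw values differ, but the rank order (the only input a rank-based
# method would need) is identical.
print(rank_kmers(microarray_like))
print(rank_kmers(selex_like))
```

A method that consumes only this ordering needs no assay-specific rescaling, which is consistent with the abstract's claim that no specialized parameterization is required.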
Subjects
Software , Position-Specific Scoring Matrices , Binding Sites , DNA Sequence Analysis/methods , Protein Binding
ABSTRACT
Phosphorothioate (PT) DNA modifications, in which a nonbonding phosphate oxygen is replaced with sulfur, represent a widespread, horizontally transferred epigenetic system in prokaryotes and have a highly unusual property of occupying only a small fraction of available consensus sequences in a genome. Using Salmonella enterica as a model, we asked a question of fundamental importance: How do the PT-modifying DndA-E proteins select their GPSAAC/GPSTTC targets? Here, we applied innovative analytical, sequencing, and computational tools to discover a novel behavior for DNA-binding proteins: The Dnd proteins are "parked" at the G6mATC Dam methyltransferase consensus sequence instead of the expected GAAC/GTTC motif, with removal of the 6mA permitting extensive PT modification of GATC sites. This shift in modification sites further revealed a surprising constancy in the density of PT modifications across the genome. Computational analysis showed that GAAC, GTTC, and GATC share common features of DNA shape, which suggests that PT epigenetics are regulated in a density-dependent manner partly by DNA shape-driven target selection in the genome.
Subjects
Bacteria/genetics , Bacteria/metabolism , Bacterial DNA/metabolism , Genetic Epigenesis/physiology , Epigenomics , Phosphates/metabolism , 2-Aminopurine , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Base Sequence , Binding Sites , Consensus Sequence , Bacterial DNA/chemistry , Bacterial DNA/genetics , DNA-Binding Proteins/metabolism , Escherichia coli/metabolism , Bacterial Genome , Salmonella enterica/genetics
ABSTRACT
TFBSshape (https://tfbsshape.usc.edu) is a motif database for analyzing structural profiles of transcription factor binding sites (TFBSs). The main rationale for this database is to be able to derive mechanistic insights in protein-DNA readout modes from sequencing data without available structures. We extended the quantity and dimensionality of TFBSshape, from mostly in vitro to in vivo binding and from unmethylated to methylated DNA. This new release of TFBSshape improves its functionality and launches a responsive and user-friendly web interface for easy access to the data. The current expansion includes new entries from the most recent collections of transcription factors (TFs) from the JASPAR and UniPROBE databases, methylated TFBSs derived from in vitro high-throughput EpiSELEX-seq binding assays and in vivo methylated TFBSs from the MeDReaders database. TFBSshape content has increased to 2428 structural profiles for 1900 TFs from 39 different species. The structural profiles for each TFBS entry now include 13 shape features and minor groove electrostatic potential for standard DNA and four shape features for methylated DNA. We improved the flexibility and accuracy for the shape-based alignment of TFBSs and designed new tools to compare methylated and unmethylated structural profiles of TFs and methods to derive DNA shape-preserving nucleotide mutations in TFBSs.
Subjects
DNA/chemistry , Genetic Databases , Transcription Factors/metabolism , Binding Sites , DNA/metabolism , DNA Methylation , Mutation , Nucleotide Motifs , Protein Binding , DNA Sequence Analysis
ABSTRACT
Recognition of DNA by proteins depends on DNA sequence and structure. Often unanswered is whether the structure of naked DNA persists in a protein-DNA complex, or whether protein binding changes DNA shape. While X-ray structures of protein-DNA complexes are numerous, the structure of naked cognate DNA is seldom available experimentally. We present here an experimental and computational analysis pipeline that uses hydroxyl radical cleavage to map, at single-nucleotide resolution, DNA minor groove width, a recognition feature widely exploited by proteins. For 11 protein-DNA complexes, we compared experimental maps of naked DNA minor groove width with minor groove width measured from X-ray co-crystal structures. For seven sites, the minor groove widths were similar in naked DNA and in the protein-bound complex. For four sites, part of the DNA in the complex had the same structure as naked DNA, and part changed structure upon protein binding. We compared the experimental map with minor groove patterns of DNA predicted by two computational approaches, DNAshape and ORChID2, and found good but not perfect concordance with both. This experimental approach will be useful in mapping structures of DNA sequences for which high-resolution structural data are unavailable. This approach allows probing of protein family-dependent readout mechanisms.
Subjects
DNA-Binding Proteins/metabolism , DNA/chemistry , Binding Sites , DNA/metabolism , Molecular Models , Nucleic Acid Conformation , Nucleotides/chemistry , Protein Binding
ABSTRACT
Uncovering the mechanisms that affect the binding specificity of transcription factors (TFs) is critical for understanding the principles of gene regulation. Although sequence-based models have been used successfully to predict TF binding specificities, we found that including DNA shape information in these models improved their accuracy and interpretability. Previously, we developed a method for modeling DNA binding specificities based on DNA shape features extracted from Monte Carlo (MC) simulations. Prediction accuracies of our models, however, have not yet been compared to accuracies of models incorporating DNA shape information extracted from X-ray crystallography (XRC) data or Molecular Dynamics (MD) simulations. Here, we integrated DNA shape information extracted from MC or MD simulations and XRC data into predictive models of TF binding and compared their performance. Models that incorporated structural information consistently showed improved performance over sequence-based models regardless of data source. Furthermore, we derived and validated nine additional DNA shape features beyond our original set of four features. The expanded repertoire of 13 distinct DNA shape features, including six intra-base pair and six inter-base pair parameters and minor groove width, is available in our R/Bioconductor package DNAshapeR and enables a comprehensive structural description of the double helix on a genome-wide scale.
Subjects
Algorithms , Computational Biology/methods , DNA/chemistry , Genome-Wide Association Study/methods , Transcription Factors/chemistry , Base Sequence , X-Ray Crystallography , DNA/genetics , DNA/metabolism , Molecular Dynamics Simulation , Monte Carlo Method , Nucleic Acid Conformation , Protein Binding , Reproducibility of Results , Transcription Factors/metabolism
ABSTRACT
Protein-DNA binding is a fundamental component of gene regulatory processes, but it is still not completely understood how proteins recognize their target sites in the genome. Besides hydrogen bonding in the major groove (base readout), proteins recognize minor-groove geometry using positively charged amino acids (shape readout). The underlying mechanism of DNA shape readout involves the correlation between minor-groove width and electrostatic potential (EP). To probe this biophysical effect directly, rather than using minor-groove width as an indirect measure for shape readout, we developed a methodology, DNAphi, for predicting EP in the minor groove and confirmed the direct role of EP in protein-DNA binding using massive sequencing data. The DNAphi method uses a sliding-window approach to mine results from non-linear Poisson-Boltzmann (NLPB) calculations on DNA structures derived from all-atom Monte Carlo simulations. We validated this approach, which only requires nucleotide sequence as input, based on direct comparison with NLPB calculations for available crystal structures. Using statistical machine-learning approaches, we showed that adding EP as a biophysical feature can improve the predictive power of quantitative binding specificity models across 27 transcription factor families. High-throughput prediction of EP offers a novel way to integrate biophysical and genomic studies of protein-DNA binding.
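The sliding-window idea described in this abstract can be sketched in Python as a lookup of precomputed values for the k-mer centered on each position. This is an assumption-laden illustration, not the DNAphi implementation: the window length of 5, the function name `sliding_window_predict`, and the table values are all hypothetical.

```python
def sliding_window_predict(sequence, table, k=5):
    """Assign each position the tabulated value of the k-mer centered on it.

    Positions too close to either end to have a full window, and windows
    missing from the table, are reported as None.
    """
    half = k // 2
    values = [None] * len(sequence)
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        values[center] = table.get(window)
    return values


# Hypothetical lookup table: k-mer -> value at the central position.
# In a real workflow these entries would be averaged from simulations.
toy_table = {"AAAAA": -10.0, "AAAAT": -9.0}

print(sliding_window_predict("AAAAAT", toy_table))
```

The design choice that matters here is that only nucleotide sequence is needed at prediction time; all structural or electrostatic computation is done once, when the table is built.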
Subjects
DNA-Binding Proteins/metabolism , DNA/chemistry , Transcription Factors/metabolism , Binding Sites , DNA/metabolism , DNA-Binding Proteins/chemistry , Escherichia coli Proteins/metabolism , Factor For Inversion Stimulation Protein/metabolism , Genome , Genomics , Homeodomain Proteins/metabolism , Machine Learning , Molecular Models , Monte Carlo Method , Nucleic Acid Conformation , Phosphates/chemistry , Protein Binding , Static Electricity , Transcription Factors/chemistry
ABSTRACT
DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from genomic sequencing data. The package takes either nucleotide sequence or genomic coordinates as input and generates various graphical representations for visualization and further analysis. DNAshapeR further encodes DNA sequence and shape features as user-defined combinations of k-mer and DNA shape features. The resulting feature matrices can be readily used as input of various machine learning software packages for further modeling studies. AVAILABILITY AND IMPLEMENTATION: The DNAshapeR software package was implemented in the statistical programming language R and is freely available through the Bioconductor project at https://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html and at the GitHub developer site, http://tsupeichiu.github.io/DNAshapeR/. CONTACT: rohs@usc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
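The k-mer part of such a feature encoding can be sketched as follows. DNAshapeR itself is an R package, and this Python snippet neither calls nor reproduces its API; `kmer_one_hot` and `encode` are hypothetical names, and shape features are omitted because they would require a prediction table that is not given here.

```python
from itertools import product


def kmer_one_hot(sequence, k):
    """One-hot encode every overlapping k-mer; each window yields 4**k binary features."""
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    index = {kmer: column for column, kmer in enumerate(kmers)}
    features = []
    for start in range(len(sequence) - k + 1):
        block = [0] * len(kmers)
        block[index[sequence[start:start + k]]] = 1
        features.extend(block)
    return features


def encode(sequence, ks=(1, 2)):
    """Concatenate one-hot blocks for a user-chosen combination of k values."""
    vector = []
    for k in ks:
        vector.extend(kmer_one_hot(sequence, k))
    return vector


# A 4-bp sequence gives 4 * 4 mononucleotide and 3 * 16 dinucleotide features.
row = encode("ACGT")
print(len(row), sum(row))
```

Stacking one such row per sequence gives a feature matrix that can be passed to a regression or classification library, which matches the workflow the abstract describes.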
Subjects
DNA , Genomics , Software , Genome , Programming Languages
ABSTRACT
Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.
Subjects
DNA/chemistry , Nucleic Acid Databases , Genome , Molecular Sequence Annotation , Web Browser , Animals , Binding Sites , Humans , Nucleic Acid Conformation , Nucleosomes/metabolism , Transcription Initiation Site
ABSTRACT
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA structure, also described as DNA shape, plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, DNA structural features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing an understanding of the effects of flanking regions on DNA structure in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
Subjects
Deep Learning , Proteins/metabolism , Protein Binding , Machine Learning , DNA/metabolism
ABSTRACT
Circadian clock genes are emerging targets in many types of cancer, but their mechanistic contributions to tumor progression are still largely unknown. This makes it challenging to stratify patient populations and develop corresponding treatments. In this work, we show that in breast cancer, the disrupted expression of circadian genes has the potential to serve as a biomarker. We also show that the master circadian transcription factors (TFs) BMAL1 and CLOCK are required for the proliferation of metastatic mesenchymal stem-like (mMSL) triple-negative breast cancer (TNBC) cells. Using currently available small molecule modulators, we found that a stabilizer of cryptochrome 2 (CRY2), the direct repressor of BMAL1 and CLOCK transcriptional activity, synergizes with inhibitors of the proteasome, which is required for BMAL1 and CLOCK function, to repress a transcriptional program comprising circadian cycling genes in mMSL TNBC cells. Omics analyses on drug-treated cells implied that this repression of transcription is mediated by the transcription factor binding site (TFBS) features in the cis-regulatory elements (CREs) of clock-controlled genes. Through a massively parallel reporter assay, we defined a set of CRE features that are potentially repressed by the specific drug combination. The identification of cis-element enrichment might serve as a new concept of defining and targeting tumor types through the modulation of cis-regulatory programs, and ultimately provide a new paradigm of therapy design for cancer types with unclear drivers like TNBC.
ABSTRACT
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA shape plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, refined DNA shape features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing a deeper understanding of the effects of flanking regions on DNA shape in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
ABSTRACT
Predicting specificity in protein-DNA interactions is a challenging yet essential task for understanding gene regulation. Here, we present Deep Predictor of Binding Specificity (DeepPBS), a geometric deep-learning model designed to predict binding specificity across protein families based on protein-DNA structures. The DeepPBS architecture allows investigation of different family-specific recognition patterns. DeepPBS can be applied to predicted structures and can aid in the modeling of protein-DNA complexes. DeepPBS is interpretable and can be used to calculate protein heavy atom-level importance scores, as demonstrated in a case study on the p53-DNA interface. When aggregated at the protein residue level, these scores agree well with experimental alanine scanning mutagenesis data. The inference time for DeepPBS is sufficiently fast for analyzing simulation trajectories, as demonstrated on a molecular-dynamics simulation of a Drosophila Hox-DNA tertiary complex with its cofactor. DeepPBS and its corresponding data resources offer a foundation for machine-aided protein-DNA interaction studies, guiding experimental choices and complex design, as well as advancing our understanding of molecular interactions.
ABSTRACT
Macrophages display phenotypic plasticity and can be induced by hepatitis B virus (HBV) to undergo either M1-like pro-inflammatory or M2-like anti-inflammatory polarization. Here, we report that M1-like macrophages stimulated by HBV exhibit a strong HBV-suppressive effect, which is diminished in M2-like macrophages. Transcriptomic analysis reveals that HBV induces the expression of interleukin-1β (IL-1β) in M1-like macrophages, which display a high oxidative phosphorylation (OXPHOS) activity distinct from that of conventional M1-like macrophages. Further analysis indicates that OXPHOS attenuates the expression of IL-1β, which suppresses the expression of peroxisome proliferator-activated receptor α (PPARα) and forkhead box O3 (FOXO3) in hepatocytes to suppress HBV gene expression and replication. Moreover, multiple HBV proteins can induce the expression of IL-1β in macrophages. Our results thus indicate that macrophages can respond to HBV by producing IL-1β to suppress HBV replication. However, HBV can also metabolically reprogram macrophages to enhance OXPHOS to minimize this host antiviral response.
Subjects
Forkhead Box Protein O3/immunology , Hepatitis B/immunology , Interleukin-1beta/immunology , Macrophages/immunology , Macrophages/virology , PPAR gamma/immunology , Animals , Down-Regulation , Forkhead Box Protein O3/metabolism , Hepatitis B virus , Host-Pathogen Interactions/immunology , Humans , Interleukin-1beta/metabolism , Macrophages/metabolism , Male , Mice , Inbred C57BL Mice , PPAR gamma/metabolism , Virus Replication/immunology
ABSTRACT
BACKGROUND: DNA shape analysis has demonstrated the potential to reveal structure-based mechanisms of protein-DNA binding. However, information about the influence of chemical modification of DNA is limited. Cytosine methylation, the most frequent modification, represents the addition of a methyl group at the major groove edge of the cytosine base. In mammalian genomes, cytosine methylation most frequently occurs at CpG dinucleotides. In addition to changing the chemical signature of C/G base pairs, cytosine methylation can affect DNA structure. Since the original discovery of DNA methylation, major efforts have been made to understand its effect from a sequence perspective. Compared to unmethylated DNA, however, little structural information is available for methylated DNA, due to the limited number of experimentally determined structures. To achieve a better mechanistic understanding of the effect of CpG methylation on local DNA structure, we developed a high-throughput method, methyl-DNAshape, for predicting the effect of cytosine methylation on DNA shape. RESULTS: Using our new method, we found that CpG methylation significantly altered local DNA shape. Four DNA shape features (helix twist, minor groove width, propeller twist, and roll) were considered in this analysis. Distinct distributions of effect size were observed for different features. Roll and propeller twist were the DNA shape features most strongly affected by CpG methylation, with an effect size depending on the local sequence context. Methylation-induced changes in DNA shape were predictive of the measured rate of cleavage by DNase I and suggest a possible mechanism for some of the methylation sensitivities that were recently observed for human Pbx-Hox complexes. CONCLUSIONS: CpG methylation is an important epigenetic mark in the mammalian genome. Understanding its role in protein-DNA recognition can further our knowledge of gene regulation. Our high-throughput methyl-DNAshape method can be used to predict the effect of cytosine methylation on DNA shape and its subsequent influence on protein-DNA interactions. This approach overcomes the limited availability of experimental DNA structures that contain 5-methylcytosine.