ABSTRACT
Transcription factor (TF) binding to DNA is critical to transcription regulation. Although the binding properties of numerous individual TFs are well-documented, a more detailed comprehension of how TFs interact cooperatively with DNA is required. We present COBIND, a novel method based on non-negative matrix factorization (NMF) to identify TF co-binding patterns automatically. COBIND applies NMF to one-hot encoded regions flanking known TF binding sites (TFBSs) to pinpoint enriched DNA patterns at fixed distances. We applied COBIND to 5699 TFBS datasets from UniBind for 401 TFs in seven species. The method recovered established co-binding patterns and uncovered new co-binding configurations, not yet reported in the literature, that were inferred through motif similarity and protein-protein interaction knowledge. Our extensive analyses across species revealed that 67% of the TFs shared a co-binding motif with other TFs from the same structural family. The co-binding patterns captured by COBIND are likely functionally relevant, as they show higher evolutionary conservation than isolated TFBSs. Open chromatin data from matching human cell lines further supported the co-binding predictions. Finally, we used single-molecule footprinting data from mouse embryonic stem cells to confirm that the COBIND-predicted co-binding events associated with some TFs likely occurred on the same DNA molecules.
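As an illustration of the idea summarized above (NMF applied to one-hot encoded flanking regions), the following is a minimal sketch and not the COBIND implementation. The flank sequences are invented, the rank of 2 is arbitrary, and the solver uses the standard Lee-Seung multiplicative updates, which are an assumption here since the abstract does not say how the factorization is computed.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA sequence into a vector of 4 * len(seq) values in {0, 1}."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factorize non-negative v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # Update h: h <- h * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update w: w <- w * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Reconstruction error ||V - WH|| (Frobenius norm)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

# Hypothetical flanks: the first four share "TGAC" at a fixed offset.
flanks = ["TGACAA", "TGACGT", "TGACCC", "TGACTG",
          "ACGTCA", "GGTACA", "CATGCA", "TTAGCA"]
v = [one_hot(s) for s in flanks]
w, h = nmf(v, rank=2)
print("reconstruction error:", round(frobenius_error(v, w, h), 3))
```

Each row of `h` can be read as a position-by-base pattern (four values per flank position), and each row of `w` gives how strongly a flank uses each pattern.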
Subjects
DNA; Protein Binding; Transcription Factors; Transcription Factors/metabolism; Humans; Binding Sites; Animals; Mice; DNA/metabolism; DNA/chemistry; Nucleotide Motifs; Chromatin/metabolism; Algorithms
ABSTRACT
JASPAR (https://jaspar.elixir.no/) is a widely used open-access database presenting manually curated, high-quality, and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release's UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20% increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing flanking base pairs with low information content, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and for modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs' structural DNA-binding domains. The new JASPAR collections prompted updates to the genomic tracks of predicted TF binding sites (TFBSs) in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome Browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.
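The flank trimming mentioned above can be pictured with a small sketch. This is not the JASPAR trimming algorithm, whose details the abstract does not give. It assumes a uniform background, measures per-position information content in bits, and uses a hypothetical threshold `min_ic` of 0.5 bits; the example matrix is invented.

```python
import math

def column_ic(counts):
    """Information content (bits) of one PFM column [A, C, G, T], uniform background."""
    total = sum(counts)
    ic = 2.0  # maximum for a 4-letter alphabet
    for c in counts:
        if c > 0:
            p = c / total
            ic += p * math.log2(p)  # subtracts the column's Shannon entropy
    return ic

def trim_pfm(pfm, min_ic=0.5):
    """Drop leading and trailing columns whose information content is below min_ic.

    pfm is a list of columns, each a list of four counts [A, C, G, T].
    Low-information columns inside the motif are kept; only flanks are trimmed.
    """
    ics = [column_ic(col) for col in pfm]
    start = 0
    while start < len(pfm) and ics[start] < min_ic:
        start += 1
    end = len(pfm)
    while end > start and ics[end - 1] < min_ic:
        end -= 1
    return pfm[start:end]

# Hypothetical 5-column profile with uninformative flanks.
example_pfm = [
    [5, 5, 5, 5],    # uniform flank, 0 bits
    [20, 0, 0, 0],   # fixed A, 2 bits
    [5, 5, 5, 5],    # uniform, but internal to the motif
    [0, 0, 20, 0],   # fixed G, 2 bits
    [6, 5, 5, 4],    # near-uniform flank
]
trimmed = trim_pfm(example_pfm)
print(len(example_pfm), "->", len(trimmed), "columns")
```

Trimming only from the ends preserves spacers inside a motif, which matters for profiles with two informative half-sites separated by a variable gap.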
Subjects
Databases, Genetic; Protein Binding; Transcription Factors; Animals; Humans; Mice; Databases, Genetic/standards; Databases, Genetic/trends; Transcription Factors/genetics; Transcription Factors/metabolism; Plants/genetics
ABSTRACT
Since high-throughput techniques became a staple in biological science laboratories, computational algorithms and scientific software have boomed. However, the development of bioinformatics software usually lacks software development quality standards. The resulting software code is hard to test, reuse, and maintain. We believe that the root of the inefficiency in implementing the best software development practices in academic settings is the individualistic approach, which has traditionally been the norm for recognizing scientific achievements and, by extension, for developing specialized software. Software development is a collective effort in most software-heavy endeavors. Indeed, the literature suggests teamwork directly impacts code quality through knowledge sharing, collective software development, and established coding standards. In our computational biology research groups, we sustainably involve all group members in learning, sharing, and discussing software development while maintaining the personal ownership of research projects and related software products. We found that group members involved in this endeavor improved their coding skills, became more efficient bioinformaticians, and obtained detailed knowledge about their peers' work, triggering new collaborative projects. We strongly advocate for improving software development culture within bioinformatics through collective effort in computational biology groups or institutes with three or more bioinformaticians.
ABSTRACT
JASPAR (http://jaspar.genereg.net/) is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups. In this 9th release, we expanded the CORE collection with 341 new profiles (148 for plants, 101 for vertebrates, 85 for urochordates, and 7 for insects), which corresponds to a 19% expansion over the previous release. We added 298 new profiles to the Unvalidated collection when no orthogonal evidence was found in the literature. All the profiles were clustered to provide familial binding profiles for each taxonomic group. Moreover, we revised the structural classification of DNA binding domains to consider plant-specific TFs. This release introduces word clouds to represent the scientific knowledge associated with each TF. We updated the genome tracks of TFBSs predicted with JASPAR profiles in eight organisms; the human and mouse TFBS predictions can be visualized as native tracks in the UCSC Genome Browser. Finally, we provide a new tool to perform JASPAR TFBS enrichment analysis in user-provided genomic regions. All the data is accessible through the JASPAR website, its associated RESTful API, the R/Bioconductor data package, and a new Python package, pyJASPAR, that facilitates serverless access to the data.
Subjects
Databases, Genetic; Genomics/classification; Software; Transcription Factors/genetics; Animals; Binding Sites/genetics; Computational Biology; Genome/genetics; Humans; Mice; Plants/genetics; Protein Binding/genetics; Transcription Factors/classification; Vertebrates/genetics
ABSTRACT
The molecular diversity of prostate cancer (PCa) has been demonstrated by recent genome-wide studies, which have proposed a significant number of different molecular markers. However, only a few of them have been transferred into clinical practice so far. The present study aimed to identify and validate novel DNA methylation biomarkers for PCa diagnosis and prognosis. Microarray-based methylome data of well-characterized cancerous and noncancerous prostate tissue (NPT) pairs were used for the initial screening. Ten protein-coding genes were selected for validation in a set of 151 PCa, 51 NPT, and 17 benign prostatic hyperplasia samples. The Prostate Cancer Dataset (PRAD) of The Cancer Genome Atlas (TCGA) was utilized for independent validation of our findings. Methylation frequencies of ADAMTS12, CCDC181, FILIP1L, NAALAD2, PRKCB, and ZMIZ1 were up to 91% in our study. PCa-specific methylation of ADAMTS12, CCDC181, NAALAD2, and PRKCB was demonstrated by qualitative and quantitative means (all p < 0.05). In agreement with PRAD, promoter methylation of these four genes was associated with transcript down-regulation in the Lithuanian cohort (all p < 0.05). Methylation of ADAMTS12, NAALAD2, and PRKCB was independently predictive of biochemical disease recurrence, while NAALAD2 and PRKCB increased the prognostic power of multivariate models (all p < 0.01). The present study identified methylation of ADAMTS12, NAALAD2, and PRKCB as novel diagnostic and prognostic PCa biomarkers that might guide treatment decisions in clinical practice.
Subjects
ADAMTS Proteins/genetics; Glutamate Carboxypeptidase II/genetics; Prostatic Hyperplasia/genetics; Prostatic Neoplasms/genetics; Protein Kinase C beta/genetics; Adult; Aged; Aged, 80 and over; DNA Methylation/genetics; Gene Expression Regulation, Neoplastic/genetics; Humans; Intracellular Signaling Peptides and Proteins/genetics; Male; Middle Aged; Neoplasm Recurrence, Local/genetics; Neoplasm Recurrence, Local/pathology; Promoter Regions, Genetic/genetics; Prostatic Hyperplasia/pathology; Prostatic Neoplasms/pathology; Transcription Factors/genetics
ABSTRACT
BACKGROUND: Prostate cancer (PCa) has the highest incidence rate among cancers in men in Western countries. Unlike several other types of cancer, PCa has few genetic drivers, which has led researchers to look for additional epigenetic and transcriptomic contributors to PCa development and progression. In particular, DNA methylation, the most commonly studied epigenetic marker, has recently been measured and analysed in several PCa patient cohorts. DNA methylation is most commonly associated with downregulation of gene expression. However, positive associations between DNA methylation and gene expression have also been reported, suggesting a more diverse mechanism of epigenetic regulation. Such additional complexity could have important implications for understanding prostate cancer development but has not been studied at a genome-wide scale. RESULTS: In this study, we compared three sets of genome-wide single-site DNA methylation data from 870 PCa and normal tissue samples with multi-cohort gene expression data from 1117 samples, including 532 samples in which DNA methylation and gene expression were measured on the exact same samples. Genes were classified according to their corresponding methylation and expression profiles. A large group of hypermethylated genes was robustly associated with increased gene expression (UPUP group) in all three methylation datasets. These genes demonstrated distinct patterns of correlation between DNA methylation and gene expression compared to the genes showing the canonical negative association between methylation and expression (UPDOWN group). This indicates a more diversified role of DNA methylation in regulating gene expression than previously appreciated. Moreover, UPUP and UPDOWN genes were associated with different compartments: UPUP genes were related to structures in the nucleus, while UPDOWN genes were linked to extracellular features.
CONCLUSION: We identified a robust association between hypermethylation and upregulation of gene expression when comparing samples from prostate cancer and normal tissue. These results challenge the classical view where DNA methylation is always associated with suppression of gene expression, which underlines the importance of considering corresponding expression data when assessing the downstream regulatory effect of DNA methylation.
Subjects
DNA Methylation; DNA, Neoplasm; Epigenesis, Genetic; Gene Expression Regulation, Neoplastic; Prostatic Neoplasms; Up-Regulation; DNA, Neoplasm/genetics; DNA, Neoplasm/metabolism; Humans; Male; Prostatic Neoplasms/genetics; Prostatic Neoplasms/metabolism; Prostatic Neoplasms/pathology
ABSTRACT
Sequencing technologies have changed not only our approaches to classical genetics, but also the field of epigenetics. Specific methods allow scientists to identify novel genome-wide epigenetic patterns of DNA methylation down to single-nucleotide resolution. DNA methylation is the most researched epigenetic mark and is involved in various processes in the human cell, including gene regulation and the development of diseases such as cancer. Increasing numbers of DNA methylation sequencing datasets from the human genome are produced using various platforms, from methylated DNA precipitation to whole-genome bisulfite sequencing. Many of those datasets are fully accessible for repeated analyses. Sequencing experiments have become routine in laboratories around the world, while analysis of the resulting data is still a challenge for the majority of scientists, since in many cases it requires advanced computational skills. Even though various tools are being created and published, guidelines for their selection are often not clear, especially to non-bioinformaticians with limited experience in computational analyses. Separate tools are often used for individual steps in the analysis, and these can be challenging to manage and integrate. However, in some instances, tools are combined into pipelines that are capable of completing all the essential steps to achieve the result. In the case of DNA methylation sequencing analysis, the goal of such a pipeline is to map sequencing reads, calculate methylation levels, and distinguish differentially methylated positions and/or regions. The objective of this review is to describe the basic principles and steps in the analysis of DNA methylation sequencing data, in particular those that have been used for mammalian genomes, and, more importantly, to present and discuss the most prominent computational pipelines that can be used to analyze such data. We aim to provide a good starting point for scientists with limited experience in computational analyses of DNA methylation and hydroxymethylation data, and recommend a few tools that are powerful, but still easy enough to use for their own data analysis.
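Two of the pipeline steps named above, calculating methylation levels and flagging differentially methylated positions, can be sketched as follows. This is a simplified illustration and not any published pipeline. The input format (position mapped to methylated and unmethylated read counts), the coverage cutoff `min_coverage`, and the difference cutoff `min_delta` are all assumptions, and real tools typically apply a statistical test rather than a fixed threshold.

```python
def methylation_level(methylated, unmethylated):
    """Fraction of reads supporting methylation at one cytosine (None if no reads)."""
    total = methylated + unmethylated
    return methylated / total if total else None

def differential_positions(sample_a, sample_b, min_coverage=5, min_delta=0.25):
    """List positions whose methylation level differs by at least min_delta.

    Each sample maps a genomic position to (methylated_reads, unmethylated_reads).
    Only positions covered in both samples with at least min_coverage reads are tested.
    Returns tuples of (position, level_a, level_b, level_b - level_a).
    """
    hits = []
    for pos in sorted(set(sample_a) & set(sample_b)):
        meth_a, unmeth_a = sample_a[pos]
        meth_b, unmeth_b = sample_b[pos]
        if meth_a + unmeth_a < min_coverage or meth_b + unmeth_b < min_coverage:
            continue
        level_a = methylation_level(meth_a, unmeth_a)
        level_b = methylation_level(meth_b, unmeth_b)
        if level_a is None or level_b is None:
            continue  # only reachable when min_coverage is set to 0
        delta = level_b - level_a
        if abs(delta) >= min_delta:
            hits.append((pos, level_a, level_b, delta))
    return hits

# Hypothetical read counts per CpG position for two samples.
normal = {100: (9, 1), 200: (5, 5), 300: (1, 1)}
tumour = {100: (2, 8), 200: (6, 4), 300: (0, 2), 400: (5, 5)}
for pos, a, b, delta in differential_positions(normal, tumour):
    print(pos, round(a, 2), round(b, 2), round(delta, 2))
```

Grouping neighbouring hits into differentially methylated regions would be a further step that this sketch does not attempt.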