RESUMEN
RegulonDB provides curated information on the transcriptional regulatory network of Escherichia coli and contains both experimental data and computationally predicted objects. To account for the heterogeneity of these data, we introduced in version 6.0, a two-tier rating system for the strength of evidence, classifying evidence as either 'weak' or 'strong' (Gama-Castro,S., Jimenez-Jacinto,V., Peralta-Gil,M. et al. RegulonDB (Version 6.0): gene regulation model of Escherichia Coli K-12 beyond transcription, active (experimental) annotated promoters and textpresso navigation. Nucleic Acids Res., 2008;36:D120-D124.). We now add to our classification scheme the classification of high-throughput evidence, including chromatin immunoprecipitation (ChIP) and RNA-seq technologies. To integrate these data into RegulonDB, we present two strategies for the evaluation of confidence, statistical validation and independent cross-validation. Statistical validation involves verification of ChIP data for transcription factor-binding sites, using tools for motif discovery and quality assessment of the discovered matrices. Independent cross-validation combines independent evidence with the intention to mutually exclude false positives. Both statistical validation and cross-validation allow to upgrade subsets of data that are supported by weak evidence to a higher confidence level. Likewise, cross-validation of strong confidence data extends our two-tier rating system to a three-tier system by introducing a third confidence score 'confirmed'. Database URL: http://regulondb.ccg.unam.mx/
Asunto(s)
Biología Computacional/métodos , Bases de Datos Genéticas , Escherichia coli/genética , Regulón/genética , Estadística como Asunto , Vías Biosintéticas/genética , Inmunoprecipitación de Cromatina , Regulación Bacteriana de la Expresión Génica , Redes Reguladoras de Genes , Posición Específica de Matrices de Puntuación , Reproducibilidad de los Resultados , Sitio de Iniciación de la TranscripciónRESUMEN
RegulonDB (http://regulondb.ccg.unam.mx/) is the primary reference database offering curated knowledge of the transcriptional regulatory network of Escherichia coli K12, currently the best-known electronically encoded database of the genetic regulatory network of any free-living organism. This paper summarizes the improvements, new biology and new features available in version 6.0. Curation of original literature is, from now on, up to date for every new release. All the objects are supported by their corresponding evidences, now classified as strong or weak. Transcription factors are classified by origin of their effectors and by gene ontology class. We have now computational predictions for sigma(54) and five different promoter types of the sigma(70) family, as well as their corresponding -10 and -35 boxes. In addition to those curated from the literature, we added about 300 experimentally mapped promoters coming from our own high-throughput mapping efforts. RegulonDB v.6.0 now expands beyond transcription initiation, including RNA regulatory elements, specifically riboswitches, attenuators and small RNAs, with their known associated targets. The data can be accessed through overviews of correlations about gene regulation. RegulonDB associated original literature, together with more than 4000 curation notes, can now be searched with the Textpresso text mining engine.
Asunto(s)
Bases de Datos Genéticas , Escherichia coli K12/genética , Regulación Bacteriana de la Expresión Génica , Redes Reguladoras de Genes , Biología Computacional , Internet , Modelos Genéticos , Regiones Promotoras Genéticas , Secuencias Reguladoras de Ácido Ribonucleico , Regulón , Factor sigma/metabolismo , Programas Informáticos , Factores de Transcripción/metabolismo , Sitio de Iniciación de la Transcripción , Transcripción GenéticaRESUMEN
Experimental data on the Escherichia coli transcriptional regulation has enabled the construction of statistical models to predict new regulatory elements within its genome. Far less is known about the transcriptional regulatory elements in other gamma-proteobacteria with sequenced genomes, so it is of great interest to conduct comparative genomic studies oriented to extracting biologically relevant information about transcriptional regulation in these less studied organisms using the knowledge from E. coli. In this work, we use the information stored in the TRACTOR_DB database to conduct a comparative study on the mechanisms of transcriptional regulation in eight gamma-proteobacteria and 38 regulons. We assess the conservation of transcription factors binding specificity across all the eight genomes and show a correlation between the conservation of a regulatory site and the structure of the transcription unit it regulates. We also find a marked conservation of site-promoter distances across the eight organisms and a correspondence of the statistical significance of co-occurrence of pairs of transcription factor binding sites in the regulatory regions, which is probably related to a conserved architecture of higher-order regulatory complexes in the organisms studied. The results obtained in this study using the information on transcriptional regulation in E. coli enable us to conclude that not only transcription factor-binding sites are conserved across related species but also several of the transcriptional regulatory mechanisms previously identified in E. coli.
Asunto(s)
Biología Computacional , Gammaproteobacteria/genética , Regulación Bacteriana de la Expresión Génica , Genoma Bacteriano , Transcripción Genética , Sitios de Unión/genética , Regiones Promotoras Genéticas , Regulón , Sintenía , Factores de Transcripción/genéticaRESUMEN
We present here a computational analysis showing that sigma70 house-keeping promoters are located within zones with high densities of promoter-like signals in Escherichia coli, and we introduce strategies that allow for the correct computer prediction of sigma70 promoters. Based on 599 experimentally verified promoters of E.coli K-12, we generated and evaluated more than 200 weight matrices optimizing different criteria to obtain the best recognition matrices. The alignments generating the best statistical models did not fully correspond with the canonical sigma70 model. However, matrices that correspond to such a canonical model performed better as tools for prediction. We tested the predictive capacity of these matrices on 250 bp long regions upstream of gene starts, where 90% of the known promoters occur. The computational matrix models generated an average of 38 promoter-like signals within each 250 bp region. In more than 50% of the cases, the true promoter does not have the best score within the region. We observed, in fact, that real promoters occur mostly within regions with high densities of overlapping putative promoters. We evaluated several strategies to identify promoters. The best one uses an intrinsic score of the -10 and -35 hexamers that form the promoter as well as an extrinsic score that uses the distribution of promoters from the start of the gene. We were able to identify 86% true promoters correctly, generating an average of 4.7 putative promoters per region as output, of which 3.7, on average, exist in clusters, as a series of overlapping potentially competing RNA polymerase-binding sites. As far as we know, this is the highest predictive capability reported so far. This high signal density is found mainly within regions upstream of genes, contrasting with coding regions and regions located between convergently transcribed genes. These results are consistent with experimental evidence that show the existence of multiple overlapping promoter sites that become functional under particular conditions. This density is probably the consequence of a rich number of vestiges of promoters in evolution. We suggest that transcriptional regulators as well as other functional promoters play an important role in keeping these latent signals suppressed.