ABSTRACT
Genomics has set the basis for a variety of methodologies that produce high-throughput datasets identifying the different players that define gene regulation, particularly regulation of transcription initiation and operon organization. These datasets are available in public repositories, such as the Gene Expression Omnibus or ArrayExpress. However, accessing and navigating such a wealth of data is not straightforward. No resource currently exists that offers all available high- and low-throughput data on transcriptional regulation in Escherichia coli K-12 in a form that is easy to use either as whole datasets or as individual interactions and regulatory elements. RegulonDB (https://regulondb.ccg.unam.mx) began gathering high-throughput dataset collections in 2009, starting with transcription start sites, then adding ChIP-seq and gSELEX in 2012, with up to 99 different experimental high-throughput datasets available in 2019. In this paper we present a radical upgrade to more than 2000 high-throughput datasets, processed to facilitate their comparison, introducing up-to-date collections of transcription termination sites and transcription units, as well as transcription factor binding interactions derived from ChIP-seq, ChIP-exo, gSELEX and DAP-seq experiments, in addition to expression profiles derived from RNA-seq experiments. For ChIP-seq experiments we offer both the data as presented by the authors and data uniformly processed in-house, which enhances their comparability as well as the traceability of the methods and the reproducibility of the results. Furthermore, we have expanded the tools available for browsing and visualization across and within datasets. We include comparisons against previously existing knowledge in RegulonDB from classic experiments, a nucleotide-resolution genome viewer, and an interface that enables users to browse datasets by querying their metadata.
A particular effort was made to automatically extract detailed experimental growth conditions by implementing an assisted curation strategy that applies natural language processing and machine learning. We provide summaries with the total number of interactions found in each experiment, as well as tools to identify common results among different experiments. This is a long-awaited resource for making use of this wealth of knowledge and advancing our understanding of the biology of the model bacterium E. coli K-12.
Subject(s)
Escherichia coli K12, Escherichia coli, Escherichia coli/genetics, Escherichia coli K12/genetics, Escherichia coli K12/metabolism, Gene Expression Regulation, Bacterial, Operon/genetics, Reproducibility of Results
ABSTRACT
Genetic variation allows the malaria parasite Plasmodium falciparum to overcome chemotherapeutic agents, vaccines and vector control strategies and remain a leading cause of global morbidity and mortality. Here we describe an initial survey of genetic variation across the P. falciparum genome. We performed extensive sequencing of 16 geographically diverse parasites and identified 46,937 SNPs, demonstrating rich diversity among P. falciparum parasites (π = 1.16 × 10⁻³) and strong correlation with gene function. We identified multiple regions with signatures of selective sweeps in drug-resistant parasites, including a previously unidentified 160-kb region with extremely low polymorphism in pyrimethamine-resistant parasites. We further characterized 54 worldwide isolates by genotyping SNPs across 20 genomic regions. These data begin to define population structure among African, Asian and American groups and illustrate the degree of linkage disequilibrium, which extends over relatively short distances in African parasites but over longer distances in Asian parasites. We provide an initial map of genetic diversity in P. falciparum and demonstrate its potential utility in identifying genes subject to recent natural selection and in understanding the population genetics of this parasite.