ABSTRACT
Recently developed single-cell technologies allow researchers to characterize cell states at ever greater resolution and scale. Caenorhabditis elegans is a particularly tractable system for studying development, and recent single-cell RNA-seq studies characterized the gene expression patterns for nearly every cell type in the embryo and at the second larval stage (L2). Gene expression patterns give insight into gene function and the biochemical state of different cell types; recent advances in other single-cell genomics technologies can now also characterize, at single-cell resolution, the regulatory context of the genome that gives rise to these gene expression levels. To explore the regulatory DNA of individual cell types in C. elegans, we collected single-cell chromatin accessibility data using the sci-ATAC-seq assay in L2 larvae to match the available single-cell RNA-seq data set. By using a novel implementation of the latent Dirichlet allocation algorithm, we identify 37 clusters of cells that correspond to different cell types in the worm, providing new maps of putative cell type-specific gene regulatory sites, with promise for better understanding of cellular differentiation and gene regulation.
Subjects
Caenorhabditis elegans, Chromatin, Animals, Caenorhabditis elegans/genetics, Chromatin/genetics, Chromatin Immunoprecipitation Sequencing, DNA/genetics, Gene Expression Regulation
ABSTRACT
Single cell ATAC-seq (scATAC-seq) enables the mapping of regulatory elements in fine-grained cell types. Despite this advance, analysis of the resulting data is challenging, and large scale scATAC-seq data are difficult to obtain and expensive to generate. This motivates a method to leverage information from previously generated large scale scATAC-seq or scRNA-seq data to guide our analysis of new scATAC-seq datasets. We analyze scATAC-seq data using latent Dirichlet allocation (LDA), a Bayesian algorithm that was developed to model text corpora, summarizing documents as mixtures of topics defined based on the words that distinguish the documents. When applied to scATAC-seq, LDA treats cells as documents and their accessible sites as words, identifying "topics" based on the cell type-specific accessible sites in those cells. Previous work used uniform symmetric priors in LDA, but we hypothesized that nonuniform matrix priors generated from LDA models trained on existing data sets may enable improved detection of cell types in new data sets, especially if they have relatively few cells. In this work, we test this hypothesis in scATAC-seq data from whole C. elegans nematodes and SHARE-seq data from mouse skin cells. We show that nonsymmetric matrix priors for LDA improve our ability to capture cell type information from small scATAC-seq datasets.
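The cells-as-documents, sites-as-words mapping described above can be illustrated with a minimal sketch. This is not the authors' implementation: it is a toy collapsed Gibbs sampler in pure Python, and the function name `lda_gibbs`, the hyperparameter values, and the hand-built prior matrix `eta` are all assumptions made for illustration. In the study, the matrix prior is derived from an LDA model trained on existing data, whereas here it is simply written by hand and made deliberately strong.

```python
import random

def lda_gibbs(cells, n_sites, n_topics, eta, alpha=1.0, n_iter=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA on binarized accessibility data.

    cells: one list of accessible-site indices per cell (cell = document, site = word).
    eta:   n_topics x n_sites matrix of topic-site pseudo-counts (the matrix prior);
           a uniform symmetric prior would use the same value in every entry.
    Returns theta, the estimated topic proportions of each cell.
    """
    rng = random.Random(seed)
    eta_sum = [sum(row) for row in eta]
    cell_topic = [[0] * n_topics for _ in cells]
    topic_site = [[0] * n_sites for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every (cell, site) observation.
    for c, sites in enumerate(cells):
        z = [rng.randrange(n_topics) for _ in sites]
        for s, k in zip(sites, z):
            cell_topic[c][k] += 1
            topic_site[k][s] += 1
            topic_total[k] += 1
        assign.append(z)
    for _ in range(n_iter):
        for c, sites in enumerate(cells):
            for i, s in enumerate(sites):
                # Remove the current assignment, then resample it.
                k = assign[c][i]
                cell_topic[c][k] -= 1
                topic_site[k][s] -= 1
                topic_total[k] -= 1
                weights = [
                    (cell_topic[c][t] + alpha)
                    * (topic_site[t][s] + eta[t][s])
                    / (topic_total[t] + eta_sum[t])
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[c][i] = k
                cell_topic[c][k] += 1
                topic_site[k][s] += 1
                topic_total[k] += 1
    return [
        [(n + alpha) / (len(sites) + n_topics * alpha) for n in counts]
        for counts, sites in zip(cell_topic, cells)
    ]

# Hypothetical toy data: cells 0-2 open sites 0-4, cells 3-5 open sites 5-9.
cells = [list(range(0, 5))] * 3 + [list(range(5, 10))] * 3
# Hand-written nonuniform prior: topic 0 favors sites 0-4, topic 1 favors sites 5-9.
eta = [[50.0] * 5 + [0.01] * 5, [0.01] * 5 + [50.0] * 5]
theta = lda_gibbs(cells, n_sites=10, n_topics=2, eta=eta)
```

With a prior this informative, the first three cells should load mainly on topic 0 and the last three on topic 1 even though there are only six cells, which is the intuition behind using matrix priors for small data sets.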
Subjects
Algorithms, Caenorhabditis elegans, Animals, Mice, Caenorhabditis elegans/genetics, Bayes Theorem, Chromatin, Nucleic Acid Regulatory Sequences, Single-Cell Analysis/methods
ABSTRACT
Chromatin immunoprecipitation (IP) followed by sequencing (ChIP-seq) is the gold standard to detect transcription-factor (TF) binding sites in the genome. Its success depends on appropriate controls removing systematic biases. The predominantly used controls, i.e. DNA input, correct for uneven sonication, but not for nonspecific interactions of the IP antibody. Another type of control, 'mock' IP, corrects for both issues, but is not widely used because it is considered susceptible to technical noise. The tradeoff between the two control types has not been investigated systematically. Therefore, we generated comparable DNA input and mock IP experiments. Because mock IPs contain only nonspecific interactions, the sites predicted from them using DNA input indicate the spurious-site abundance. This abundance is highly correlated with the 'genomic activity' (e.g. chromatin openness). In particular, compared to cell lines, complex samples such as whole organisms have more spurious sites, probably because they contain multiple cell types, resulting in more expressed genes and more open chromatin. Consequently, DNA input and mock IP controls performed similarly for cell lines, whereas for complex samples, mock IP substantially reduced the number of spurious sites. However, DNA input is still informative; thus, we developed a simple framework integrating both controls, improving binding site detection.
Subjects
Chromatin Immunoprecipitation Sequencing/methods, Transcription Factors/metabolism, Antibodies, Binding Sites, Cell Line, DNA, Humans
ABSTRACT
We have used RNA-seq in Caenorhabditis elegans to produce transcription profiles for seven specific embryonic cell populations from gastrulation to the onset of terminal differentiation. The expression data for these seven cell populations, covering major cell lineages and tissues in the worm, reveal the complex and dynamic changes in gene expression, both spatially and temporally. Also, within genes, start sites and exon usage can be highly differential, producing transcripts that are specific to developmental periods or cell lineages. We have also found evidence of novel exons and introns, as well as differential usage of SL1 and SL2 splice leaders. By combining this data set with the modERN ChIP-seq resource, we are able to support and predict gene regulatory relationships. The detailed information on differences and similarities between gene expression in cell lineages and tissues should be of great value to the community and provides a framework for the investigation of expression in individual cells.
Subjects
Alternative Splicing, Caenorhabditis elegans/genetics, Embryonic Development/genetics, Transcriptome, Animals, Caenorhabditis elegans/embryology, Computational Biology/methods, Exons, Gene Expression Profiling, Developmental Gene Expression Regulation, Introns, Molecular Sequence Annotation, Organ Specificity, RNA Editing, RNA Splice Sites
ABSTRACT
Discovering the structure and dynamics of transcriptional regulatory events in the genome with cellular and temporal resolution is crucial to understanding the regulatory underpinnings of development and disease. We determined the genomic distribution of binding sites for 92 transcription factors and regulatory proteins across multiple stages of Caenorhabditis elegans development by performing 241 ChIP-seq (chromatin immunoprecipitation followed by sequencing) experiments. Integration of regulatory binding and cellular-resolution expression data produced a spatiotemporally resolved metazoan transcription factor binding map. Using this map, we explore developmental regulatory circuits that encode combinatorial logic at the levels of co-binding and co-expression of transcription factors, characterizing the genomic coverage and clustering of regulatory binding, the binding preferences of, and biological processes regulated by, transcription factors, the global transcription factor co-associations and genomic subdomains that suggest shared patterns of regulation, and identifying key transcription factors and transcription factor co-associations for fate specification of individual lineages and cell types.
Subjects
Caenorhabditis elegans/growth & development, Caenorhabditis elegans/genetics, Developmental Gene Expression Regulation/genetics, Helminth Genome/genetics, Spatio-Temporal Analysis, Transcription Factors/metabolism, Animals, Binding Sites, Caenorhabditis elegans/cytology, Caenorhabditis elegans/embryology, Caenorhabditis elegans Proteins/metabolism, Cell Lineage, Chromatin Immunoprecipitation, Genomics, Larva/cytology, Larva/genetics, Larva/growth & development, Larva/metabolism, Protein Binding
ABSTRACT
A catalog of transcription factor (TF) binding sites in the genome is critical for deciphering regulatory relationships. Here we present the culmination of the modERN (model organism Encyclopedia of Regulatory Networks) consortium that systematically assayed TF binding events in vivo in two major model organisms, Drosophila melanogaster (fly) and Caenorhabditis elegans (worm). We describe key features of these datasets, comprising 604 TFs identifying 3.6 M sites in the fly and 350 TFs identifying 0.9 M sites in the worm. Applying a machine learning model to these data identifies sets of TFs with a prominent role in promoting target gene expression in specific cell types. TF binding data are available through the ENCODE Data Coordinating Center and at https://epic.gs.washington.edu/modERNresource, which provides access to processed and summary data, as well as widgets to probe cell type-specific TF-target relationships. These data are a rich resource that should fuel investigations into TF function during development.
ABSTRACT
Gene activity defines cell identity, drives intercellular communication, and underlies the functioning of multicellular organisms. We present the single-cell resolution atlas of gene activity of a fertile adult metazoan: Caenorhabditis elegans. This compendium comprises 180 distinct cell types and 19,657 expressed genes. We predict 7541 transcription factor expression profile associations likely responsible for defining cellular identity. We predict thousands of intercellular interactions across the C. elegans body and the ligand-receptor pairs that mediate them, some of which we experimentally validate. We identify 172 genes that show consistent expression across cell types, are involved in basic and essential functions, and are conserved across phyla; therefore, we present them as experimentally validated housekeeping genes. We developed the WormSeq application to explore these data. In addition to the integrated gene-to-systems biology, we present genome-scale single-cell resolution testable hypotheses that we anticipate will advance our understanding of the molecular mechanisms underlying the functioning of a multicellular organism and the perturbations that lead to its malfunction.
Subjects
Caenorhabditis elegans Proteins, Caenorhabditis elegans, Animals, Caenorhabditis elegans/genetics, Caenorhabditis elegans/metabolism, Caenorhabditis elegans Proteins/genetics, Caenorhabditis elegans Proteins/metabolism, Transcription Factors/metabolism, Gene Expression Regulation, Gene Expression
ABSTRACT
Transcription factors (TFs) play a key role in development and in cellular responses to the environment by activating or repressing the transcription of target genes in precise spatial and temporal patterns. In order to develop a catalog of target genes of Drosophila melanogaster TFs, the modERN consortium systematically knocked down the expression of TFs using RNAi in whole embryos followed by RNA-seq. We generated data for 45 TFs which have 18 different DNA-binding domains and are expressed in 15 of the 16 organ systems. Inactivation of the targeted TFs by RNAi ranged from a log2 fold change of -3.52 to +0.49. The TFs also showed remarkable heterogeneity in the numbers of candidate target genes identified, with some generating thousands of candidates and others only tens. We present detailed analysis from five experiments, including those for three TFs that have been the focus of previous functional studies (ERR, sens, and zfh2) and two previously uncharacterized TFs (sens-2 and CG32006), as well as short vignettes for selected additional experiments to illustrate the utility of this resource. The RNA-seq datasets are available through the ENCODE DCC (http://encodeproject.org) and the Sequence Read Archive (SRA). TF and target gene expression patterns can be found here: https://insitu.fruitfly.org. These studies provide data that facilitate scientific inquiries into the functions of individual TFs in key developmental, metabolic, defensive, and homeostatic regulatory pathways, as well as provide a broader perspective on how individual TFs work together in local networks during embryogenesis.
Subjects
Drosophila Proteins, Drosophila, Animals, Drosophila/metabolism, Drosophila melanogaster/metabolism, RNA Interference, Transcription Factors/metabolism, Drosophila Proteins/genetics, Drosophila Proteins/metabolism, Estrogen Receptors/genetics, Estrogen Receptors/metabolism, DNA-Binding Proteins/genetics
ABSTRACT
To develop a catalog of regulatory sites in two major model organisms, Drosophila melanogaster and Caenorhabditis elegans, the modERN (model organism Encyclopedia of Regulatory Networks) consortium has systematically assayed the binding sites of transcription factors (TFs). Combined with data produced by our predecessor, modENCODE (Model Organism ENCyclopedia Of DNA Elements), we now have data for 262 TFs identifying 1.23 M sites in the fly genome and 217 TFs identifying 0.67 M sites in the worm genome. Because sites from different TFs are often overlapping and tightly clustered, they fall into 91,011 and 59,150 regions in the fly and worm, respectively, and these binding sites span as little as 8.7 and 5.8 Mb in the two organisms. Clusters with large numbers of sites (so-called high occupancy target, or HOT regions) predominantly associate with broadly expressed genes, whereas clusters containing sites from just a few factors are associated with genes expressed in tissue-specific patterns. All of the strains expressing GFP-tagged TFs are available at the stock centers, and the chromatin immunoprecipitation sequencing data are available through the ENCODE Data Coordinating Center and also through a simple interface (http://epic.gs.washington.edu/modERN/) that facilitates rapid accessibility of processed data sets. These data will facilitate a vast number of scientific inquiries into the function of individual TFs in key developmental, metabolic, and defense and homeostatic regulatory pathways, as well as provide a broader perspective on how individual TFs work together in local networks and globally across the life spans of these two key model organisms.
Subjects
Caenorhabditis elegans/genetics, Caenorhabditis elegans/metabolism, Genetic Databases, Drosophila/genetics, Drosophila/metabolism, Genome-Wide Association Study, Transcription Factors/metabolism, Animals, Binding Sites, Chromatin Immunoprecipitation, Genome-Wide Association Study/methods, Biological Models, Nucleotide Motifs, Protein Binding
ABSTRACT
Mutants remain a powerful means for dissecting gene function in model organisms such as Caenorhabditis elegans. Massively parallel sequencing has simplified the detection of variants after mutagenesis, but determining precisely which change is responsible for phenotypic perturbation remains a key step. Genetic mapping paradigms in C. elegans rely on bulk segregant populations produced by crosses with the problematic Hawaiian wild isolate and an excess of redundant information from whole-genome sequencing (WGS). To increase the repertoire of available mutants and to simplify identification of the causal change, we performed WGS on 173 temperature-sensitive (TS) lethal mutants and devised a novel mapping method. The mapping method uses molecular inversion probes (MIP-MAP) in a targeted sequencing approach to genetic mapping, and replaces the Hawaiian strain with a Million Mutation Project strain with high genomic and phenotypic similarity to the laboratory wild-type strain N2. We validated MIP-MAP on a subset of the TS mutants using a competitive selection approach to produce TS candidate mapping intervals with a mean size < 3 Mb. MIP-MAP successfully uses a non-Hawaiian mapping strain, and multiplexed libraries are sequenced at a fraction of the cost of WGS mapping approaches. Our mapping results suggest that the collection of TS mutants contains a diverse library of TS alleles for genes essential to development and reproduction. MIP-MAP is a robust method to genetically map mutations in both viable and essential genes and should be adaptable to other organisms. It may also simplify tracking of individual genotypes within population mixtures.
Subjects
Caenorhabditis elegans/genetics, Chromosome Mapping/methods, Chromosomes/genetics, Mutation, Thermotolerance/genetics, Whole Genome Sequencing/methods, Animals, Caenorhabditis elegans/physiology, Caenorhabditis elegans Proteins/genetics, Chromosome Mapping/standards, Whole Genome Sequencing/standards
ABSTRACT
Advances in microscopy and fluorescent reporters have allowed us to detect the onset of gene expression on a cell-by-cell basis in a systemic fashion. This information, however, is often encoded in large repositories of images, and developing ways to extract this spatiotemporal expression data is a difficult problem that often uses complex domain-specific methods for each individual data set. We present a more unified approach that incorporates general prior information into a hierarchical probabilistic model to extract spatiotemporal gene expression from 4D confocal microscopy images of developing Caenorhabditis elegans embryos. This approach reduces the overall error rate of our automated lineage tracing pipeline by 3.8-fold, allowing us to routinely follow the C. elegans lineage to later stages of development, where individual neuronal subspecification becomes apparent. Unlike previous methods that often use custom approaches that are organism specific, our method uses generalized linear models and extensions of standard reversible jump Markov chain Monte Carlo methods that can be readily extended to other organisms for a variety of biological inference problems relating to cell fate specification. This modeling approach is flexible and provides tractable avenues for incorporating additional prior information into the model for similar difficult high-fidelity/low error tolerance image analysis problems for systematically applied genomic experiments.