Results 1 - 20 of 22
1.
Cell ; 183(4): 905-917.e16, 2020 11 12.
Article in English | MEDLINE | ID: mdl-33186529

ABSTRACT

The generation of functional genomics datasets is surging because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure that allows raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.
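To make the sanitization idea concrete, here is a minimal Python sketch under stated assumptions: reads are plain (chromosome, start, sequence) records, and leakage is reduced by masking bases that overlap known polymorphic positions. The data structures and masking rule are illustrative only, not the paper's published procedure or file format.

```python
# Illustrative sketch of read sanitization: mask bases overlapping known
# variant positions so shared reads leak less genotype information.
# KNOWN_SNPS and the 'N' masking rule are assumptions for illustration.

KNOWN_SNPS = {("chr1", 1_000_113), ("chr1", 1_000_158)}  # (chrom, 0-based pos)

def sanitize_read(chrom: str, start: int, seq: str) -> str:
    """Replace any base that overlaps a known SNP with 'N'."""
    bases = list(seq)
    for offset in range(len(bases)):
        if (chrom, start + offset) in KNOWN_SNPS:
            bases[offset] = "N"
    return "".join(bases)

print(sanitize_read("chr1", 1_000_100, "ACGTACGTACGTACGTACGT"))
```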


Subject(s)
Computer Security , Genomics , Privacy , Genome, Human , Genotype , High-Throughput Nucleotide Sequencing , Humans , Phenotype , Phylogeny , Reproducibility of Results , Sequence Analysis, RNA , Single-Cell Analysis
2.
Nature ; 583(7818): 744-751, 2020 07.
Article in English | MEDLINE | ID: mdl-32728240

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) project has established a genomic resource for mammalian development, profiling a diverse panel of mouse tissues at 8 developmental stages from 10.5 days after conception until birth, including transcriptomes, methylomes and chromatin states. Here we systematically examined the state and accessibility of chromatin in the developing mouse fetus. In total we performed 1,128 chromatin immunoprecipitation with sequencing (ChIP-seq) assays for histone modifications and 132 assay for transposase-accessible chromatin using sequencing (ATAC-seq) assays for chromatin accessibility across 72 distinct tissue-stages. We used integrative analysis to develop a unified set of chromatin state annotations, infer the identities of dynamic enhancers and key transcriptional regulators, and characterize the relationship between chromatin state and accessibility during developmental gene regulation. We also leveraged these data to link enhancers to putative target genes and demonstrate tissue-specific enrichments of sequence variants associated with disease in humans. The mouse ENCODE data sets provide a compendium of resources for biomedical researchers and achieve, to our knowledge, the most comprehensive view of chromatin dynamics during mammalian fetal development to date.


Subject(s)
Chromatin/genetics , Chromatin/metabolism , Datasets as Topic , Fetal Development/genetics , Histones/metabolism , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid/genetics , Animals , Chromatin/chemistry , Chromatin Immunoprecipitation Sequencing , Disease/genetics , Enhancer Elements, Genetic/genetics , Female , Gene Expression Regulation, Developmental/genetics , Genetic Variation , Histones/chemistry , Humans , Male , Mice , Mice, Inbred C57BL , Organ Specificity/genetics , Reproducibility of Results , Transposases/metabolism
5.
Nucleic Acids Res ; 48(D1): D882-D889, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31713622

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) is an ongoing collaborative research project aimed at identifying all the functional elements in the human and mouse genomes. Data generated by the ENCODE consortium are freely accessible at the ENCODE portal (https://www.encodeproject.org/), which is developed and maintained by the ENCODE Data Coordinating Center (DCC). Since the initial portal release in 2013, the ENCODE DCC has updated the portal to make ENCODE data more findable, accessible, interoperable and reusable. Here, we report on recent updates, including new ENCODE data and assays, ENCODE uniform data processing pipelines, new visualization tools, a dataset cart feature, unrestricted public access to ENCODE data on the cloud (Amazon Web Services open data registry, https://registry.opendata.aws/encode-project/) and more comprehensive tutorials and documentation.
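As a sketch of the unrestricted cloud access mentioned above, the snippet below lists a few objects from the public ENCODE bucket without AWS credentials. The bucket name encode-public is an assumption; verify it against the registry entry before relying on it.

```python
# List a few objects in the public ENCODE bucket anonymously.
# Bucket name "encode-public" is an assumption; confirm it at
# https://registry.opendata.aws/encode-project/ before use.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="encode-public", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```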


Subject(s)
DNA/genetics , Databases, Genetic , Genome, Human , Software , Animals , Genomics , Humans , Mice
6.
Nucleic Acids Res ; 46(D1): D794-D801, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126249

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to (meta)data from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13,000 datasets and their accompanying metadata and can be accessed at https://www.encodeproject.org/.
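The web-based download interface described above lends itself to simple scripting. A hedged sketch, assuming you have exported a plain-text manifest of file URLs (called files.txt here, one URL per line) from a Portal search:

```python
# Download every file URL listed in a manifest exported from the portal.
# "files.txt" as the manifest name is an assumption for illustration.
import pathlib
import urllib.request

manifest = pathlib.Path("files.txt")
for url in manifest.read_text().splitlines():
    url = url.strip()
    if not url:
        continue
    dest = url.rsplit("/", 1)[-1]  # save under the file's own name
    print("fetching", url)
    urllib.request.urlretrieve(url, dest)
```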


Subject(s)
DNA/genetics , Databases, Genetic , Gene Components , Genomics , High-Throughput Nucleotide Sequencing , Metadata , Animals , Caenorhabditis elegans/genetics , Data Display , Datasets as Topic , Drosophila melanogaster/genetics , Forecasting , Genome, Human , Humans , Mice/genetics , User-Computer Interface
7.
Nucleic Acids Res ; 44(D1): D726-32, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26527727

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Project is in its third phase of creating a comprehensive catalog of functional elements in the human genome. This phase of the project includes an expansion of assays that measure diverse RNA populations, identify proteins that interact with RNA and DNA, probe regions of DNA hypersensitivity, and measure levels of DNA methylation in a wide range of cell and tissue types to identify putative regulatory elements. To date, results for almost 5000 experiments have been released for use by the scientific community. These data are available for searching, visualization and download at the new ENCODE Portal (www.encodeproject.org). The revamped ENCODE Portal provides new ways to browse and search the ENCODE data based on the metadata that describe the assays as well as summaries of the assays that focus on data provenance. In addition, it is a flexible platform that allows integration of genomic data from multiple projects. The portal experience was designed to improve access to ENCODE data by relying on metadata that allow reusability and reproducibility of the experiments.


Subject(s)
Databases, Genetic , Genome, Human , Genomics , Animals , DNA/metabolism , Genes , Humans , Mice , Proteins/metabolism , RNA/metabolism
8.
Genome Biol ; 24(1): 79, 2023 04 18.
Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.
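To illustrate why redundant performance measures add little information, here is a small sketch computing two common track-level metrics side by side. The signal vectors are synthetic stand-ins, not ENCODE Imputation Challenge data.

```python
# Compare an "imputed" genomic signal track against an "observed" one with
# two common measures; strongly correlated measures are largely redundant.
# The tracks below are synthetic stand-ins, not challenge data.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=1.0, size=10_000)
imputed = observed + rng.normal(scale=0.5, size=observed.size)

mse = float(np.mean((observed - imputed) ** 2))
pearson = float(np.corrcoef(observed, imputed)[0, 1])
print(f"MSE={mse:.3f}  Pearson r={pearson:.3f}")
```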


Subject(s)
Algorithms , Epigenomics , Genomics/methods
9.
bioRxiv ; 2023 Apr 06.
Article in English | MEDLINE | ID: mdl-37066421

ABSTRACT

The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available on GitHub, with images available on Docker Hub (https://hub.docker.com), enabling access for a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
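A minimal sketch of launching such a pipeline locally through Cromwell, as described above. The jar, workflow, and inputs filenames are placeholders; the command mirrors Cromwell's documented run mode.

```python
# Launch a WDL workflow with Cromwell from Python; filenames are placeholders.
# Equivalent shell: java -jar cromwell.jar run pipeline.wdl --inputs inputs.json
import subprocess

subprocess.run(
    ["java", "-jar", "cromwell.jar", "run", "pipeline.wdl", "--inputs", "inputs.json"],
    check=True,  # raise if the workflow fails
)
```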

10.
Res Sq ; 2023 Jul 19.
Article in English | MEDLINE | ID: mdl-37503119

ABSTRACT

The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available on GitHub, with images available on Docker Hub (https://hub.docker.com), enabling access for a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.

11.
bioRxiv ; 2023 May 16.
Article in English | MEDLINE | ID: mdl-37292896

ABSTRACT

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
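The triplet framework described above lends itself to a compact data structure: each transcript reduces to a (TSS, exon junction chain, TES) triple, and per-gene diversity is the number of distinct values in each slot. A toy sketch with invented coordinates:

```python
# Toy illustration of the (TSS, junction chain, TES) triplet framework:
# count how many distinct starts, splice patterns, and ends a gene uses.
# All transcripts and coordinates below are invented for illustration.
from collections import namedtuple

Triplet = namedtuple("Triplet", ["tss", "junctions", "tes"])

transcripts = [  # one hypothetical gene
    Triplet(tss=100, junctions=(200, 300, 400, 500), tes=900),
    Triplet(tss=100, junctions=(200, 300, 450, 500), tes=900),  # new splice pattern
    Triplet(tss=150, junctions=(200, 300, 400, 500), tes=900),  # alternative promoter
]

n_tss = len({t.tss for t in transcripts})
n_chains = len({t.junctions for t in transcripts})
n_tes = len({t.tes for t in transcripts})
print(f"distinct TSS={n_tss}, junction chains={n_chains}, TES={n_tes}")
```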

12.
Sci Rep ; 10(1): 7933, 2020 05 13.
Article in English | MEDLINE | ID: mdl-32404971

ABSTRACT

ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNNs), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs and integrate it into a software pipeline called CNN-Peaks. As training data for our model, we use data labeled by human researchers who annotate the presence or absence of peaks in some genomic segments. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets, including benchmark datasets commonly used for validation of peak-calling methods. We observe a performance superior to that of previous methods.
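As an illustration of the model family the approach relies on (not the published CNN-Peaks architecture), a minimal 1D convolutional classifier over fixed-length coverage segments might look like this:

```python
# Minimal 1D CNN that classifies a coverage segment as peak / no-peak.
# Illustrative architecture only, not the published CNN-Peaks model.
import torch
from torch import nn

class PeakClassifier(nn.Module):
    def __init__(self, segment_len: int = 200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(32, 2),  # logits: [no-peak, peak]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # x: (batch, 1, segment_len)

model = PeakClassifier()
coverage = torch.rand(4, 1, 200)  # 4 random coverage segments
print(model(coverage).shape)  # torch.Size([4, 2])
```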


Subject(s)
Chromatin Immunoprecipitation Sequencing , Computational Biology/methods , Neural Networks, Computer , Software , Algorithms , Binding Sites , Chromatin Immunoprecipitation Sequencing/methods , Databases, Nucleic Acid , Epigenesis, Genetic , Epigenomics/methods , Histones/metabolism , Humans , Nucleotide Motifs , Protein Binding , Transcription Initiation Site
13.
Curr Protoc Bioinformatics ; 68(1): e89, 2019 12.
Article in English | MEDLINE | ID: mdl-31751002

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, The NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to the aforementioned data and relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human- and machine-readable form and enables the user to search for specific data either using a web browser or programmatically via REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors.
Basic Protocol: Query the portal
Support Protocol 1: Batch downloading
Support Protocol 2: Using the cart to download files
Support Protocol 3: Visualize data
Alternate Protocol: Query building and programmatic access.
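The programmatic route mentioned above (the REST API returning JSON) takes only a few lines; the search parameters below are illustrative examples, not an exhaustive description of the interface:

```python
# Query the ENCODE portal's REST API for experiments; the parameters are
# illustrative. Results arrive as JSON with hits under "@graph".
import requests

resp = requests.get(
    "https://www.encodeproject.org/search/",
    params={"type": "Experiment", "assay_title": "ATAC-seq", "limit": 5},
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("@graph", []):
    print(hit.get("accession"), hit.get("assay_title"))
```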


Subject(s)
Chromatin/metabolism , DNA/genetics , Databases, Genetic , Epigenomics/methods , Animals , DNA Methylation , Genome, Human , Humans , Internet , Metadata , Mice , Software
15.
Mol Cell Biol ; 23(24): 9275-82, 2003 Dec.
Article in English | MEDLINE | ID: mdl-14645537

ABSTRACT

Single-copy gene and promoter regions have been excised from yeast chromosomes and have been purified as chromatin by conventional and affinity methods. Promoter regions isolated in transcriptionally repressed and activated states maintain their characteristic chromatin structures. Gel filtration analysis establishes the uniformity of the transcriptionally activated state. Activator proteins interact in the manner anticipated from previous studies in vivo. This work opens the way to the direct study of specific gene regions of eukaryotic chromosomes in diverse functional and structural states.


Subject(s)
Chromatin/genetics , Chromatin/isolation & purification , Chromosomes, Fungal/genetics , Saccharomyces cerevisiae/genetics , Chromatography, Affinity , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Genes, Fungal , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , Promoter Regions, Genetic , Recombination, Genetic , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Trans-Activators/genetics , Trans-Activators/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Transcriptional Activation
16.
PLoS One ; 12(4): e0175310, 2017.
Article in English | MEDLINE | ID: mdl-28403240

ABSTRACT

The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.
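SnoVault validates submitted metadata against JSON schemas; the general pattern can be sketched with the standalone jsonschema package and a deliberately invented miniature schema (not an actual SnoVault schema):

```python
# Validate a metadata object against a JSON Schema, the general mechanism
# SnoVault-style databases apply on submission. The schema here is invented.
from jsonschema import validate, ValidationError

experiment_schema = {
    "type": "object",
    "required": ["accession", "assay_term_name"],
    "properties": {
        "accession": {"type": "string", "pattern": "^ENC[A-Z0-9]+$"},
        "assay_term_name": {"type": "string"},
    },
}

metadata = {"accession": "ENCSR000XYZ", "assay_term_name": "ChIP-seq"}
try:
    validate(instance=metadata, schema=experiment_schema)
    print("metadata is valid")
except ValidationError as err:
    print("rejected:", err.message)
```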


Subject(s)
Databases, Genetic , Genomics/methods , Metadata , Software , Animals , DNA/genetics , Genome , Humans , Mice
17.
Article in English | MEDLINE | ID: mdl-26980513

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) is responsible for organizing, describing and providing access to the diverse data generated by the ENCODE project. The description of these data, known as metadata, includes the biological sample used as input, the protocols and assays performed on these samples, the data files generated from the results and the computational methods used to analyze the data. Here, we outline the principles and philosophy used to define the ENCODE metadata in order to create a metadata standard that can be applied to diverse assays and multiple genomic projects. In addition, we present how the data are validated and used by the ENCODE DCC in creating the ENCODE Portal (https://www.encodeproject.org/). Database URL: www.encodeproject.org.


Subject(s)
Computational Biology/methods , DNA/genetics , Databases, Genetic , Algorithms , Animals , Caenorhabditis elegans , Computational Biology/standards , Data Collection , Drosophila melanogaster , High-Throughput Nucleotide Sequencing , Humans , Mice , Nucleic Acids/genetics , Quality Control , Reproducibility of Results , Sequence Alignment
18.
FEBS Lett ; 579(4): 899-903, 2005 Feb 07.
Article in English | MEDLINE | ID: mdl-15680971

ABSTRACT

An RNA polymerase II promoter has been isolated in transcriptionally activated and repressed states. Topological and nuclease digestion analyses have revealed a dynamic equilibrium between nucleosome removal and reassembly upon transcriptional activation, and have further shown that nucleosomes are removed by eviction of histone octamers rather than by sliding. The promoter, once exposed, assembles with RNA polymerase II, general transcription factors, and Mediator in an approximately 3 MDa transcription initiation complex. X-ray crystallography has revealed the structure of RNA polymerase II, in the act of transcription, at atomic resolution. Extension of this analysis has shown how nucleotides undergo selection, polymerization, and eventual release from the transcribing complex. X-ray and electron crystallography have led to a picture of the entire transcription initiation complex, elucidating the mechanisms of promoter recognition, DNA unwinding, abortive initiation, and promoter escape.


Subject(s)
Eukaryotic Cells/metabolism , Nucleosomes/chemistry , Promoter Regions, Genetic/genetics , RNA Polymerase II/chemistry , Transcription, Genetic , Crystallography, X-Ray , Molecular Structure
19.
Article in English | MEDLINE | ID: mdl-25776021

ABSTRACT

The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a catalog of genomic annotations. To date, the project has generated over 4000 experiments across more than 350 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory network and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All ENCODE experimental data, metadata and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage and distribution to community resources and the scientific community. As the volume of data increases, the organization of experimental details becomes increasingly complicated and demands careful curation to identify related experiments. Here, we describe the ENCODE DCC's use of ontologies to standardize experimental metadata. We discuss how ontologies, when used to annotate metadata, provide improved searching capabilities and facilitate the ability to find connections within a set of experiments. Additionally, we provide examples of how ontologies are used to annotate ENCODE metadata and how the annotations can be identified via ontology-driven searches at the ENCODE portal. As genomic datasets grow larger and more interconnected, standardization of metadata becomes increasingly vital to allow for exploration and comparison of data between different scientific projects.
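A toy sketch of the ontology-backed annotation described above: free-text biosample labels are normalized to ontology term IDs so that a search on a parent term also retrieves its descendants. The term IDs are real UBERON/CL identifiers, but the mapping table and hierarchy edges are invented for illustration.

```python
# Toy ontology annotation: normalize free-text sample labels to ontology
# terms so related experiments become discoverable together.
# The mapping and the parent/child edges are invented for illustration.
TERM_IDS = {
    "liver": "UBERON:0002107",
    "hepatocyte": "CL:0000182",
}
PARENTS = {"CL:0000182": "UBERON:0002107"}  # hepatocyte placed under liver

def annotate(label: str) -> str | None:
    return TERM_IDS.get(label.lower())

def matches(query_term: str, sample_term: str | None) -> bool:
    """True if sample_term is the query term or a descendant of it."""
    while sample_term is not None:
        if sample_term == query_term:
            return True
        sample_term = PARENTS.get(sample_term)
    return False

print(matches("UBERON:0002107", annotate("Hepatocyte")))  # True
```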


Subject(s)
Data Curation/methods , Databases, Genetic , Gene Ontology , Gene Regulatory Networks/physiology , Molecular Sequence Annotation/methods , Transcription, Genetic/physiology , Animals , Humans , Mice