Abstract
The generation of functional genomics datasets is surging, because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.
Subjects
Computer Security , Genomics , Privacy , Human Genome , Genotype , High-Throughput Nucleotide Sequencing , Humans , Phenotype , Phylogeny , Reproducibility of Results , RNA Sequence Analysis , Single-Cell Analysis
Abstract
The Encyclopedia of DNA Elements (ENCODE) is an ongoing collaborative research project aimed at identifying all the functional elements in the human and mouse genomes. Data generated by the ENCODE consortium are freely accessible at the ENCODE portal (https://www.encodeproject.org/), which is developed and maintained by the ENCODE Data Coordinating Center (DCC). Since the initial portal release in 2013, the ENCODE DCC has updated the portal to make ENCODE data more findable, accessible, interoperable and reusable. Here, we report on recent updates, including new ENCODE data and assays, ENCODE uniform data processing pipelines, new visualization tools, a dataset cart feature, unrestricted public access to ENCODE data on the cloud (Amazon Web Services open data registry, https://registry.opendata.aws/encode-project/) and more comprehensive tutorials and documentation.
Subjects
DNA/genetics , Genetic Databases , Human Genome , Software , Animals , Genomics , Humans , Mice
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
Subjects
DNA/genetics , Genetic Databases , Gene Components , Genomics , High-Throughput Nucleotide Sequencing , Metadata , Animals , Caenorhabditis elegans/genetics , Data Display , Datasets as Topic , Drosophila melanogaster/genetics , Forecasting , Human Genome , Humans , Mice/genetics , User-Computer Interface
Abstract
Type 1 narcolepsy, a disorder caused by a lack of hypocretin (orexin), is so strongly associated with human leukocyte antigen (HLA) class II HLA-DQA1*01:02-DQB1*06:02 (DQ0602) that very few non-DQ0602 cases have been reported. A known triggering factor for narcolepsy is pandemic 2009 influenza H1N1, suggesting autoimmunity triggered by upper-airway infections. Additional effects of other HLA-DQ alleles have been reported consistently across multiple ethnic groups. Using over 3,000 case and 10,000 control individuals of European and Chinese background, we examined the effects of other HLA loci. After careful matching of HLA-DR and HLA-DQ in case and control individuals, we found strong protective effects of HLA-DPA1*01:03-DPB1*04:02 (DP0402; odds ratio [OR] = 0.51 [0.38-0.67], p = 1.01 × 10^-6) and HLA-DPA1*01:03-DPB1*04:01 (DP0401; OR = 0.61 [0.47-0.80], p = 2.07 × 10^-4) and predisposing effects of HLA-DPB1*05:01 in Asians (OR = 1.76 [1.34-2.31], p = 4.71 × 10^-5). Similar effects were found by conditional analysis controlling for HLA-DR and HLA-DQ with DP0402 (OR = 0.45 [0.38-0.55], p = 8.99 × 10^-17) and DP0501 (OR = 1.38 [1.18-1.61], p = 7.11 × 10^-5). HLA-class-II-independent associations with HLA-A*11:01 (OR = 1.32 [1.13-1.54], p = 4.92 × 10^-4), HLA-B*35:03 (OR = 1.96 [1.41-2.70], p = 5.14 × 10^-5), and HLA-B*51:01 (OR = 1.49 [1.25-1.78], p = 1.09 × 10^-5) were also seen across ethnic groups in the HLA class I region. These effects might reflect modulation of autoimmunity or indirect effects of HLA class I and HLA-DP alleles on response to viral infections such as that of influenza.
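For readers unfamiliar with the "OR [lower-upper]" notation used in this abstract, the following is a minimal sketch of how an odds ratio and a Wald 95% confidence interval can be computed from a 2x2 table of allele carriers. It is not the authors' matched or conditional analysis; the function name `odds_ratio_ci` and all counts are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    z=1.96 corresponds to a 95% interval under a normal approximation
    on the log odds ratio.
    """
    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers
    )
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(
        1 / case_carriers
        + 1 / case_noncarriers
        + 1 / control_carriers
        + 1 / control_noncarriers
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (NOT taken from the study): an allele carried by
# 50 of 1,000 cases and 100 of 1,000 controls.
or_value, ci_low, ci_high = odds_ratio_ci(50, 950, 100, 900)
print(f"OR = {or_value:.2f} [{ci_low:.2f}-{ci_high:.2f}]")
```

An interval that lies entirely below 1, as in this made-up example, is what the abstract describes as a protective effect; an interval entirely above 1 indicates a predisposing effect.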
Subjects
HLA-DP beta-Chains/genetics , Histocompatibility Antigens Class I/genetics , Narcolepsy/genetics , Alleles , Asian People , Case-Control Studies , Cohort Studies , Female , Genetic Loci , HLA-B Antigens/genetics , HLA-B Antigens/metabolism , HLA-DP Antigens/genetics , HLA-DP Antigens/metabolism , HLA-DP beta-Chains/metabolism , HLA-DQ alpha-Chains/genetics , HLA-DQ alpha-Chains/metabolism , HLA-DR Antigens/genetics , HLA-DR Antigens/metabolism , Haplotypes , Histocompatibility Antigens Class I/metabolism , Humans , Influenza A Virus H1N1 Subtype/genetics , Male , Risk Factors , White People
Abstract
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues, using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines to promote data provenance and reproducibility and to allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available on GitHub, with images available on Docker Hub (https://hub.docker.com), enabling access for a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud lets small labs use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections, a prerequisite for successful integrative analyses.
Abstract
The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
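The triplet idea in this abstract can be illustrated with a small sketch. Each transcript is reduced to a (transcript start site, exon junction chain, transcript end site) triplet, and a gene is then placed on a simplex according to how many distinct values it has for each component. The abstract does not state how the simplex coordinates are normalized, so the simple proportions used here are an assumption, and the names `Triplet` and `simplex_coordinates` and the identifiers in the example are hypothetical.

```python
from collections import namedtuple

# One transcript described by the three sources of structural diversity.
Triplet = namedtuple("Triplet", ["tss", "exon_junction_chain", "tes"])


def simplex_coordinates(triplets):
    """Place one gene on a 2-simplex; the three coordinates sum to 1.

    Assumption: each coordinate is the share of distinct start sites,
    exon junction chains, or end sites among the gene's transcripts.
    """
    if not triplets:
        raise ValueError("a gene needs at least one transcript")
    n_tss = len({t.tss for t in triplets})
    n_ec = len({t.exon_junction_chain for t in triplets})
    n_tes = len({t.tes for t in triplets})
    total = n_tss + n_ec + n_tes
    return (n_tss / total, n_ec / total, n_tes / total)


# Hypothetical gene with four transcripts; the integers are arbitrary IDs.
gene_transcripts = [
    Triplet(tss=1, exon_junction_chain=1, tes=1),
    Triplet(tss=1, exon_junction_chain=2, tes=1),
    Triplet(tss=2, exon_junction_chain=2, tes=1),
    Triplet(tss=1, exon_junction_chain=3, tes=2),
]
coords = simplex_coordinates(gene_transcripts)
print(coords)
```

In this made-up gene the largest coordinate is the exon junction chain share, which would correspond to a gene whose diversity comes mainly from splicing rather than from promoter or 3' end choice.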
Abstract
Narcolepsy type 1 (NT1) is caused by a loss of hypocretin/orexin transmission. Risk factors include pandemic 2009 H1N1 influenza A infection and immunization with Pandemrix®. Here, we dissect disease mechanisms and interactions with environmental triggers in a multi-ethnic sample of 6,073 cases and 84,856 controls. We fine-mapped GWAS signals within HLA (DQ0602, DQB1*03:01 and DPB1*04:02) and discovered seven novel associations (CD207, NAB1, IKZF4-ERBB3, CTSC, DENND1B, SIRPG, PRF1). Significant signals at TRA and DQB1*06:02 loci were found in 245 vaccination-related cases, who also shared polygenic risk. T cell receptor associations in NT1 modulated TRAJ*24, TRAJ*28 and TRBV*4-2 chain usage. Partitioned heritability and immune cell enrichment analyses found genetic signals to be driven by dendritic and helper T cells. Lastly, comorbidity analysis using data from FinnGen suggests shared effects between NT1 and other autoimmune diseases. NT1 genetic variants shape autoimmunity and response to environmental triggers, including influenza A infection and immunization with Pandemrix®.
Subjects
Autoimmune Diseases , Influenza A Virus H1N1 Subtype , Influenza Vaccines , Human Influenza , Narcolepsy , Humans , Autoimmunity/genetics , Human Influenza/epidemiology , Human Influenza/genetics , Influenza A Virus H1N1 Subtype/genetics , Autoimmune Diseases/epidemiology , Autoimmune Diseases/genetics , Influenza Vaccines/adverse effects , Narcolepsy/chemically induced , Narcolepsy/genetics
Abstract
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to these data and the relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human- and machine-readable form and enables the user to search for specific data either using a web browser or programmatically via a REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors. Basic Protocol: Query the portal; Support Protocol 1: Batch downloading; Support Protocol 2: Using the cart to download files; Support Protocol 3: Visualize data; Alternate Protocol: Query building and programmatic access.
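As an illustration of the programmatic access mentioned in this abstract, the sketch below builds a search URL that asks the portal for JSON rather than an HTML page. The abstract gives only the portal address and says a REST API exists; the `/search/` path, the `type`, `limit` and `format` parameters, and the filter values are assumptions here and should be checked against the portal's own documentation. The helper name `build_search_url` is hypothetical. No request is sent, so the example runs offline.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.encodeproject.org"  # portal address from the abstract


def build_search_url(filters, limit=25):
    """Return a search URL requesting JSON output.

    Assumption: the portal accepts a /search/ path with key=value
    filters plus `limit` and `format=json`.
    """
    params = list(filters.items()) + [("limit", limit), ("format", "json")]
    return f"{BASE_URL}/search/?{urlencode(params)}"


# Hypothetical filter values, chosen only for illustration.
example_url = build_search_url({"type": "Experiment", "assay_title": "ATAC-seq"})
print(example_url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; keeping URL construction separate from the network call makes the query easy to inspect and test before it is sent.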
Subjects
Chromatin/metabolism , DNA/genetics , Genetic Databases , Epigenomics/methods , Animals , DNA Methylation , Human Genome , Humans , Internet , Metadata , Mice , Software
Abstract
Database URL: https://www.encodeproject.org/.