ABSTRACT
Newborn mice emit signals that promote parenting from mothers and fathers but trigger aggressive responses from virgin males. Although pup-directed attacks by males require vomeronasal function, the specific infant cues that elicit this behavior are unknown. We developed a behavioral paradigm based on reconstituted pup cues and showed that discrete infant morphological features combined with salivary chemosignals elicit robust male aggression. Seven vomeronasal receptors were identified based on infant-mediated activity, and the involvement of two receptors, Vmn2r65 and Vmn2r88, in infant-directed aggression was demonstrated by genetic deletion. Using the activation of these receptors as readouts for biochemical fractionation, we isolated two pheromonal compounds, the submandibular gland protein C and hemoglobins. Unexpectedly, none of the identified vomeronasal receptors and associated cues were specific to pups. Thus, infant-mediated aggression by virgin males relies on the recognition of pups' physical traits in addition to parental and infant chemical cues.
Subjects
Aggression; Vomeronasal Organ/metabolism; Animals; Animals, Newborn; Gene Deletion; Male; Mice; Mice, Mutant Strains
ABSTRACT
The function of some genetic variants associated with brain-relevant traits has been explained through colocalization with expression quantitative trait loci (eQTL) conducted in bulk postmortem adult brain tissue. However, many brain-trait associated loci have unknown cellular or molecular function. These genetic variants may exert context-specific function on different molecular phenotypes including post-transcriptional changes. Here, we identified genetic regulation of RNA editing and alternative polyadenylation (APA) within a cell-type-specific population of human neural progenitors and neurons. More RNA editing and isoforms utilizing longer polyadenylation sequences were observed in neurons, likely due to higher expression of genes encoding the proteins mediating these post-transcriptional events. We also detected hundreds of cell-type-specific editing quantitative trait loci (edQTLs) and alternative polyadenylation QTLs (apaQTLs). We found colocalizations of a neuron edQTL in CCDC88A with educational attainment and a progenitor apaQTL in EP300 with schizophrenia, suggesting that genetically mediated post-transcriptional regulation during brain development leads to differences in brain function.
Subjects
Neurogenesis; Neurons; Quantitative Trait Loci; Humans; Neurogenesis/genetics; Neurons/metabolism; RNA Editing/genetics; Polyadenylation/genetics; Schizophrenia/genetics; Gene Expression Regulation; Neural Stem Cells/metabolism; Neural Stem Cells/cytology; Brain/metabolism; RNA Processing, Post-Transcriptional/genetics
ABSTRACT
Understanding the molecular mechanisms of complex traits is essential for developing targeted interventions. We analyzed liver expression quantitative-trait locus (eQTL) meta-analysis data on 1,183 participants to identify conditionally distinct signals. We found 9,013 eQTL signals for 6,564 genes; 23% of eGenes had two signals, and 6% had three or more signals. We then integrated the eQTL results with data from 29 cardiometabolic genome-wide association study (GWAS) traits and identified 1,582 GWAS-eQTL colocalizations for 747 eGenes. Non-primary eQTL signals accounted for 17% of all colocalizations. Isolating signals by conditional analysis prior to coloc resulted in 37% more colocalizations than using marginal eQTL and GWAS data, highlighting the importance of signal isolation. Isolating signals also led to stronger evidence of colocalization: among 343 eQTL-GWAS signal pairs in multi-signal regions, analyses that isolated the signals of interest resulted in higher posterior probability of colocalization for 41% of tests. Leveraging allelic heterogeneity, we predicted causal effects of gene expression on liver traits for four genes. To predict functional variants and regulatory elements, we colocalized eQTL with liver chromatin accessibility QTL (caQTL) and found 391 colocalizations, including 73 with non-primary eQTL signals and 60 eQTL signals that colocalized with both a caQTL and a GWAS signal. Finally, we used publicly available massively parallel reporter assays in HepG2 to highlight 14 eQTL signals that include at least one expression-modulating variant. This multi-faceted approach to unraveling the genetic underpinnings of liver-related traits could lead to therapeutic development.
Subjects
Genome-Wide Association Study; Liver; Quantitative Trait Loci; Humans; Alleles; Cardiovascular Diseases/genetics; Genetic Predisposition to Disease; Liver/metabolism; Phenotype; Polymorphism, Single Nucleotide
ABSTRACT
The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.
Subjects
Software; Humans; Computational Biology/methods; Leukocytes, Mononuclear/metabolism; Leukocytes, Mononuclear/cytology; Genomics/methods; Data Analysis
ABSTRACT
Three-dimensional (3D) chromatin structure has been shown to play a role in regulating gene transcription during biological transitions. Although our understanding of loop formation and maintenance is rapidly improving, much less is known about the mechanisms driving changes in looping and the impact of differential looping on gene transcription. One limitation has been a lack of well-powered differential looping data sets. To address this, we conducted a deeply sequenced Hi-C time course of megakaryocyte development comprising four biological replicates and 6 billion reads per time point. Statistical analysis revealed 1503 differential loops. Gained loop anchors were enriched for AP-1 occupancy and were characterized by large increases in histone H3K27ac (over 11-fold) but relatively small increases in CTCF and RAD21 binding (1.26- and 1.23-fold, respectively). Linear modeling revealed that changes in histone H3K27ac, chromatin accessibility, and JUN binding were better correlated with changes in looping than RAD21 and almost as well correlated as CTCF. Changes to epigenetic features between-rather than at-boundaries were highly predictive of changes in looping. Together these data suggest that although CTCF and RAD21 may be the core machinery dictating where loops form, other features (both at the anchors and within the loop boundaries) may play a larger role than previously anticipated in determining the relative loop strength across cell types and conditions.
Subjects
Chromatin; Histones; Histones/metabolism; CCCTC-Binding Factor/genetics; CCCTC-Binding Factor/metabolism; Chromatin/genetics; Chromosomes/metabolism; Cell Differentiation/genetics
ABSTRACT
Most approaches to transcript quantification rely on fixed reference annotations; however, the transcriptome is dynamic and depending on the context, such static annotations contain inactive isoforms for some genes, whereas they are incomplete for others. Here we present Bambu, a method that performs machine-learning-based transcript discovery to enable quantification specific to the context of interest using long-read RNA-sequencing. To identify novel transcripts, Bambu estimates the novel discovery rate, which replaces arbitrary per-sample thresholds with a single, interpretable, precision-calibrated parameter. Bambu retains the full-length and unique read counts, enabling accurate quantification in the presence of inactive isoforms. Compared to existing methods for transcript discovery, Bambu achieves greater precision without sacrificing sensitivity. We show that context-aware annotations improve quantification for both novel and known transcripts. We apply Bambu to quantify isoforms from repetitive HERVH-LTR7 retrotransposons in human embryonic stem cells, demonstrating the ability for context-specific transcript expression analysis.
Subjects
Gene Expression Profiling; Transcriptome; Humans; RNA-Seq; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Protein Isoforms/genetics
ABSTRACT
Integrative approaches that simultaneously model multi-omics data have gained increasing popularity because they provide holistic system biology views of multiple or all components in a biological system of interest. Canonical correlation analysis (CCA) is a correlation-based integrative method designed to extract latent features shared between multiple assays by finding the linear combinations of features-referred to as canonical variables (CVs)-within each assay that achieve maximal across-assay correlation. Although widely acknowledged as a powerful approach for multi-omics data, CCA has not been systematically applied to multi-omics data in large cohort studies, which has only recently become available. Here, we adapted sparse multiple CCA (SMCCA), a widely-used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS). To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm with SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays. Effective application of SMCCA to the two real datasets reveals important findings. Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment of blood cell composition should be considered in protein-based association studies. Importantly, CVs obtained from two independent cohorts also demonstrate transferability across the cohorts. For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance in MESA, explaining 39.0% ~ 50.0% variation in JHS and 38.9% ~ 49.1% in MESA. Similar transferability was observed for other omics-CV-trait pairs. 
This suggests that biologically meaningful and cohort-agnostic variation is captured by CVs. We anticipate that applying our SMCCA-GS and SSMCCA on various cohorts would help identify cohort-agnostic biologically meaningful relationships between multi-omics data and phenotypic traits.
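The Gram-Schmidt step mentioned in this abstract can be illustrated with a minimal sketch. It shows only generic Gram-Schmidt orthogonalization of a set of vectors standing in for canonical variables; how the authors couple it with SMCCA is not described here, so the function name `gram_schmidt`, the tolerance, and the toy vectors are assumptions for illustration.

```python
def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal basis spanning the input vectors.

    Hypothetical helper: each vector stands in for one canonical
    variable (CV) evaluated on the same samples.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # b has unit length, so the projection coefficient is a dot product.
            coef = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # drop vectors that are numerically dependent
            basis.append([wi / norm for wi in w])
    return basis


# Toy CVs that are correlated before orthogonalization.
cvs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
orthogonal_cvs = gram_schmidt(cvs)
```

Subtracting projections from the running vector `w` (the "modified" variant) tends to be more numerically stable than projecting the original vector each time.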
Subjects
Canonical Correlation Analysis; Proteomics; Humans; Proteomics/methods; Multiomics; Cohort Studies
ABSTRACT
Genomic imprinting results in gene expression bias caused by parental chromosome of origin and occurs in genes with important roles during human brain development. However, the cell-type and temporal specificity of imprinting during human neurogenesis is generally unknown. By detecting within-donor allelic biases in chromatin accessibility and gene expression that are unrelated to cross-donor genotype, we inferred imprinting in both primary human neural progenitor cells and their differentiated neuronal progeny from up to 85 donors. We identified 43/20 putatively imprinted regulatory elements (IREs) in neurons/progenitors, and 133/79 putatively imprinted genes in neurons/progenitors. Although 10 IREs and 42 genes were shared between neurons and progenitors, most putative imprinting was only detected within specific cell types. In addition to well-known imprinted genes and their promoters, we inferred novel putative IREs and imprinted genes. Consistent with both DNA methylation-based and H3K27me3-based regulation of imprinted expression, some putative IREs also overlapped with differentially methylated or histone-marked regions. Finally, we identified a progenitor-specific putatively imprinted gene overlapping with copy number variation that is associated with uniparental disomy-like phenotypes. Our results can therefore be useful in interpreting the function of variants identified in future parent-of-origin association studies.
Subjects
DNA Copy Number Variations; DNA Methylation; Humans; DNA Methylation/genetics; Genomic Imprinting/genetics; Uniparental Disomy; Cell Differentiation/genetics
ABSTRACT
Understanding the function of the human microbiome is important but the development of statistical methods specifically for the microbial gene expression (i.e. metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this gap, we undertook a comprehensive evaluation and benchmarking of 10 differential analysis methods for metatranscriptomics data. We used a combination of real and simulated data to evaluate performance (i.e. type I error, false discovery rate and sensitivity) of the following methods: log-normal (LN), logistic-beta (LB), MAST, DESeq2, metagenomeSeq, ANCOM-BC, LEfSe, ALDEx2, Kruskal-Wallis and two-part Kruskal-Wallis. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of childhood dental disease (early childhood caries, ECC), whereas validations were sought in two additional datasets from the ECC study and an inflammatory bowel disease study. The LB test showed the highest sensitivity in both small and large samples and reasonably controlled type I error. Contrarily, MAST was hampered by inflated type I error. Upon application of the LN and LB tests in the ECC study, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with childhood dental disease. This comprehensive model evaluation offers practical guidance for selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics. Selection of an optimal method increases the possibility of detecting true signals while minimizing the chance of claiming false ones.
Subjects
Benchmarking; Stomatognathic Diseases; Child; Humans; Child, Preschool; Biofilms; Computer Simulation; Lactic Acid
ABSTRACT
The three-dimensional arrangement of the human genome comprises a complex network of structural and regulatory chromatin loops important for coordinating changes in transcription during human development. To better understand the mechanisms underlying context-specific 3D chromatin structure and transcription during cellular differentiation, we generated comprehensive in situ Hi-C maps of DNA loops in human monocytes and differentiated macrophages. We demonstrate that dynamic looping events are regulatory rather than structural in nature and uncover widespread coordination of dynamic enhancer activity at preformed and acquired DNA loops. Enhancer-bound loop formation and enhancer activation of preformed loops together form multi-loop activation hubs at key macrophage genes. Activation hubs connect 3.4 enhancers per promoter and exhibit a strong enrichment for activator protein 1 (AP-1)-binding events, suggesting that multi-loop activation hubs involving cell-type-specific transcription factors represent an important class of regulatory chromatin structures for the spatiotemporal control of transcription.
Subjects
Cell Differentiation; Chromatin Assembly and Disassembly; Chromatin/metabolism; DNA/metabolism; Macrophages/metabolism; Transcription Factor AP-1/metabolism; Transcription, Genetic; Binding Sites; Cell Line, Tumor; Chromatin/chemistry; Chromatin/genetics; DNA/chemistry; DNA/genetics; Enhancer Elements, Genetic; Gene Expression Regulation; High-Throughput Nucleotide Sequencing; Humans; Nucleic Acid Conformation; Phenotype; Protein Binding; Time Factors; Transcription Factor AP-1/genetics
ABSTRACT
Etiologic heterogeneity occurs when distinct sets of events or exposures give rise to different subtypes of disease. Inference about subtype-specific exposure effects from two-phase outcome-dependent sampling data requires adjustment for both confounding and the sampling design. Common approaches to inference for these effects do not necessarily appropriately adjust for these sources of bias, or allow for formal comparisons of effects across different subtypes. Herein, using inverse probability weighting (IPW) to fit a multinomial model is shown to yield valid inference with this sampling design for subtype-specific exposure effects and contrasts thereof. The IPW approach is compared to common regression-based methods for assessing exposure effect heterogeneity using simulations. The methods are applied to estimate subtype-specific effects of various exposures on breast cancer risk in the Carolina Breast Cancer Study.
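The weighting idea in this abstract can be made concrete with a small sketch. It covers only the simplest case, a single binary exposure with no confounders, where the inverse-probability-weighted fit of a saturated multinomial model reduces to odds ratios of weighted cell counts. The function name, the record layout, and the toy sampling probabilities are assumptions, not the authors' implementation, and variance estimation is omitted.

```python
from collections import defaultdict


def ipw_subtype_odds_ratios(records, reference="control"):
    """Subtype-specific exposure odds ratios from a two-phase sample.

    records: iterable of (outcome, exposed, sampling_prob), where
    sampling_prob is the known phase-two selection probability.
    Assumes every outcome has at least one exposed and one unexposed record.
    """
    cells = defaultdict(float)
    for outcome, exposed, prob in records:
        # Each sampled subject stands in for 1 / prob subjects in phase one.
        cells[(outcome, bool(exposed))] += 1.0 / prob
    subtypes = sorted({outcome for outcome, _ in cells} - {reference})
    ref_odds = cells[(reference, True)] / cells[(reference, False)]
    return {
        s: (cells[(s, True)] / cells[(s, False)]) / ref_odds
        for s in subtypes
    }


# Toy data: controls are heavily undersampled, and subtype B is
# undersampled among the exposed only.
toy = (
    [("control", True, 0.1)] * 2 + [("control", False, 0.1)] * 4
    + [("A", True, 1.0)] * 4 + [("A", False, 1.0)] * 2
    + [("B", True, 0.5)] * 3 + [("B", False, 1.0)] * 6
)
odds_ratios = ipw_subtype_odds_ratios(toy)
```

In the toy data an unweighted analysis would understate the odds ratio for subtype B, because exposed B cases were sampled at half the rate of unexposed ones; the weights undo that distortion.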
ABSTRACT
Interpretation of the function of non-coding risk loci for neuropsychiatric disorders and brain-relevant traits via gene expression and alternative splicing quantitative trait locus (e/sQTL) analyses is generally performed in bulk post-mortem adult tissue. However, genetic risk loci are enriched in regulatory elements active during neocortical differentiation, and regulatory effects of risk variants may be masked by heterogeneity in bulk tissue. Here, we map e/sQTLs, and allele-specific expression in cultured cells representing two major developmental stages, primary human neural progenitors (n = 85) and their sorted neuronal progeny (n = 74), identifying numerous loci not detected in either bulk developing cortical wall or adult cortex. Using colocalization and genetic imputation via transcriptome-wide association, we uncover cell-type-specific regulatory mechanisms underlying risk for brain-relevant traits that are active during neocortical differentiation. Specifically, we identified a progenitor-specific eQTL for CENPW co-localized with common variant associations for cortical surface area and educational attainment.
Subjects
Chromosomal Proteins, Non-Histone/genetics; Gene Expression Regulation, Developmental; Neocortex/metabolism; Neurogenesis/genetics; Neurons/metabolism; Quantitative Trait Loci; Alleles; Alzheimer Disease/diagnosis; Alzheimer Disease/genetics; Alzheimer Disease/metabolism; Cell Differentiation; Chromatin/chemistry; Chromatin/metabolism; Chromosomal Proteins, Non-Histone/metabolism; Chromosome Mapping; Educational Status; Female; Fetus; Genetic Predisposition to Disease; Genome, Human; Genome-Wide Association Study; Humans; Male; Neocortex/cytology; Neocortex/growth & development; Neural Stem Cells/cytology; Neural Stem Cells/metabolism; Neurons/cytology; Neuroticism; Parkinson Disease/diagnosis; Parkinson Disease/genetics; Parkinson Disease/metabolism; Primary Cell Culture; Prognosis; Schizophrenia/diagnosis; Schizophrenia/genetics; Schizophrenia/metabolism; Transcriptome
ABSTRACT
The relative proportion of RNA isoforms expressed for a given gene has been associated with disease states in cancer, retinal diseases, and neurological disorders. Examination of relative isoform proportions can help determine biological mechanisms, but such analyses often require a per-gene investigation of splicing patterns. Leveraging large public data sets produced by genomic consortia as a reference, one can compare splicing patterns in a data set of interest with those of a reference panel in which samples are divided into distinct groups, such as tissue of origin, or disease status. We propose A latent Dirichlet model to Compare expressed isoform proportions TO a Reference panel (ACTOR), a latent Dirichlet model with Dirichlet Multinomial observations to compare expressed isoform proportions in a data set to an independent reference panel. We use a variational Bayes procedure to estimate posterior distributions for the group membership of one or more samples. Using the Genotype-Tissue Expression project as a reference data set, we evaluate ACTOR on simulated and real RNA-seq data sets to determine tissue-type classifications of genes. ACTOR is publicly available as an R package at https://github.com/mccabes292/actor.
Subjects
Bayes Theorem; Humans; Protein Isoforms/genetics; Protein Isoforms/analysis; Protein Isoforms/metabolism; Sequence Analysis, RNA/methods
ABSTRACT
MOTIVATION: Deriving biological insights from genomic data commonly requires comparing attributes of selected genomic loci to a null set of loci. The selection of this null set is non-trivial, as it requires careful consideration of potential covariates, a problem that is exacerbated by the non-uniform distribution of genomic features including genes, enhancers, and transcription factor binding sites. Propensity score-based covariate matching methods allow the selection of null sets from a pool of possible items while controlling for multiple covariates; however, existing packages do not operate on genomic data classes and can be slow for large data sets making them difficult to integrate into genomic workflows. RESULTS: To address this, we developed matchRanges, a propensity score-based covariate matching method for the efficient and convenient generation of matched null ranges from a set of background ranges within the Bioconductor framework. AVAILABILITY AND IMPLEMENTATION: Package: https://bioconductor.org/packages/nullranges, Code: https://github.com/nullranges, Documentation: https://nullranges.github.io/nullranges.
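As a rough illustration of covariate matching on a propensity score, the sketch below performs greedy nearest-neighbor matching without replacement. It assumes the propensity scores have already been estimated (for example by logistic regression of focal-versus-pool membership on the covariates). The function name and toy scores are hypothetical, and this is not the matchRanges API, which operates on Bioconductor range objects in R.

```python
def nearest_neighbor_match(focal_scores, pool_scores):
    """For each focal score, pick the closest unused pool item.

    Returns a list of pool indices, one per focal item, in focal order.
    """
    if len(pool_scores) < len(focal_scores):
        raise ValueError("pool must be at least as large as the focal set")
    available = set(range(len(pool_scores)))
    matches = []
    for score in focal_scores:
        # Ties are broken by the smaller index so the result is deterministic.
        best = min(available, key=lambda j: (abs(pool_scores[j] - score), j))
        matches.append(best)
        available.remove(best)
    return matches


focal = [0.8, 0.2]
pool = [0.1, 0.25, 0.5, 0.79, 0.9]
matched_indices = nearest_neighbor_match(focal, pool)
```

Greedy matching depends on the order of the focal items; rejection sampling or stratified sampling on the score are alternatives when the pool is large.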
Subjects
Genomics; Software; Genomics/methods; Genome; Regulatory Sequences, Nucleic Acid; Research Design
ABSTRACT
MOTIVATION: Enrichment analysis is a widely utilized technique in genomic analysis that aims to determine if there is a statistically significant association between two sets of genomic features. To conduct this type of hypothesis testing, an appropriate null model is typically required. However, the null distribution that is commonly used can be overly simplistic and may result in inaccurate conclusions. RESULTS: bootRanges provides fast functions for generation of block bootstrapped genomic ranges representing the null hypothesis in enrichment analysis. As part of a modular workflow, bootRanges offers greater flexibility for computing various test statistics leveraging other Bioconductor packages. We show that shuffling or permutation schemes may result in overly narrow test statistic null distributions and over-estimation of statistical significance, while creating new range sets with a block bootstrap preserves local genomic correlation structure and generates more reliable null distributions. It can also be used in more complex analyses, such as assessing correlations between cis-regulatory elements (CREs) and genes across cell types or providing optimized thresholds, e.g. log fold change (logFC) from differential analysis. AVAILABILITY AND IMPLEMENTATION: bootRanges is freely available in the R/Bioconductor package nullranges hosted at https://bioconductor.org/packages/nullranges.
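The contrast between shuffling and a block bootstrap can be sketched in a few lines. The example below resamples ranges on a single chromosome by copying whole fixed-length blocks, so ranges that were neighbors within a block stay neighbors. It is a simplified sketch under stated assumptions (one chromosome, unsegmented genome, ranges assigned to blocks by start position) and not the bootRanges implementation; all names are hypothetical.

```python
import random


def block_bootstrap(ranges, chrom_len, block_len, rng):
    """Resample half-open (start, end) ranges by whole blocks.

    Each destination block is filled with the ranges of a randomly
    chosen source block, shifted so within-block positions are kept.
    """
    n_blocks = -(-chrom_len // block_len)  # ceiling division
    resampled = []
    for dest in range(n_blocks):
        src = rng.randrange(n_blocks)
        shift = (dest - src) * block_len
        src_lo, src_hi = src * block_len, (src + 1) * block_len
        for start, end in ranges:
            if src_lo <= start < src_hi and end + shift <= chrom_len:
                resampled.append((start + shift, end + shift))
    return sorted(resampled)


peaks = [(10, 20), (15, 30), (105, 120), (250, 260)]
null_peaks = block_bootstrap(peaks, chrom_len=300, block_len=100,
                             rng=random.Random(0))
```

Because the overlapping pair `(10, 20)` and `(15, 30)` always travels together, clustering of features is preserved in the null sets, which is the property that independent shuffling of each range would destroy.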
Subjects
Genome; Genomics; Genomics/methods; Software
ABSTRACT
SUMMARY: Exclusion regions are sections of reference genomes with abnormal pileups of short sequencing reads. Removing reads overlapping them improves biological signal, and these benefits are most pronounced in differential analysis settings. Several labs created exclusion region sets, available primarily through ENCODE and GitHub. However, the variety of exclusion sets creates uncertainty about which sets to use. Furthermore, gap regions (e.g. centromeres, telomeres, short arms) create additional considerations in generating exclusion sets. We generated exclusion sets for the latest human T2T-CHM13 and mouse GRCm39 genomes and systematically assembled and annotated these and other sets in the excluderanges R/Bioconductor data package, also accessible via the BEDbase.org API. The package provides unified access to 82 GenomicRanges objects covering six organisms, multiple genome assemblies, and types of exclusion regions. For the human hg38 genome assembly, we recommend hg38.Kundaje.GRCh38_unified_blacklist as the most well-curated and annotated, and sets generated by the Blacklist tool for other organisms. AVAILABILITY AND IMPLEMENTATION: https://bioconductor.org/packages/excluderanges/. Package website: https://dozmorovlab.github.io/excluderanges/.
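The filtering step described here, removing reads that overlap exclusion regions, amounts to an interval-overlap test. The sketch below uses plain tuples with half-open coordinates; the function name and the toy coordinates are assumptions, and a real workflow would use an interval library and an exclusion set such as those the package provides.

```python
def drop_excluded(reads, exclusion):
    """Keep reads that overlap no exclusion region.

    reads, exclusion: lists of (chrom, start, end) with half-open
    coordinates, so touching intervals do not count as overlapping.
    """
    by_chrom = {}
    for chrom, start, end in exclusion:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in reads:
        regions = by_chrom.get(chrom, [])
        if not any(start < r_end and r_start < end for r_start, r_end in regions):
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates for illustration only.
exclusion = [("chr1", 200, 300)]
reads = [("chr1", 100, 150), ("chr1", 190, 240), ("chr2", 210, 260)]
clean_reads = drop_excluded(reads, exclusion)
```

The linear scan per read is adequate for a sketch; sorted regions with binary search or an interval tree would be the usual choice at genome scale.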
Subjects
Genome, Human; Software; Animals; Humans; Mice; Uncertainty
ABSTRACT
Traditional predictive models for transcriptome-wide association studies (TWAS) consider only single nucleotide polymorphisms (SNPs) local to genes of interest and perform parameter shrinkage with a regularization process. These approaches ignore the effect of distal-SNPs or other molecular effects underlying the SNP-gene association. Here, we outline multi-omics strategies for transcriptome imputation from germline genetics to allow more powerful testing of gene-trait associations by prioritizing distal-SNPs to the gene of interest. In one extension, we identify mediating biomarkers (CpG sites, microRNAs, and transcription factors) highly associated with gene expression and train predictive models for these mediators using their local SNPs. Imputed values for mediators are then incorporated into the final predictive model of gene expression, along with local SNPs. In the second extension, we assess distal-eQTLs (SNPs associated with genes not in a local window around it) for their mediation effect through mediating biomarkers local to these distal-eSNPs. Distal-eSNPs with large indirect mediation effects are then included in the transcriptomic prediction model with the local SNPs around the gene of interest. Using simulations and real data from ROS/MAP brain tissue and TCGA breast tumors, we show considerable gains of percent variance explained (1-2% additive increase) of gene expression and TWAS power to detect gene-trait associations. This integrative approach to transcriptome-wide imputation and association studies aids in identifying the complex interactions underlying genetic regulation within a tissue and important risk genes for various traits and disorders.
Subjects
Computational Biology/methods; Genome-Wide Association Study/methods; Software; Gene Expression Profiling/methods; Humans; Models, Genetic; Organ Specificity/genetics; Phenotype; Polymorphism, Single Nucleotide; Quantitative Trait Loci; Reproducibility of Results
ABSTRACT
Expression quantitative trait loci (eQTL) studies are used to understand the regulatory function of non-coding genome-wide association study (GWAS) risk loci, but colocalization alone does not demonstrate a causal relationship of gene expression affecting a trait. Evidence for mediation, that perturbation of gene expression in a given tissue or developmental context will induce a change in the downstream GWAS trait, can be provided by two-sample Mendelian Randomization (MR). Here, we introduce a new statistical method, MRLocus, for Bayesian estimation of the gene-to-trait effect from eQTL and GWAS summary data for loci with evidence of allelic heterogeneity, that is, containing multiple causal variants. MRLocus makes use of a colocalization step applied to each nearly-LD-independent eQTL, followed by an MR analysis step across eQTLs. Additionally, our method involves estimation of the extent of allelic heterogeneity through a dispersion parameter, indicating variable mediation effects from each individual eQTL on the downstream trait. Our method is evaluated against other state-of-the-art methods for estimation of the gene-to-trait mediation effect, using an existing simulation framework. In simulation, MRLocus often has the highest accuracy among competing methods, and in each case provides more accurate estimation of uncertainty as assessed through interval coverage. MRLocus is then applied to five candidate causal genes for mediation of particular GWAS traits, where gene-to-trait effects are concordant with those previously reported. We find that MRLocus's estimation of the causal effect across eQTLs within a locus provides useful information for determining how perturbation of gene expression or individual regulatory elements will affect downstream traits. The MRLocus method is implemented as an R package available at https://mikelove.github.io/mrlocus.
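To make the gene-to-trait effect concrete, the sketch below computes a standard inverse-variance weighted (IVW) slope across several eQTLs from summary statistics. This is a deliberately simpler, non-Bayesian stand-in and not the MRLocus model: it has no colocalization step and no dispersion parameter. The function name and the toy effect sizes are assumptions.

```python
def ivw_gene_to_trait_effect(beta_eqtl, beta_gwas, se_gwas):
    """Inverse-variance weighted slope of GWAS effects on eQTL effects.

    Each position describes one approximately LD-independent eQTL:
    its effect on expression, its effect on the trait, and the
    standard error of the trait effect.
    """
    if not (len(beta_eqtl) == len(beta_gwas) == len(se_gwas)):
        raise ValueError("inputs must have the same length")
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_eqtl, beta_gwas, se_gwas))
    denominator = sum(bx * bx / se ** 2
                      for bx, se in zip(beta_eqtl, se_gwas))
    if denominator == 0:
        raise ValueError("eQTL effects are all zero")
    return numerator / denominator


# Toy summary statistics in which trait effects are half the eQTL effects.
slope = ivw_gene_to_trait_effect(
    beta_eqtl=[0.5, 1.0, 1.5],
    beta_gwas=[0.25, 0.5, 0.75],
    se_gwas=[0.1, 0.1, 0.1],
)
```

The slope is the regression through the origin of trait effects on expression effects; scatter of individual eQTLs around that line is what a dispersion parameter, as described in the abstract, would quantify.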
Subjects
Gene Expression Profiling/statistics & numerical data; Genetic Predisposition to Disease; Genome-Wide Association Study/statistics & numerical data; Quantitative Trait Loci/genetics; Computer Simulation; Gene Expression Regulation/genetics; Humans; Linkage Disequilibrium; Mendelian Randomization Analysis; Models, Genetic; Transcriptome/genetics
ABSTRACT
Chromatin accessibility and gene expression in relevant cell contexts can guide identification of regulatory elements and mechanisms at genome-wide association study (GWAS) loci. To identify regulatory elements that display differential activity across adipocyte differentiation, we performed ATAC-seq and RNA-seq in a human cell model of preadipocytes and adipocytes at days 4 and 14 of differentiation. For comparison, we created a consensus map of ATAC-seq peaks in 11 human subcutaneous adipose tissue samples. We identified 58,387 context-dependent chromatin accessibility peaks and 3,090 context-dependent genes between all timepoint comparisons (log2 fold change>1, FDR<5%) with 15,919 adipocyte- and 18,244 preadipocyte-dependent peaks. Adipocyte-dependent peaks showed increased overlap (60.1%) with Roadmap Epigenomics adipocyte nuclei enhancers compared to preadipocyte-dependent peaks (11.5%). We linked context-dependent peaks to genes based on adipocyte promoter capture Hi-C data, overlap with adipose eQTL variants, and context-dependent gene expression. Of 16,167 context-dependent peaks linked to a gene, 5,145 were linked by two or more strategies to 1,670 genes. Among GWAS loci for cardiometabolic traits, adipocyte-dependent peaks, but not preadipocyte-dependent peaks, showed significant enrichment (LD score regression P<0.005) for waist-to-hip ratio and modest enrichment (P < 0.05) for HDL-cholesterol. We identified 659 peaks linked to 503 genes by two or more approaches and overlapping a GWAS signal, suggesting a regulatory mechanism at these loci. To identify variants that may alter chromatin accessibility between timepoints, we identified 582 variants in 454 context-dependent peaks that demonstrated allelic imbalance in accessibility (FDR<5%), of which 55 peaks also overlapped GWAS variants. 
At one GWAS locus for palmitoleic acid, rs603424 was located in an adipocyte-dependent peak linked to SCD and exhibited allelic differences in transcriptional activity in adipocytes (P = 0.003) but not preadipocytes (P = 0.09). These results demonstrate that context-dependent peaks and genes can guide discovery of regulatory variants at GWAS loci and aid identification of regulatory mechanisms.
Subjects
Cell Differentiation/genetics; Chromatin/genetics; Gene Expression/genetics; Quantitative Trait Loci/genetics; Adipocytes/metabolism; Adipose Tissue/metabolism; Alleles; Allelic Imbalance/genetics; Binding Sites/genetics; Cardiovascular Diseases/genetics; Cardiovascular Diseases/metabolism; Chromatin/metabolism; Chromatin Immunoprecipitation Sequencing/methods; Epigenomics/methods; Genetic Techniques; Genome-Wide Association Study/methods; Humans; Metabolic Diseases/genetics; Metabolic Diseases/metabolism; Promoter Regions, Genetic/genetics; Regulatory Sequences, Nucleic Acid/genetics
ABSTRACT
The NanoString RNA counting assay for formalin-fixed paraffin embedded samples is unique in its sensitivity, technical reproducibility and robustness for analysis of clinical and archival samples. While commercial normalization methods are provided by NanoString, they are not optimal for all settings, particularly when samples exhibit strong technical or biological variation or where housekeeping genes have variable performance across the cohort. Here, we develop and evaluate a more comprehensive normalization procedure for NanoString data with steps for quality control, selection of housekeeping targets, normalization and iterative data visualization and biological validation. The approach was evaluated using a large cohort (N = 1649) from the Carolina Breast Cancer Study, two cohorts of moderate sample size (N = 359 and 130) and a small published dataset (N = 12). The iterative process developed here eliminates technical variation (e.g. from different study phases or sites) more reliably than the three other methods, including NanoString's commercial package, without diminishing biological variation, especially in long-term longitudinal multiphase or multisite cohorts. We also find that probe sets validated for nCounter, such as the PAM50 gene signature, are impervious to batch issues. This work emphasizes that systematic quality control, normalization and visualization of NanoString nCounter data are an imperative component of study design that influences results in downstream analyses.