ABSTRACT
BACKGROUND: The genetic background of cancer remains complex and challenging to integrate. Many somatic mutations within genes are known to cause and drive cancer, while genome-wide association studies (GWAS) of cancer have revealed many germline risk factors associated with cancer. However, the overlap between known somatic driver genes and positional candidate genes from GWAS loci is surprisingly small. We hypothesised that genes from multiple independent cancer GWAS loci should show tissue-specific co-regulation patterns that converge on cancer-specific driver genes. RESULTS: We studied recent well-powered GWAS of breast, prostate, colorectal and skin cancer by estimating co-expression between genes and subsequently prioritising genes that show significant co-expression with genes mapping within susceptibility loci from cancer GWAS. We observed that the prioritised genes were strongly enriched for cancer drivers defined by COSMIC, IntOGen and Dietlein et al. The enrichment of known cancer driver genes was most significant when using co-expression networks derived from non-cancer samples of the relevant tissue of origin. CONCLUSION: We show how genes within risk loci identified by cancer GWAS can be linked to known cancer driver genes through tissue-specific co-expression networks. This provides an important explanation for why seemingly unrelated sets of genes that harbour either germline risk factors or somatic mutations can eventually cause the same type of disease.
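The enrichment of prioritised genes for known cancer drivers can be illustrated with a one-sided overlap test. The abstract does not say which statistical test was used, so the hypergeometric test below is an assumption, and the function name and toy numbers are hypothetical. This is a minimal sketch, not the authors' method.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_drivers, n_prioritised, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    Models drawing n_prioritised genes without replacement from
    n_background genes, of which n_drivers are known cancer drivers.
    (Assumed test; the abstract does not specify one.)
    """
    total = comb(n_background, n_prioritised)
    upper = min(n_drivers, n_prioritised)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(n_drivers, k) * comb(n_background - n_drivers, n_prioritised - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background genes, 5 drivers,
# 4 prioritised genes, 3 of which are drivers.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(p_value)
```

A small p-value here would mean the prioritised set contains more known drivers than expected by chance under the chosen background.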
Subjects
Gene Regulatory Networks; Genetic Predisposition to Disease; Genome-Wide Association Study; Neoplasms; Humans; Neoplasms/genetics; Organ Specificity/genetics; Gene Expression Regulation, Neoplastic; Genetic Loci
ABSTRACT
Background: Gene co-expression networks (GCNs) describe relationships among expressed genes key to maintaining cellular identity and homeostasis. However, the sample size of typical RNA-seq experiments, which is several orders of magnitude smaller than the number of genes, is too low to infer GCNs reliably. recount3, a publicly available dataset comprising 316,443 uniformly processed human RNA-seq samples, provides an opportunity to improve power for accurate network reconstruction and to obtain biological insight from the resulting networks. Results: We compared alternative aggregation strategies to identify an optimal workflow for GCN inference by data aggregation and inferred three consensus networks (a universal network, a non-cancer network, and a cancer network) in addition to 27 tissue context-specific networks. Central network genes from our consensus networks were enriched for evolutionarily constrained genes and ubiquitous biological pathways, whereas central context-specific network genes included tissue-specific transcription factors, and factorization based on the hubs led to clustering of related tissue contexts. We discovered that annotations corresponding to context-specific networks inferred from aggregated data were enriched for trait heritability beyond known functional genomic annotations and were significantly more enriched when we aggregated over a larger number of samples. Conclusion: This study outlines best practices for GCN inference and evaluation by data aggregation. We recommend estimating and regressing out confounders in each data set before aggregation and prioritizing studies with large sample sizes for GCN reconstruction. Increased statistical power in inferring context-specific networks enabled the derivation of variant annotations that were enriched for concordant trait heritability independent of context-agnostic functional genomic annotations.
While we observed strictly increasing held-out log-likelihood with data aggregation, we noted diminishing marginal improvements. Future work on alternative methods for estimating confounders and on integrating orthogonal information from modalities such as Hi-C and ChIP-seq can further improve GCN inference.
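The recommendation to regress out confounders in each data set before aggregation can be sketched as follows. The abstract does not specify how confounders are estimated or how studies are weighted. This example therefore assumes a single known confounder per study and a sample-size-weighted average of per-study covariances. All function names are hypothetical and the code is illustrative only.

```python
def residualise(values, confounder):
    """Centre values and remove the least-squares fit on one confounder."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(confounder) / n
    var_c = sum((c - mean_c) ** 2 for c in confounder)
    if var_c == 0:
        # A constant confounder explains nothing; only centre the values.
        return [v - mean_v for v in values]
    slope = sum(
        (c - mean_c) * (v - mean_v) for c, v in zip(confounder, values)
    ) / var_c
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(confounder, values)]


def residual_covariance(gene_a, gene_b, confounder):
    """Covariance of two genes within one study after removing the confounder.

    Uses n - 1 in the denominator for simplicity (an assumption of this sketch).
    """
    res_a = residualise(gene_a, confounder)
    res_b = residualise(gene_b, confounder)
    return sum(a * b for a, b in zip(res_a, res_b)) / (len(res_a) - 1)


def aggregate_covariance(studies):
    """Sample-size-weighted mean of per-study residual covariances.

    studies: list of (gene_a, gene_b, confounder) tuples, one per study,
    where each element is a list with one value per sample.
    """
    total_samples = sum(len(gene_a) for gene_a, _, _ in studies)
    weighted = sum(
        len(gene_a) * residual_covariance(gene_a, gene_b, confounder)
        for gene_a, gene_b, confounder in studies
    )
    return weighted / total_samples


# Study 1: both genes are driven entirely by the confounder.
study_1 = ([2.0, 4.0, 6.0, 8.0], [3.0, 6.0, 9.0, 12.0], [1.0, 2.0, 3.0, 4.0])
# Study 2: the confounder is constant, so the genes' own covariance remains.
study_2 = ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(aggregate_covariance([study_1, study_2]))
```

In the toy data, the apparent co-expression in the first study disappears once the confounder is removed, so only the second study contributes to the aggregated estimate.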