ABSTRACT
BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams to identify causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants.
Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria is needed.
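The maximum F-measure and the top-5 recall mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the challenge's official scoring code: the function names, the input layout (a list of `(variant_id, epcr)` pairs) and the toy variant identifiers are assumptions made for illustration, and the rank-weighted score is left out because the abstract does not give its formula.

```python
def max_f_measure(predictions, causal):
    """Best F-measure over all EPCR thresholds.

    predictions: list of (variant_id, epcr) pairs (assumed layout).
    causal: set of variant ids known to be causal.
    """
    best = 0.0
    # Every distinct EPCR value is tried as a cut-off.
    for threshold in sorted({epcr for _, epcr in predictions}, reverse=True):
        called = {v for v, epcr in predictions if epcr >= threshold}
        true_pos = len(called & causal)
        if true_pos == 0:
            continue  # F-measure is 0 when nothing causal is called
        precision = true_pos / len(called)
        recall = true_pos / len(causal)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def top_k_hit(predictions, causal, k=5):
    """True if any causal variant is among the k highest-EPCR predictions."""
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    return any(variant in causal for variant, _ in ranked[:k])


# Toy data: identifiers and probabilities are invented placeholders.
toy_predictions = [("v1", 0.9), ("v2", 0.8), ("v3", 0.4), ("v4", 0.1)]
toy_causal = {"v1", "v3"}
print(max_f_measure(toy_predictions, toy_causal))
print(top_k_hit(toy_predictions, toy_causal, k=1))
```

In the toy data the best trade-off is reached at the 0.4 cut-off, where both causal variants are recovered at the cost of one false positive.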
Subjects
Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Human Genome/genetics , Genetic Variation/genetics , Computational Biology/methods , Phenotype
ABSTRACT
BACKGROUND: Prostate cancer (PCa) is the most commonly diagnosed malignancy and the second leading cause of cancer-related deaths among men in the United States. High-throughput genotyping has enabled discovery of germline genetic susceptibility variants (herein referred to as germline mutations) associated with an increased risk of developing PCa. However, germline mutation information has not been leveraged and integrated with information on acquired somatic mutations to link genetic susceptibility to tumorigenesis. The objective of this exploratory study was to address this knowledge gap. METHODS: Germline mutations and associated gene information were derived from genome-wide association studies (GWAS) reports. Somatic mutation and gene expression data were derived from 495 tumors and 52 normal control samples obtained from The Cancer Genome Atlas (TCGA). We integrated germline and somatic mutation information using gene expression data. We performed enrichment analysis to discover molecular networks and biological pathways enriched for germline and somatic mutations. RESULTS: We discovered a signature of 124 genes containing both germline and somatic mutations. Enrichment analysis revealed molecular networks and biological pathways enriched for germline and somatic mutations, including the PDGF, P53, MYC, IGF-1, PTEN and Androgen receptor signaling pathways. CONCLUSION: Integrative genomic analysis links genetic susceptibility to tumorigenesis in PCa and establishes putative functional bridges between the germline and somatic variation, and the biological pathways they control.
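The two computational steps named in this abstract, finding genes that carry both germline and somatic mutations and testing pathways for enrichment, can be sketched as follows. The abstract does not say which enrichment statistic or software was used, so the one-sided hypergeometric test here is a stand-in chosen for illustration. The gene labels, set sizes and function names are hypothetical, and the sketch skips the gene expression step that the study used to link the two mutation types.

```python
from math import comb


def shared_genes(germline_genes, somatic_genes):
    """Genes that appear in both the germline and the somatic list."""
    return set(germline_genes) & set(somatic_genes)


def enrichment_p_value(overlap, background, pathway_size, signature_size):
    """One-sided hypergeometric P(X >= overlap).

    background: number of genes in the assumed gene universe.
    pathway_size: number of background genes annotated to the pathway.
    signature_size: number of genes in the signature being tested.
    """
    upper = min(pathway_size, signature_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background - pathway_size, signature_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, signature_size)


# Toy example: gene labels are placeholders, not genes from the study.
germline = {"G1", "G2", "G3", "G4", "G5"}
somatic = {"G2", "G3", "G4", "G5", "G6"}
signature = shared_genes(germline, somatic)        # G2..G5
pathway = {"G2", "G3", "G4", "G7", "G8"}           # hypothetical pathway
overlap = len(signature & pathway)                 # 3 genes in common
p_value = enrichment_p_value(overlap, 20, len(pathway), len(signature))
print(sorted(signature), overlap, round(p_value, 4))
```

With a real gene universe of many thousands of genes, a library routine such as a Fisher's exact test with multiple-testing correction would be the more usual choice; the closed-form sum is shown only to make the arithmetic visible.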
Subjects
Tumor Biomarkers/genetics , Gene Regulatory Networks , Genomics/methods , Mutation , Prostatic Neoplasms/genetics , Gene Expression , Neoplastic Gene Expression Regulation , Genetic Predisposition to Disease , Genome-Wide Association Study , Germ-Line Mutation , Humans , Insulin-Like Growth Factor I/genetics , Male , PTEN Phosphohydrolase/genetics , Proto-Oncogene Proteins c-myc/genetics , Androgen Receptors/genetics , Tumor Suppressor Protein p53/genetics
ABSTRACT
Background: A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting. Methods: Predictors were provided a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on rank position of true positive causal variants and maximum F-measure, based on precision and recall of causal variants across EPCR thresholds. Results: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. 
In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency. Conclusions: By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.
ABSTRACT
Developing an accurate and interpretable model to predict an individual's risk for Coronavirus Disease 2019 (COVID-19) is a critical step to efficiently triage testing and other scarce preventative resources. To aid in this effort, we have developed an interpretable risk calculator that utilized de-identified electronic health records (EHR) from the University of Alabama at Birmingham Informatics for Integrating Biology and the Bedside (UAB-i2b2) COVID-19 repository under the U-BRITE framework. The generated risk scores are analogous to commonly used credit scores where higher scores indicate higher risks for COVID-19 infection. By design, these risk scores can easily be calculated in spreadsheets or even with pen and paper. To predict risk, we implemented a Credit Scorecard modeling approach on longitudinal EHR data from 7,262 patients enrolled in the UAB Health System who were evaluated and/or tested for COVID-19 between January and June 2020. In this cohort, 912 patients were positive for COVID-19. Our workflow considered the timing of symptoms and medical conditions and tested the effects of applying different variable selection techniques such as LASSO and Elastic-Net. Within the two weeks before a COVID-19 diagnosis, the most predictive features were respiratory symptoms such as cough, abnormalities of breathing, and pain in the throat and chest, as well as chronic conditions including nicotine dependence and major depressive disorder. When extending the timeframe to include all medical conditions across all time, our models also uncovered several chronic conditions impacting the respiratory, cardiovascular, central nervous and urinary organ systems. The whole pipeline of data processing, risk modeling and web-based risk calculator can be applied to any EHR data following the OMOP common data format. The results can be employed to generate questionnaires to estimate COVID-19 risk for screening in building entries or to optimize hospital resources.
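The pen-and-paper property of a scorecard comes from turning model weights into integer points that are simply added up. The sketch below shows one common way to do this, the "points to double the odds" scaling, and is not the authors' pipeline. Every number in it is an invented placeholder rather than a fitted value from the study, and the names `PDO`, `BASE_POINTS`, `WEIGHTS`, `to_points` and `risk_score` are assumptions made for illustration. Unlike a conventional credit score, a higher total here means higher risk, matching the abstract.

```python
import math

# Assumed scaling: every PDO points doubles the odds of a positive test.
PDO = 20
FACTOR = PDO / math.log(2)

# Assumed starting score for a patient with none of the listed features.
BASE_POINTS = 100

# Placeholder log-odds weights, invented for illustration only.
# They are NOT coefficients reported by the study.
WEIGHTS = {
    "cough": 0.9,
    "chest_pain": 0.5,
    "nicotine_dependence": 0.3,
}


def to_points(log_odds_weight):
    """Convert a log-odds weight into whole scorecard points."""
    return round(FACTOR * log_odds_weight)


def risk_score(present_features):
    """Add the points of every feature recorded for a patient."""
    return BASE_POINTS + sum(to_points(WEIGHTS[name]) for name in present_features)


for name in WEIGHTS:
    print(name, to_points(WEIGHTS[name]))
print(risk_score(["cough", "nicotine_dependence"]))
```

Because each feature maps to a fixed integer, the same table of points can be copied into a spreadsheet or a printed questionnaire and summed by hand.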
ABSTRACT
BACKGROUND: The recent surge of next generation sequencing of breast cancer genomes has enabled development of comprehensive catalogues of somatic mutations and expanded the molecular classification of subtypes of breast cancer. However, somatic mutations and gene expression data have not been leveraged and integrated with epigenomic data to unravel the genomic-epigenomic interaction landscape of triple negative breast cancer (TNBC) and non-triple negative breast cancer (non-TNBC). METHODS: We performed integrative data analysis combining somatic mutation, epigenomic and gene expression data from The Cancer Genome Atlas (TCGA) to unravel the possible oncogenic interactions between genomic and epigenomic variation in TNBC and non-TNBC. We hypothesized that within breast cancers, there are differences in somatic mutation, DNA methylation and gene expression signatures between TNBC and non-TNBC. We further hypothesized that genomic and epigenomic alterations affect gene regulatory networks and signaling pathways driving the two types of breast cancer. RESULTS: The investigation revealed somatic mutation, epigenomic and gene expression signatures unique to TNBC and non-TNBC and signatures distinguishing the two types of breast cancer. In addition, the investigation revealed molecular networks and signaling pathways enriched for somatic mutations and epigenomic changes unique to each type of breast cancer. The most significant pathways for TNBC were: retinal biosynthesis, BAG2, LXR/RXR, EIF2 and P2Y purinergic receptor signaling pathways. The most significant pathways for non-TNBC were: UVB-induced MAPK, PCP, Apelin endothelial, Endoplasmic reticulum stress and mechanisms of viral exit from host signaling pathways. CONCLUSION: The investigation revealed integrated genomic, epigenomic and gene expression signatures and signaling pathways unique to TNBC and non-TNBC, and a gene signature distinguishing the two types of breast cancer.
The study demonstrates that integrative analysis of multi-omics data is a powerful approach for unravelling the genomic-epigenomic interaction landscape in TNBC and non-TNBC.
ABSTRACT
Recent advances in high-throughput genotyping and the recent surge of next generation sequencing of cancer genomes have enabled discovery of germline mutations associated with an increased risk of developing breast cancer and acquired somatic mutations driving the disease. Emerging evidence indicates that germline mutations may interact with somatic mutations to drive carcinogenesis. However, the possible oncogenic interactions and cooperation between germline and somatic alterations in triple-negative breast cancer (TNBC) have not been characterized. The objective of this study was to investigate the possible oncogenic interactions and cooperation between genes containing germline and somatic mutations in TNBC. Our working hypothesis was that genes containing germline mutations associated with an increased risk of developing breast cancer also harbor somatic mutations acquired during tumorigenesis, and that these genes are functionally related. We further hypothesized that TNBC originates from a complex interplay among and between genes containing germline and somatic mutations, and that this complex array of interacting genetic factors affects entire molecular networks and biological pathways, which in turn drive the disease. We tested this hypothesis by integrating germline mutation information from genome-wide association studies (GWAS) with somatic mutation information on TNBC from The Cancer Genome Atlas (TCGA) using gene expression data from 110 patients with TNBC and 113 controls. We discovered a signature of 237 functionally related genes containing both germline and somatic mutations. We discovered molecular networks and biological pathways enriched for germline and somatic mutations. The top pathways included the hereditary breast cancer and role of BRCA1 in DNA damage response signaling pathways.
In conclusion, this is the first large-scale and comprehensive analysis delineating possible oncogenic interactions and cooperation among and between genes containing germline and somatic mutations in TNBC. Genetic and somatic mutations, along with the genes discovered in this study, will require experimental functional validation in different ethnic populations. Functionally validated genetic and somatic variants will have important implications for the development of novel precision prevention strategies and discovery of prognostic markers in TNBC.
Subjects
Biomarkers/analysis , Neoplastic Gene Expression Regulation , Genome-Wide Association Study , Germ-Line Mutation , Transcriptome , Triple Negative Breast Neoplasms/genetics , Female , Humans
ABSTRACT
Prostate cancer (PCa) is the most commonly diagnosed malignancy and the second leading cause of cancer-related deaths among men in the USA. Advances in high-throughput genotyping and next generation sequencing technologies have enabled discovery of germline genetic susceptibility variants and somatic mutations acquired during tumor formation. Emerging evidence indicates that germline variations may interact with somatic events in carcinogenesis. However, the possible oncogenic interactions and cooperation between germline and somatic variation and their role in aggressive PCa remain largely unexplored. Here we investigated the possible oncogenic interactions and cooperation between genes containing germline variation from genome-wide association studies (GWAS) and genes containing somatic mutations from tumor genomes of 305 men with aggressive tumors and 52 control samples from The Cancer Genome Atlas (TCGA). Network and pathway analyses were performed to identify molecular networks and biological pathways enriched for germline and somatic mutations. The analysis revealed 90 functionally related genes containing both germline and somatic mutations. Transcriptome analysis revealed a 61-gene signature containing both germline and somatic mutations. Network analysis revealed molecular networks of functionally related genes and biological pathways including P53, STAT3, NKX3-1, KLK3, and Androgen receptor signaling pathways enriched for germline and somatic mutations. The results show that integrative analysis is a powerful approach to uncovering the possible oncogenic interactions and cooperation between germline and somatic mutations and understanding the broader biological context in which they operate in aggressive PCa.
ABSTRACT
BACKGROUND: A majority of prostate cancers (PCas) are indolent and cause no harm even without treatment. However, a significant proportion of patients with PCa have aggressive tumors that progress rapidly to metastatic disease and are often lethal. PCa develops through somatic mutagenesis, but emerging evidence suggests that germline genetic variation can markedly contribute to tumorigenesis. However, the causal association between genetic susceptibility and tumorigenesis has not been well characterized. The objective of this study was to map the germline and somatic mutation interaction landscape in indolent and aggressive tumors and to discover signatures of mutated genes associated with each type and distinguishing the two types of PCa. MATERIALS AND METHODS: We integrated germline mutation information from genome-wide association studies (GWAS) with somatic mutation information from The Cancer Genome Atlas (TCGA) using gene expression data from TCGA on indolent and aggressive PCas as the intermediate phenotypes. Germline and somatic mutated genes associated with each type of PCa were functionally characterized using network and pathway analysis. RESULTS: We discovered gene signatures containing germline and somatic mutations associated with each type and distinguishing the two types of PCa. We discovered multiple gene regulatory networks and signaling pathways enriched with germline and somatic mutations, including the axon guidance, RAR, WNT, MSP-RON, STAT3, PI3K, TR/RXR, molecular mechanisms of cancer, NF-kB, prostate cancer, GP6, androgen, and VEGF signaling pathways for indolent PCa, and the MSP-RON, axon guidance, RAR, adipogenesis, molecular mechanisms of cancer, and NF-kB signaling pathways for aggressive PCa. CONCLUSION: The investigation revealed germline and somatic mutated genes associated with indolent and aggressive PCas and distinguishing the two types of PCa.
The study revealed multiple gene regulatory networks and signaling pathways dysregulated by germline and somatic alterations. Integrative analysis combining germline and somatic mutations is a powerful approach to mapping the germline and somatic mutation interaction landscape.
ABSTRACT
Triple-negative breast cancer (TNBC) is the most aggressive form of breast cancer. Emerging evidence suggests that both genetic and epigenetic factors play a role in the pathogenesis of TNBC. However, oncogenic interactions and cooperation between genomic and epigenomic variation have not been characterized. The objective of this study was to deconvolute the genomic and epigenomic interaction landscape in TNBC using an integrative genomics approach, which integrates information on germline, somatic, epigenomic and gene expression variation. We hypothesized that TNBC originates from a complex interplay between genomic (both germline and somatic variation) and epigenomic variation. We further hypothesized that these complex arrays of interacting genomic and epigenomic factors affect entire molecular networks and signaling pathways which, in turn, drive TNBC. We addressed these hypotheses using germline variation from genome-wide association studies and somatic, epigenomic and gene expression variation from The Cancer Genome Atlas (TCGA). The investigation revealed signatures of functionally related genes containing germline, somatic and epigenetic variations. DNA methylation had an effect on gene expression. Network and pathway analysis revealed molecular networks and signaling pathways enriched for germline, somatic and epigenomic variation, among them: Role of BRCA1 in DNA Damage Response, Hereditary Breast Cancer Signaling, Molecular Mechanisms of Cancer, Estrogen-Dependent Breast Cancer, p53, MYC Mediated Apoptosis, and PTEN Signaling pathways. The investigation revealed that integrative genomics is a powerful approach for deconvoluting the genomic-epigenomic interaction landscape in TNBC. Further studies are needed to understand the biological mechanisms underlying oncogenic interactions between genomic and epigenomic factors in TNBC.