ABSTRACT
Multi-gene assays have been widely used to predict recurrence risk for patients with hormone receptor (HR)-positive breast cancer. However, these assays lack explanatory power regarding the mechanisms underlying recurrence risk. To address this limitation, we proposed a novel multi-layered knowledge graph neural network for multi-gene assays. Our model elucidated the regulatory pathways of assay genes and used an attention-based graph neural network to predict recurrence risk while interpreting the transcriptional subpathways relevant to risk prediction. Evaluation on three multi-gene assays (Oncotype DX, Prosigna, and EndoPredict) using the SCAN-B dataset demonstrated the efficacy of our method. Through interpretation of attention weights, we found that all three assays are mainly regulated by signaling pathways that drive cancer proliferation, especially RTK-ERK-ETS-mediated cell proliferation, in breast cancer recurrence. In addition, our analysis showed that the important regulatory subpathways remain consistent across the different knowledgebases used to construct the multi-level knowledge graph. Furthermore, through attention analysis, we demonstrated the biological significance and clinical relevance of these subpathways in predicting patient outcomes. The source code is available at http://biohealth.snu.ac.kr/software/ExplainableMLKGNN.
ABSTRACT
Computational drug repurposing aims to identify new indications for existing drugs by utilizing high-throughput data, often in the form of biomedical knowledge graphs. However, learning on biomedical knowledge graphs can be challenging because genes dominate the graph while drug and disease entities are few, resulting in less effective representations. To overcome this challenge, we propose a "semantic multi-layer guilt-by-association" approach that applies the principle of guilt-by-association ("similar genes share similar functions") at the drug-gene-disease level. Using this approach, our model, DREAMwalk (Drug Repurposing through Exploring Associations using Multi-layer random walk), uses a semantic information-guided random walk to generate drug- and disease-populated node sequences, allowing effective mapping of both drugs and diseases in a unified embedding space. Compared to state-of-the-art link prediction models, our approach improves drug-disease association prediction accuracy by up to 16.8%. Moreover, exploration of the embedding space reveals that biological and semantic contexts are well aligned. We demonstrate the effectiveness of our approach through repurposing case studies for breast carcinoma and Alzheimer's disease, highlighting the potential of the multi-layer guilt-by-association perspective for drug repurposing on biomedical knowledge graphs.
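The idea of a semantic information-guided random walk can be illustrated with a minimal sketch. It is not the DREAMwalk implementation: the toy graph, the `similar` lookup of semantically related drugs and diseases, the function name `multilayer_walk`, and the fixed `teleport_prob` are all assumptions made for illustration. At each step the walker either follows a network edge or jumps to a semantically similar node of the same type, which tends to raise the share of drug and disease nodes in the generated sequences.

```python
import random

def multilayer_walk(network, similar, start, length, teleport_prob, rng):
    """Generate one node sequence of at most `length` nodes.

    At each step the walker jumps to a semantically similar node
    (hypothetical `similar` lookup) with probability `teleport_prob`,
    and otherwise follows a random edge of the toy `network`.
    """
    walk = [start]
    node = start
    while len(walk) < length:
        peers = similar.get(node, [])
        neighbors = network.get(node, [])
        if peers and rng.random() < teleport_prob:
            node = rng.choice(peers)        # semantic layer: same-type jump
        elif neighbors:
            node = rng.choice(neighbors)    # network layer: follow an edge
        else:
            break                           # dead end: stop the walk early
        walk.append(node)
    return walk

# Toy drug-gene-disease graph (illustrative names only).
network = {
    "drugA": ["gene1"],
    "drugB": ["gene2"],
    "gene1": ["drugA", "gene2", "diseaseX"],
    "gene2": ["drugB", "gene1", "diseaseY"],
    "diseaseX": ["gene1"],
    "diseaseY": ["gene2"],
}
# Assumed semantic similarity between drugs and between diseases.
similar = {
    "drugA": ["drugB"], "drugB": ["drugA"],
    "diseaseX": ["diseaseY"], "diseaseY": ["diseaseX"],
}

rng = random.Random(0)  # fixed seed so runs are repeatable
walks = [multilayer_walk(network, similar, node, 8, 0.5, rng)
         for node in network for _ in range(5)]
```

The resulting `walks` could then be passed to any skip-gram style embedding model so that drugs and diseases share one embedding space; that step is omitted here.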
Subjects
Drug Repositioning, Automated Pattern Recognition, Learning
ABSTRACT
Patient stratification is a clinically important task because it allows us to establish and develop efficient treatment strategies for particular groups of patients. Molecular subtypes have been successfully defined using transcriptomic profiles, and they are used effectively in clinical practice, e.g., the PAM50 subtypes of breast cancer. Survival prediction has contributed to understanding diseases and to identifying genes related to prognosis. It is desirable to stratify patients considering these two aspects simultaneously. However, there are no methods for patient stratification that consider molecular subtypes and survival outcomes at once. Here, we propose a methodology to address this problem. A genetic algorithm is used to select a gene set from transcriptome data, and the expression levels of these genes are used to assign a risk score to each patient. The patients are ordered and stratified according to the score. A gene set was selected by our method on a breast cancer cohort (TCGA-BRCA), and we examined its clinical utility using an independent cohort (SCAN-B). In this experiment, our method successfully stratified patients with respect to both molecular subtype and survival outcome. We demonstrated that the ordering of patients was consistent across repeated experiments, and prognostic genes were successfully nominated. Additionally, we observed that the risk score can be used to evaluate the molecular aggressiveness of individual patients.
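The scoring and ordering step can be sketched as follows. This is an illustrative assumption, not the published scoring function: the gene set is taken as given (for example, the output of the genetic algorithm), the risk score is simply the mean expression over that set, and patients are split into equal-sized groups by rank. All names and values are hypothetical.

```python
def risk_scores(expression, gene_set):
    """Score each patient by mean expression over the selected genes.

    expression: {patient_id: {gene: expression_value}}
    gene_set:   genes assumed to have been selected beforehand
    """
    return {
        patient: sum(profile[g] for g in gene_set) / len(gene_set)
        for patient, profile in expression.items()
    }

def stratify(scores, n_groups):
    """Order patients by ascending score and assign rank-based groups."""
    ordered = sorted(scores, key=scores.get)
    groups = {
        patient: rank * n_groups // len(ordered)
        for rank, patient in enumerate(ordered)
    }
    return ordered, groups

# Toy expression table (illustrative values only).
expression = {
    "P1": {"G1": 1.0, "G2": 2.0, "G3": 9.0},
    "P2": {"G1": 5.0, "G2": 6.0, "G3": 1.0},
    "P3": {"G1": 3.0, "G2": 3.0, "G3": 4.0},
    "P4": {"G1": 0.5, "G2": 0.5, "G3": 7.0},
    "P5": {"G1": 8.0, "G2": 9.0, "G3": 2.0},
    "P6": {"G1": 4.0, "G2": 5.0, "G3": 3.0},
}
gene_set = ["G1", "G2"]  # hypothetical selected gene set

scores = risk_scores(expression, gene_set)
ordered, groups = stratify(scores, n_groups=3)
```

In a full pipeline, a genetic algorithm would search over candidate gene sets and keep the one whose ordering best separates subtypes and survival outcomes; the fitness function for that search is not specified in the abstract and is left out here.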
ABSTRACT
Cervical lymph node metastasis is the leading cause of poor prognosis in oral tongue squamous cell carcinoma and also occurs in the early stages. The current clinical diagnosis depends on a physical examination, which is insufficient to determine whether micrometastasis remains. Transcriptome profiling has shown great potential for predicting micrometastasis by capturing the dynamic activation state of genes. However, there are several technical challenges in using transcriptome data to model patient conditions: (1) an insufficient number of samples compared to the number of genes, (2) complex dependence among the genes that govern the cancer phenotype, and (3) heterogeneity among patients across cohorts that differ geographically and racially. We developed a computational framework that learns a subnetwork representation of the transcriptome to discover network biomarkers and determine the potential for metastasis in early oral tongue squamous cell carcinoma. Our method achieved high accuracy in predicting the potential for metastasis in two geographically and racially different groups of patients. The robustness of the model and the reproducibility of the discovered network biomarkers show great potential as a tool to diagnose lymph node metastasis in early oral cancer.
Subjects
Tumor Biomarkers/biosynthesis, Squamous Cell Carcinoma/metabolism, Nucleic Acid Databases, Neoplastic Gene Expression Regulation, Biological Models, Mouth Neoplasms/metabolism, Transcriptome, Adult, Aged, Squamous Cell Carcinoma/pathology, Female, Humans, Lymphatic Metastasis, Male, Middle Aged, Mouth Neoplasms/pathology
ABSTRACT
Pharmacogenomics is the study of how genes affect a person's response to drugs. Thus, understanding the effect of a drug at the molecular level can be helpful in both drug discovery and personalized medicine. Over the years, transcriptome data upon drug treatment have been collected, and several databases have compiled either pre-treatment multi-omics data of cancer cells together with drug sensitivity measures (IC50, AUC) or time-series transcriptomic data after drug treatment. However, analyzing transcriptome data upon drug treatment is challenging because more than 20,000 genes interact in complex ways. In addition, because both time-series analysis and multi-omics integration are difficult, current methods can hardly analyze databases with different data characteristics. One effective way is to interpret transcriptome data in terms of well-characterized biological pathways. Another way is to leverage state-of-the-art methods for multi-omics data integration. In this paper, we developed Drug Response analysis Integrating Multi-omics and time-series data (DRIM), an integrative multi-omics and time-series data analysis framework that identifies perturbed sub-pathways and regulation mechanisms upon drug treatment. The system takes as input either a drug name and cell line identification numbers or the user's control/treated time-series gene expression data. Then, analysis of multi-omics data upon drug treatment is performed from two perspectives. For the multi-omics perspective, IC50-related multi-omics potential mediator genes are determined by embedding multi-omics data into a gene-centric vector space using a tensor decomposition method and an autoencoder deep learning model. Then, perturbed pathway analysis of the potential mediator genes is performed. For the time-series perspective, time-varying perturbed sub-pathways upon drug treatment are constructed. Additionally, a network involving transcription factors (TFs), multi-omics potential mediator genes, and perturbed sub-pathways is constructed, and paths from TFs to perturbed pathways are determined by an influence maximization method. To demonstrate the utility of our system, we provide analysis results of sub-pathway regulatory mechanisms in breast cancer cell lines with different drug sensitivities. DRIM is available at: http://biohealth.snu.ac.kr/software/DRIM/.
ABSTRACT
MOTIVATION: Biological pathways are important curated knowledge of biological processes. Thus, cancer subtype classification based on pathways will be very useful for understanding differences in biological mechanisms among cancer subtypes. However, pathways cover only a fraction of the entire gene set (only one-third of human genes in KEGG), and pathways are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification. RESULTS: We present an explainable deep-learning model with an attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention-based ensemble model combines several hundred pathways in an explainable manner. Lastly, network propagation on a pathway-gene network explains why gene expression profiles differ among subtypes. In experiments with five TCGA cancer datasets, our method achieved very good classification accuracies and, additionally, identified subtype-specific pathways and biological functions. AVAILABILITY AND IMPLEMENTATION: The source code is available at http://biohealth.snu.ac.kr/software/GCN_MAE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
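Network propagation can take several forms; the sketch below uses random walk with restart, a common variant, purely as an assumption, since the abstract does not say which formulation the authors use. The update is p_next = restart * p0 + (1 - restart) * W * p, where W spreads each node's score evenly over its neighbors. The toy graph, the function name `propagate`, and the parameter values are hypothetical.

```python
def propagate(adjacency, seeds, restart=0.5, n_iter=100, tol=1e-9):
    """Random walk with restart on an undirected graph.

    adjacency: {node: [neighbor, ...]}; every node is assumed to have
               at least one neighbor so that no score mass is lost.
    seeds:     nodes where the walk restarts (for example, top pathways).
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Spread the non-restarting share of u's score to its neighbors.
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy chain graph A - B - C - D with the walk restarting at A.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rwr = propagate(adjacency, seeds={"A"})
```

In this toy chain the propagated score decreases with distance from the seed, which is the behavior that lets propagation rank genes by proximity to the pathways of interest.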
Subjects
Neoplasms, Software, Attention, Humans, Neoplasms/genetics, Transcriptome
ABSTRACT
MOTIVATION: Biological pathways are extensively used in the analysis of transcriptome data to characterize the biological mechanisms underlying various phenotypes. There are a number of computational tools that summarize transcriptome data at the pathway level. However, there is no comparative study of how well these tools produce useful information at the cohort level, which would enable comparison of many samples or patients. RESULTS: In this study, we systematically compared and evaluated 13 pathway activity inference tools based on five comparison criteria using a pan-cancer data set. This study makes two major contributions. First, it provides a comprehensive survey of the computational techniques used by existing pathway activity inference tools. The tools use different strategies and place different requirements on the data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations, and scoring metric. Second, we performed extensive evaluations of the performance of these tools. Because different tools use different methods to map samples to the pathway dimension, the tools are evaluated at the pathway level using the five comparison criteria. We first measured how well each tool preserves the characteristics of the original gene expression values, and then investigated robustness by adding noise to the gene expression data. Classification tasks on three clinical variables (tumor versus normal, survival, and cancer subtypes) were performed to evaluate the utility of the tools for clinical applications. In addition, the inferred activity values were compared between the tools to see how similar they are, along with the scoring schemes they use.
ABSTRACT
MOTIVATION: Intratumor heterogeneity (ITH) represents the diversity of cell populations that make up cancer tissue. The level of ITH in a tumor is usually measured by a genomic variation profile, such as copy number variation and somatic mutation. However, a recent study identified ITH at the transcriptome level and suggested that ITH at the gene expression level is useful for predicting prognosis. Measuring ITH at the spliceome level is a natural extension. However, there are serious technical challenges in measuring spliceomic ITH (sITH) from bulk tumor RNA sequencing (RNA-seq) because of complex splicing patterns. RESULTS: We propose an information-theoretic method to measure the sITH of bulk tumors that overcomes these challenges. The method was extensively tested in experiments using synthetic data, xenograft tumor data, and TCGA pan-cancer data. We showed that sITH is closely related to cancer progression and clonal heterogeneity, as well as to clinically significant features such as cancer stage, survival outcome, and PAM50 subtype. As far as we know, this is the first study to define ITH at the spliceome level. This method can greatly improve the understanding of the cancer spliceome and has great potential as a diagnostic and prognostic tool.
Subjects
Tumor Biomarkers, Genetic Heterogeneity, Neoplasms/genetics, RNA Splicing, Algorithms, Computational Biology/methods, DNA Copy Number Variations, High-Throughput Nucleotide Sequencing, Humans, Neoplasms/metabolism, Reproducibility of Results, RNA Sequence Analysis, Spliceosomes
ABSTRACT
MOTIVATION: Characterizing cancer subclones is crucial for the ultimate conquest of cancer. Thus, a number of bioinformatic tools have been developed to infer heterogeneous tumor populations based on genomic signatures such as mutations and copy number variations. Despite accumulating evidence for the significance of global DNA methylation reprogramming in certain cancer types, including myeloid malignancies, none of these tools is designed to exploit subclonally reprogrammed methylation patterns to reveal the constituent populations of a tumor. In accordance with the notion of global methylation reprogramming, our preliminary observations on acute myeloid leukemia (AML) samples implied the existence of subclonally occurring focal methylation aberrations throughout the genome. RESULTS: We present PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing. PRISM adopts DNA methyltransferase 1-like hidden Markov model-based in silico proofreading to correct erroneous methylation patterns. With error-corrected methylation patterns, PRISM focuses on short individual genomic regions harboring dichotomous patterns that can be split into fully methylated and fully unmethylated patterns. The frequencies of these two patterns form a sufficient statistic for subclonal abundance. The set of statistics collected from each genomic region is modeled with a beta-binomial mixture. Fitting the mixture with the expectation-maximization algorithm finally provides the inferred composition of subclones. Applying PRISM to two AML samples, we demonstrate that PRISM could infer the evolutionary history of malignant samples from an epigenetic point of view. AVAILABILITY AND IMPLEMENTATION: PRISM is freely available on GitHub (https://github.com/dohlee/prism). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
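The mixture-fitting step can be illustrated with a deliberately simplified sketch. It swaps the beta-binomial mixture for a plain binomial mixture, whose M-step has a closed form, so it should be read as an illustration of expectation-maximization on per-region pattern counts and not as PRISM's model. The counts, the function names, and the initial values are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def fit_binomial_mixture(counts, init_p, n_iter=200):
    """EM for a binomial mixture (simplified stand-in for beta-binomial).

    counts: list of (fully_methylated_reads, total_reads), one per region.
    init_p: starting success probabilities, one per component.
    Returns (weights, probabilities).
    """
    p = list(init_p)
    w = [1.0 / len(p)] * len(p)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each region.
        resp = []
        for k, n in counts:
            logs = [math.log(wj) + log_binom_pmf(k, n, pj)
                    for wj, pj in zip(w, p)]
            top = max(logs)
            unnorm = [math.exp(value - top) for value in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: closed-form updates for weights and probabilities.
        for j in range(len(p)):
            rj = [r[j] for r in resp]
            w[j] = max(sum(rj) / len(counts), 1e-12)
            reads = sum(r * n for r, (_, n) in zip(rj, counts))
            if reads > 0:
                hits = sum(r * k for r, (k, _) in zip(rj, counts))
                p[j] = min(max(hits / reads, 1e-6), 1.0 - 1e-6)
    return w, p

# Toy regions: four near 20% and six near 70% fully methylated reads.
counts = [(10, 50), (9, 50), (12, 50), (11, 50),
          (35, 50), (36, 50), (33, 50), (34, 50), (37, 50), (35, 50)]
weights, probs = fit_binomial_mixture(counts, init_p=[0.3, 0.6])
```

A beta-binomial component adds an overdispersion parameter to absorb region-to-region variability in read sampling; fitting it requires a numerical M-step, which is why the simpler binomial form is used here.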
Subjects
DNA Copy Number Variations, DNA Methylation, Epigenomics, Genome, Genomics
ABSTRACT
Pathway-based analysis of high-throughput transcriptome data is a widely used approach to investigating biological mechanisms. Since a pathway consists of multiple functions, a recent approach is to determine condition-specific sub-pathways, or subpaths. However, there are several challenges. First, few existing methods utilize explicit gene expression information from RNA-seq. More importantly, subpath activity is usually an average of statistical scores (e.g., correlations) of the edges in a candidate subpath, which fails to reflect gene expression quantity information. In addition, none of the existing methods can handle multiple phenotypes. To address these technical problems, we designed and implemented an algorithm, MIDAS, that determines condition-specific subpaths, each of which has different activities across multiple phenotypes. MIDAS fully utilizes gene expression quantity information together with network centrality information to determine condition-specific subpaths. To test the performance of our tool, we used TCGA breast cancer RNA-seq gene expression profiles with five molecular subtypes. Thirty-six differentially activated subpaths were determined. The utility of MIDAS was demonstrated in four ways. First, all 36 subpaths are well supported by the literature. Subsequently, we showed that these subpaths had good discriminative power for classifying the five cancer subtypes and also had prognostic power in terms of survival analysis. Finally, in a performance comparison of MIDAS with a recent subpath prediction method, PATHOME, our method identified more subpaths and many more genes that are well supported by the literature. AVAILABILITY: http://biohealth.snu.ac.kr/software/MIDAS/.
Subjects
Algorithms, Breast Neoplasms/genetics, Data Mining/statistics & numerical data, Neoplastic Gene Expression Regulation, Gene Regulatory Networks, Neoplasm RNA/genetics, Breast Neoplasms/classification, Breast Neoplasms/metabolism, Breast Neoplasms/mortality, Data Mining/methods, Female, Gene Expression Profiling, High-Throughput Nucleotide Sequencing, Humans, Neoplasm RNA/metabolism, RNA Sequence Analysis, Signal Transduction, Software, Survival Analysis, Transcriptome
ABSTRACT
BACKGROUND: Identifying perturbed pathways in a given condition is crucial to understanding biological phenomena. In addition to identifying perturbed pathways individually, pathway analysis should consider interactions among pathways. Currently available pathway interaction prediction methods are based on the existence of overlapping genes between pathways, protein-protein interaction (PPI), or functional similarities. However, these approaches treat pathways merely as sets of genes and thus do not take topological features into account. In addition, most existing approaches do not handle the explicit gene expression quantity information that is routinely measured by RNA sequencing. RESULTS: To overcome these technical issues, we developed a new pathway interaction network construction method, PINTnet, using PPI, closeness centrality, and shortest paths. We tested our approach on three different high-throughput RNA-seq data sets: pregnant mouse data to reveal the role of serotonin in beta cell mass, bone-metastatic breast cancer data, and autoimmune thyroiditis data to study the role of IFN-α. Our approach successfully identified the pathways reported in the original papers. For the pathways not directly mentioned in the original papers, we were able to find evidence of pathway interactions through a literature search. Our method outperformed two existing approaches, the overlapping gene-based approach (OGB) and the protein-protein interaction-based approach (PB), in experiments with the three data sets. CONCLUSION: Our results show that PINTnet successfully identified condition-specific perturbed pathways and the interactions between the pathways. We believe that our method will be very useful in characterizing biological mechanisms at the pathway level. PINTnet is available at http://biohealth.snu.ac.kr/software/PINTnet/.
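The two graph quantities named above, closeness centrality and shortest paths, can be computed on an unweighted PPI graph with breadth-first search. The sketch below shows only these building blocks on a toy graph with hypothetical protein names; how PINTnet combines them with expression data to score pathway interactions is not described in the abstract and is not reproduced here.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(graph, node):
    """Closeness centrality: reachable nodes divided by summed distances."""
    dist = bfs_distances(graph, node)
    total = sum(d for n, d in dist.items() if n != node)
    return (len(dist) - 1) / total if total else 0.0

def shortest_path(graph, source, target):
    """One shortest path from source to target, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

# Toy undirected PPI graph (illustrative protein names only).
ppi = {
    "P1": ["P2"],
    "P2": ["P1", "P3", "P5"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
    "P5": ["P2"],
}
hub_score = closeness(ppi, "P2")
leaf_score = closeness(ppi, "P4")
path = shortest_path(ppi, "P1", "P4")
```

Here the hub P2 has higher closeness than the peripheral P4, which is the kind of topological signal that a gene-set-only view of pathways discards.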
Subjects
Computational Biology/methods, Protein Interaction Mapping/methods, Gene Expression Regulation, Machine Learning
ABSTRACT
Measuring gene expression, DNA sequence variation, and DNA methylation status is routinely done using high-throughput sequencing technologies. To analyze such multi-omics data and explore relationships among them, reliable bioinformatics systems are much needed. Existing systems are either for exploring curated data or for processing omics data in the form of a library, such as in R. Thus, scientists have great difficulty investigating relationships among gene expression, DNA sequence variation, and DNA methylation using multi-omics data. In this study, we report a system called BioVLAB-mCpG-SNP-EXPRESS for the integrated analysis of DNA methylation, sequence variation (SNPs), and gene expression for distinguishing cellular phenotypes at the pairwise and multiple phenotype levels. The system can be deployed on either the Amazon cloud or a publicly available high-performance computing node, and data analysis and exploration of the analysis results can be conveniently done using a web-based interface. To reduce analysis complexity, all processes are fully automated, and a graphical workflow system is integrated to show analysis progress in real time. The BioVLAB-mCpG-SNP-EXPRESS system works in three stages. First, it processes and analyzes multi-omics data provided as raw input, i.e., FastQ files. Second, various integrated analyses, such as methylation vs. gene expression and mutation vs. methylation, are performed. Finally, the analysis results can be explored in a number of ways through a web interface for multi-level, multi-perspective exploration. Multi-level interpretation can be done at the gene, gene set, pathway, or network level, and multi-perspective exploration can be done from the perspective of gene expression, DNA methylation, sequence variation, or their relationships. The utility of the system is demonstrated by analyzing a data set of 30 phenotypically distinct breast cancer cell lines. BioVLAB-mCpG-SNP-EXPRESS is available at http://biohealth.snu.ac.kr/software/biovlab_mcpg_snp_express/.
Subjects
Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Software, DNA Methylation/genetics, Genetic Databases, Genetic Variation, Humans, Single Nucleotide Polymorphism/genetics
ABSTRACT
AIMS: We compared four common methods for measuring DNA methylation levels and recommended the most efficient method in terms of cost and coverage. MATERIALS & METHODS: The DNA methylation status of liver and stomach tissues was profiled using four different methods: whole-genome bisulphite sequencing (WG-BS), targeted bisulphite sequencing (Targeted-BS), methylated DNA immunoprecipitation sequencing (MeDIP-seq), and methylated DNA immunoprecipitation bisulphite sequencing (MeDIP-BS). We calculated DNA methylation levels using each method and compared the results. RESULTS: MeDIP-BS yielded the DNA methylation profile most similar to that of WG-BS, with 20 times less data, suggesting remarkable cost savings and coverage efficiency compared with the other methods. CONCLUSION: MeDIP-BS is a practical, cost-effective method for analyzing whole-genome DNA methylation that is highly accurate at base-pair resolution.
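For bisulphite-based count data, a per-site methylation level is commonly computed as methylated reads divided by total reads, and two profiles can be compared by correlating the levels at shared sites. The sketch below illustrates that calculation under those assumptions; the abstract does not state the exact metric the authors used, enrichment-based data such as MeDIP-seq would need a different estimator, and the site names and counts are hypothetical.

```python
import math

def methylation_levels(counts):
    """Per-site level = methylated / (methylated + unmethylated) reads.

    counts: {site: (methylated_reads, unmethylated_reads)}
    Sites with no coverage are skipped.
    """
    return {site: m / (m + u) for site, (m, u) in counts.items() if m + u > 0}

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def compare_profiles(counts_a, counts_b):
    """Correlate methylation levels over sites covered by both methods."""
    levels_a = methylation_levels(counts_a)
    levels_b = methylation_levels(counts_b)
    shared = sorted(set(levels_a) & set(levels_b))
    r = pearson([levels_a[s] for s in shared], [levels_b[s] for s in shared])
    return r, len(shared)

# Toy read counts for a deep reference and a shallower method.
reference = {"chr1:100": (18, 2), "chr1:250": (5, 15),
             "chr1:400": (10, 10), "chr1:900": (0, 20)}
shallow = {"chr1:100": (9, 1), "chr1:250": (1, 3), "chr1:400": (2, 2)}

r, n_shared = compare_profiles(reference, shallow)
```

The number of shared sites gives a rough view of coverage, while the correlation gives a rough view of agreement; both are needed to weigh accuracy against the amount of sequencing data.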
Subjects
DNA Methylation, High-Throughput Nucleotide Sequencing/methods, Liver Neoplasms/genetics, DNA Sequence Analysis/methods, Stomach Neoplasms/genetics, Genome, High-Throughput Nucleotide Sequencing/standards, Humans, DNA Sequence Analysis/standards
ABSTRACT
BACKGROUND: Aberrant epigenetic modifications, including DNA methylation, are key regulators of gene activity in tumorigenesis. Breast cancer is a heterogeneous disease, and large-scale analyses indicate that tumors can be distinguished from normal and benign tissues, and molecular subtypes of breast cancer from one another, based on their distinct genomic, transcriptomic, and epigenomic profiles. In this study, we used affinity-based methylation sequencing data from 30 breast cancer cell lines representing functionally distinct cancer subtypes to investigate methylation and mutation patterns at the whole-genome level. RESULTS: Our analysis revealed significant differences in CpG island (CpGI) shore methylation and mutation patterns among breast cancer subtypes. In particular, the basal-like B type, a highly aggressive form of the disease, displayed distinct CpGI shore hypomethylation patterns that were significantly associated with downstream gene regulation. We determined that mutation rates at CpG sites were highly correlated with DNA methylation status and observed distinct mutation rates among the breast cancer subtypes. These findings were validated by targeted bisulfite sequencing of differentially expressed genes (n=85) among the cell lines. CONCLUSIONS: Our results suggest that alterations in DNA methylation play critical roles in gene regulatory processes as well as in cytosine substitution rates at CpG sites in molecular subtypes of breast cancer.