ABSTRACT
MOTIVATION: Cancer genomes exhibit a large number of different alterations that affect many genes in a diverse manner. An improved understanding of the generative mechanisms behind the mutation rules and their influence on gene community behavior is of great importance for the study of cancer. RESULTS: To expand our capability to analyze combinatorial patterns of cancer alterations, we developed a rigorous methodology for cancer mutation pattern discovery based on a new, constrained form of correlation clustering. Our new algorithm, named C3 (Cancer Correlation Clustering), leverages mutual exclusivity of mutations, patient coverage and driver network concentration principles. To test C3, we performed a detailed analysis on TCGA breast cancer and glioblastoma data and showed that our algorithm outperforms the state-of-the-art CoMEt method in terms of discovering mutually exclusive gene modules and identifying biologically relevant driver genes. The proposed agnostic clustering method represents a unique tool for efficient and reliable identification of mutation patterns and driver pathways in large-scale cancer genomics studies, and it may also be used for other clustering problems on biological graphs. AVAILABILITY AND IMPLEMENTATION: The source code for the C3 method can be found at https://github.com/jackhou2/C3 CONTACTS: jianma@cs.cmu.edu or milenkov@illinois.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
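As a rough illustration of two of the principles named in this abstract, mutual exclusivity and patient coverage, the sketch below scores a candidate gene set on patient-level mutation data. It is not the C3 objective or its correlation-clustering formulation; the function name `coverage_and_exclusivity`, the two score definitions, and the toy patient data are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def coverage_and_exclusivity(
    mutations: Dict[str, Set[str]], gene_set: List[str]
) -> Tuple[float, float]:
    """Score a gene set on hypothetical patient -> mutated-genes data.

    coverage    = fraction of patients with at least one mutated gene in the set
    exclusivity = fraction of covered patients with exactly one such gene
    (Both definitions are assumptions for this sketch, not taken from C3.)
    """
    genes = set(gene_set)
    # Number of genes from the set that are mutated in each patient.
    hits = [len(mutated & genes) for mutated in mutations.values()]
    covered = sum(1 for h in hits if h >= 1)
    exclusive = sum(1 for h in hits if h == 1)
    coverage = covered / len(hits) if hits else 0.0
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


# Toy data (made up): four patients and the genes mutated in each.
toy = {
    "P1": {"TP53"},
    "P2": {"PIK3CA"},
    "P3": {"TP53", "PIK3CA"},
    "P4": {"GATA3"},
}
cov, excl = coverage_and_exclusivity(toy, ["TP53", "PIK3CA"])
```

In this toy case three of the four patients are covered, and two of those three carry exactly one mutation from the set, so a set with high coverage can still fall short of perfect exclusivity.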
Subject(s)
Algorithms , Breast Neoplasms/genetics , Cluster Analysis , Computational Biology/methods , DNA Mutational Analysis/methods , Glioblastoma/genetics , Female , Gene Regulatory Networks , Humans , Mutation
ABSTRACT
A large number of DNA copy number alterations (CNAs) exist in human breast cancers, and thus characterizing the most frequent CNAs is key to advancing therapeutics because it is likely that these regions contain breast tumor 'drivers' (i.e., cancer causal genes). This study aims to characterize the genomic landscape of breast cancer CNAs and identify potential subtype-specific drivers using a large set of human breast tumors and genetically engineered mouse (GEM) mammary tumors. Using a novel method called SWITCHplus, we identified subtype-specific DNA CNAs occurring at a 15% or greater frequency, which excluded many well-known breast cancer-related drivers such as amplification of ERBB2, and deletions of TP53 and RB1. A comparison of CNAs between mouse and human breast tumors identified regions with shared subtype-specific CNAs. Additional criteria that included gene expression-to-copy number correlation, a DawnRank network analysis, and RNA interference functional studies highlighted candidate driver genes that fulfilled these multiple criteria. Numerous regions of shared CNAs were observed between human breast tumors and GEM mammary tumor models that shared similar gene expression features. Specifically, we identified chromosome 1q21-23 as a Basal-like subtype-enriched region with multiple potential driver genes including PI4KB, SHC1, and NCSTN. This step-wise computational approach based on a cross-species comparison is applicable to any tumor type for which sufficient human and model system DNA copy number data exist, and in this instance, highlights that a single region of amplification may in fact harbor multiple driver genes.
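The 15% frequency cutoff mentioned in this abstract can be pictured with a minimal sketch that keeps only the regions altered in at least that fraction of tumors within one subtype. This is not SWITCHplus, whose internals are not described here; the function name `frequent_regions`, the 0/1 call format, and the region labels are hypothetical, and the calls are made up.

```python
from typing import Dict, List


def frequent_regions(
    calls: Dict[str, List[int]], min_freq: float = 0.15
) -> Dict[str, float]:
    """Return regions altered in at least `min_freq` of the tumors.

    `calls` maps a region label to per-tumor 0/1 alteration calls
    (a hypothetical input format assumed for this sketch).
    """
    kept = {}
    for region, flags in calls.items():
        if not flags:
            continue  # no tumors called for this region
        freq = sum(flags) / len(flags)
        if freq >= min_freq:
            kept[region] = freq
    return kept


# Made-up calls for ten tumors of a single subtype.
toy_calls = {
    "region_A": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],  # altered in 4 of 10
    "region_B": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # altered in 1 of 10
}
kept = frequent_regions(toy_calls)
```

Running the filter per subtype, rather than over all tumors pooled, is what would make the resulting regions subtype-specific in a setup like this.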
Subject(s)
Breast Neoplasms/genetics , Cell Transformation, Neoplastic/genetics , Chromosome Mapping , Chromosomes, Human, Pair 1 , Oncogenes , Animals , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Computational Biology , DNA Copy Number Variations , Databases, Nucleic Acid , Female , Gene Dosage , Gene Regulatory Networks , Humans , Mice , Neoplasms, Basal Cell/genetics , Neoplasms, Basal Cell/metabolism , Neoplasms, Basal Cell/pathology , Receptors, Notch/genetics , Receptors, Notch/metabolism , Signal Transduction , Species Specificity
ABSTRACT
BACKGROUND: Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among genes, which is key to understanding the molecular perturbations in cancer, has rarely been considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification. RESULTS: We have developed a new clustering algorithm for cancer subtype identification, called "network-assisted co-clustering for the identification of cancer subtypes" (NCIS). NCIS combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. Then a new weighted co-clustering algorithm based on a semi-nonnegative matrix tri-factorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. Compared with consensus hierarchical clustering, NCIS was shown to better separate the patient samples into clinically distinct subtypes and, on the simulated datasets, to tolerate noise better and achieve higher accuracy. CONCLUSIONS: The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data.
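The idea of weighting genes by their place in a network before clustering can be sketched as follows. The abstract does not say how NCIS measures a gene's impact, so node degree is used here purely as a stand-in; the helper names `degree_weights` and `apply_weights`, the gene labels, and the toy values are assumptions, and the co-clustering step itself (the semi-nonnegative matrix tri-factorization) is not shown.

```python
from typing import Dict, List, Tuple


def degree_weights(
    edges: List[Tuple[str, str]], genes: List[str]
) -> Dict[str, float]:
    """Weight each gene by its share of edge endpoints in an undirected network.

    Degree is only a stand-in "impact" measure assumed for this sketch.
    Weights sum to 1 when at least one edge touches a listed gene.
    """
    degree = {g: 0 for g in genes}
    for a, b in edges:
        if a in degree:
            degree[a] += 1
        if b in degree:
            degree[b] += 1
    total = sum(degree.values())
    return {g: (d / total if total else 0.0) for g, d in degree.items()}


def apply_weights(
    expr: Dict[str, List[float]], weights: Dict[str, float]
) -> Dict[str, List[float]]:
    """Scale each gene's expression row (one value per sample) by its weight."""
    return {g: [weights[g] * v for v in row] for g, row in expr.items()}


# Toy network and expression values (made up): G1 is a hub linked to G2 and G3.
genes = ["G1", "G2", "G3"]
weights = degree_weights([("G1", "G2"), ("G1", "G3")], genes)
weighted = apply_weights(
    {"G1": [2.0, -4.0], "G2": [4.0, 8.0], "G3": [0.0, 1.0]}, weights
)
```

A weighted matrix of this kind would then be the input to whichever clustering routine is chosen, so that highly connected genes contribute more to the grouping of samples.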
Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Neoplasms/genetics , Neoplasms/metabolism , Cluster Analysis , Female , Gene Regulatory Networks , Humans
ABSTRACT
Large-scale cancer genomic studies have revealed that the genetic heterogeneity of the same type of cancer is greater than previously thought. A key question in cancer genomics is the identification of driver genes. Although existing methods have identified many common drivers, it remains challenging to predict personalized drivers to assess rare and even patient-specific mutations. We developed a new algorithm called DawnRank to directly prioritize altered genes on a single patient level. Applications to TCGA datasets demonstrated the effectiveness of our method. We believe DawnRank complements existing driver identification methods and will help us discover personalized causal mutations that would otherwise be obscured by tumor heterogeneity. Source code can be accessed at http://bioen-compbio.bioen.illinois.edu/DawnRank/.