ABSTRACT
Diversity in the genetic lesions that cause cancer is extreme. In consequence, a pressing challenge is the development of drugs that target patient-specific disease mechanisms. To address this challenge, we employed a chemistry-first discovery paradigm for de novo identification of druggable targets linked to robust patient selection hypotheses. In particular, a 200,000 compound diversity-oriented chemical library was profiled across a heavily annotated test-bed of >100 cellular models representative of the diverse and characteristic somatic lesions for lung cancer. This approach led to the delineation of 171 chemical-genetic associations, shedding light on the targetability of mechanistic vulnerabilities corresponding to a range of oncogenotypes present in patient populations lacking effective therapy. Chemically addressable addictions to ciliogenesis in TTC21B mutants and GLUT8-dependent serine biosynthesis in KRAS/KEAP1 double mutants are prominent examples. These observations indicate a wealth of actionable opportunities within the complex molecular etiology of cancer.
Subject(s)
Carcinoma, Non-Small-Cell Lung/pathology , Cell Proliferation/drug effects , Lung Neoplasms/pathology , Small Molecule Libraries/pharmacology , Carcinoma, Non-Small-Cell Lung/metabolism , Cell Line, Tumor , Cytochrome P450 Family 4/deficiency , Cytochrome P450 Family 4/genetics , Drug Discovery , G1 Phase Cell Cycle Checkpoints/drug effects , Glucocorticoids/pharmacology , Glucose Transport Proteins, Facilitative/antagonists & inhibitors , Glucose Transport Proteins, Facilitative/genetics , Glucose Transport Proteins, Facilitative/metabolism , Humans , Kelch-Like ECH-Associated Protein 1/genetics , Kelch-Like ECH-Associated Protein 1/metabolism , Lung Neoplasms/metabolism , Microtubule-Associated Proteins/genetics , Microtubule-Associated Proteins/metabolism , Mutation , NF-E2-Related Factor 2/antagonists & inhibitors , NF-E2-Related Factor 2/genetics , NF-E2-Related Factor 2/metabolism , Proto-Oncogene Proteins p21(ras)/genetics , Proto-Oncogene Proteins p21(ras)/metabolism , RNA Interference , RNA, Small Interfering/metabolism , Receptor, Notch2/genetics , Receptor, Notch2/metabolism , Receptors, Glucocorticoid/antagonists & inhibitors , Receptors, Glucocorticoid/genetics , Receptors, Glucocorticoid/metabolism , Small Molecule Libraries/chemistry , Small Molecule Libraries/metabolism
ABSTRACT
Human papillomavirus (HPV) causes 5% of all cancers and frequently integrates into host chromosomes. The HPV oncoproteins E6 and E7 are necessary but insufficient for cancer formation, indicating that additional secondary genetic events are required. Here, we investigate potential oncogenic impacts of virus integration. Analysis of 105 HPV-positive oropharyngeal cancers by whole-genome sequencing detects virus integration in 77%, revealing five statistically significant sites of recurrent integration near genes that regulate epithelial stem cell maintenance (i.e., SOX2, TP63, FGFR, MYC) and immune evasion (i.e., CD274). Genomic copy number hyperamplification is enriched 16-fold near HPV integrants, and the extent of focal host genomic instability increases with their local density. The frequency of genes expressed at extreme outlier levels is increased 86-fold within ±150 kb of integrants. Across 95% of tumors with integration, host gene transcription is disrupted via intragenic integrants, chimeric transcription, outlier expression, gene breaking, and/or de novo expression of noncoding or imprinted genes. We conclude that virus integration can contribute to carcinogenesis in a large majority of HPV-positive oropharyngeal cancers by inducing extensive disruption of host genome structure and gene expression.
Subject(s)
Alphapapillomavirus , Oncogene Proteins, Viral , Oropharyngeal Neoplasms , Alphapapillomavirus/metabolism , Carcinogenesis , Humans , Oncogene Proteins, Viral/genetics , Oropharyngeal Neoplasms/genetics , Papillomaviridae/genetics , Papillomaviridae/metabolism , Papillomavirus E7 Proteins/genetics , Papillomavirus E7 Proteins/metabolism , Virus Integration/genetics
ABSTRACT
In chronic lymphocytic leukemia (CLL), epigenetic alterations are considered to centrally shape the transcriptional signatures that drive disease evolution and underlie its biological and clinical subsets. Characterizations of epigenetic regulators, particularly histone-modifying enzymes, are very rudimentary in CLL. In efforts to establish effectors of the CLL-associated oncogene T-cell leukemia 1A (TCL1A), we identified here the lysine-specific histone demethylase KDM1A to interact with the TCL1A protein in B cells in conjunction with an increased catalytic activity of KDM1A. We demonstrate that KDM1A is upregulated in malignant B cells. Elevated KDM1A and associated gene expression signatures correlated with aggressive disease features and adverse clinical outcomes in a large prospective CLL trial cohort. Genetic Kdm1a knockdown in Eµ-TCL1A mice reduced leukemic burden and prolonged animal survival, accompanied by upregulated p53 and proapoptotic pathways. Genetic KDM1A depletion also affected milieu components (T, stromal, and monocytic cells), resulting in significant reductions in their capacity to support CLL-cell survival and proliferation. Integrated analyses of differential global transcriptomes (RNA sequencing) and H3K4me3 marks (chromatin immunoprecipitation sequencing) in Eµ-TCL1A vs iKdm1aKD;Eµ-TCL1A mice (confirmed in human CLL) implicate KDM1A as an oncogenic transcriptional repressor in CLL which alters histone methylation patterns with pronounced effects on defined cell death and motility pathways. Finally, pharmacologic KDM1A inhibition altered H3K4/9 target methylation and revealed marked anti-B-cell leukemic synergisms. Overall, we established the pathogenic role and effector networks of KDM1A in CLL via tumor-cell intrinsic mechanisms and its impacts in cells of the microenvironment. Our data also provide rationales to further investigate therapeutic KDM1A targeting in CLL.
Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell , Humans , Mice , Animals , Leukemia, Lymphocytic, Chronic, B-Cell/drug therapy , Histones/metabolism , Lysine , Prospective Studies , Histone Demethylases/genetics , Histone Demethylases/metabolism , Tumor Microenvironment
ABSTRACT
OBJECTIVE: We aimed to assess the levels of MDM2-DNA within extracellular vesicles (EVs) isolated from the serum of retroperitoneal liposarcoma (RLS) patients versus healthy donors, as well as within the same patients at the time of surgery versus post-operative surveillance visits, in order to determine whether EV-MDM2 may serve as a first-ever biomarker of liposarcoma recurrence. BACKGROUND: A hallmark of well-differentiated and de-differentiated (WD/DD) retroperitoneal liposarcoma is elevated MDM2 due to genome amplification, with recurrence rates of >50% even after complete resection. Imaging technologies frequently cannot resolve recurrent WD/DD-RLS versus postoperative scarring. Early detection of recurrent lesions, for which biomarkers are lacking, would guide surveillance and treatment decisions. METHODS: WD/DD-RLS serum samples were collected both at the time of surgery and during follow-up visits from 42 patients, along with sera from healthy donors (n=14). EVs were isolated, DNA was purified, and MDM2-DNA levels were determined through q-PCR analysis. Non-parametric tests were employed to compare EV-MDM2 DNA levels in patients versus the control group, as well as at the time of surgery versus post-surgery. RESULTS: EV-MDM2 levels were significantly higher in WD/DD-RLS than in controls (P=0.00085). Moreover, EV-MDM2 levels were markedly decreased in WD/DD-RLS patients after resection (P=0.00036), reaching values comparable to those of the control group (P=0.124). During post-operative surveillance, significant increases in EV-MDM2 were observed in some patients, correlating with CT scan evidence of recurrent or persistent post-resection disease. CONCLUSIONS: Serum EV-MDM2 may serve as a potential biomarker of early recurrent or post-operatively persistent WD/DD-RLS, a disease currently lacking such determinants.
ABSTRACT
Acute myeloid leukemia (AML) is a molecularly complex disease characterized by heterogeneous tumor genetic profiles and involving numerous pathogenic mechanisms and pathways. Integration of molecular data types across multiple patient cohorts may advance current genetic approaches for improved subclassification and understanding of the biology of the disease. Here, we analyzed genome-wide DNA methylation in 649 AML patients using Illumina arrays and identified a configuration of 13 subtypes (termed "epitypes") using unbiased clustering. Integration of genetic data revealed that most epitypes were associated with a certain recurrent mutation (or combination) in a majority of patients, yet other epitypes were largely independent. Epitypes showed developmental blockage at discrete stages of myeloid differentiation, revealing epitypes that retain arrested hematopoietic stem-cell-like phenotypes. Detailed analyses of DNA methylation patterns identified unique patterns of aberrant hyper- and hypomethylation among epitypes, with variable involvement of transcription factors influencing promoter, enhancer, and repressed regions. Patients in epitypes with stem-cell-like methylation features showed inferior overall survival along with up-regulated stem cell gene expression signatures. We further identified a DNA methylation signature involving STAT motifs associated with FLT3-ITD mutations. Finally, DNA methylation signatures were stable at relapse for the large majority of patients, and rare epitype switching accompanied loss of the dominant epitype mutations and reversion to stem-cell-like methylation patterns. These results show that DNA methylation-based classification integrates important molecular features of AML to reveal the diverse pathogenic and biological aspects of the disease.
Subject(s)
DNA Methylation , Leukemia, Myeloid, Acute , Humans , Leukemia, Myeloid, Acute/metabolism , Mutation , Promoter Regions, Genetic
ABSTRACT
MOTIVATION: Clustered regularly interspaced short palindromic repeats (CRISPR)-based genetic perturbation screening is a powerful tool to probe gene function. However, experimental noise, especially for lowly expressed genes, needs to be accounted for to maintain proper control of the false positive rate. METHODS: We develop a statistical method, named CRISPR screen with Expression Data Analysis (CEDA), to integrate gene expression profiles and CRISPR screen data for identifying essential genes. CEDA stratifies genes based on expression level and adopts a three-component mixture model for the log-fold change of single-guide RNAs (sgRNAs). An empirical Bayesian prior and the expectation-maximization algorithm are used for parameter estimation and false discovery rate inference. RESULTS: Taking advantage of gene expression data, CEDA identifies essential genes with higher expression. Compared to existing methods, CEDA shows comparable reliability but higher sensitivity in detecting essential genes with moderate sgRNA fold change. Therefore, using the same CRISPR data, CEDA generates an additional hit gene list. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
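The mixture-model step described above can be illustrated with a minimal sketch. It is not the CEDA implementation: it assumes Gaussian components (the abstract does not state the component family), fits a single expression stratum, and omits the empirical Bayesian prior. The function names and the percentile-based initialization are hypothetical choices made for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_three_component_mixture(values, n_iter=100, min_sigma=1e-3):
    """Fit a three-component Gaussian mixture to log-fold changes by EM.

    Assumption (not from the source): components are Gaussian and are read as
    depleted / null / enriched sgRNAs. Returns (weights, means, sigmas, resp),
    where resp[i][k] is the posterior probability that values[i] belongs to
    component k.
    """
    xs = sorted(values)
    n = len(xs)
    # Initialize the means at the 10th, 50th and 90th percentiles.
    mus = [xs[n // 10], xs[n // 2], xs[(9 * n) // 10]]
    sigmas = [1.0, 1.0, 1.0]
    weights = [1.0 / 3.0] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every observation.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and spread of each component.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigmas[k] = max(math.sqrt(var), min_sigma)
    return weights, mus, sigmas, resp
```

In this sketch, resp[i][1] is the posterior probability that observation i comes from the middle (null) component, which is the kind of quantity a false discovery rate calculation could build on.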
Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Genes, Essential , Bayes Theorem , CRISPR-Cas Systems , Gene Expression , Reproducibility of Results , RNA, Small Untranslated/genetics
ABSTRACT
Human papillomavirus (HPV) is a necessary but insufficient cause of a subset of oral squamous cell carcinomas (OSCCs) that is increasing markedly in frequency. To identify contributory, secondary genetic alterations in these cancers, we used comprehensive genomics methods to compare 149 HPV-positive and 335 HPV-negative OSCC tumor/normal pairs. Different behavioral risk factors underlying the two OSCC types were reflected in distinctive genomic mutational signatures. In HPV-positive OSCCs, the signatures of APOBEC cytosine deaminase editing, associated with anti-viral immunity, were strongly linked to overall mutational burden. In contrast, in HPV-negative OSCCs, T>C substitutions in the sequence context 5'-ATN-3' correlated with tobacco exposure. Universal expression of HPV E6*1 and E7 oncogenes was a sine qua non of HPV-positive OSCCs. Significant enrichment of somatic mutations was confirmed or newly identified in PIK3CA, KMT2D, FGFR3, FBXW7, DDX3X, PTEN, TRAF3, RB1, CYLD, RIPK4, ZNF750, EP300, CASZ1, TAF5, RBL1, IFNGR1, and NFKBIA. Of these, many affect host pathways already targeted by HPV oncoproteins, including the p53 and pRB pathways, or disrupt host defenses against viral infections, including interferon (IFN) and nuclear factor kappa B signaling. Frequent copy number changes were associated with concordant changes in gene expression. Chr 11q (including CCND1) and 14q (including DICER1 and AKT1) were recurrently lost in HPV-positive OSCCs, in contrast to their gains in HPV-negative OSCCs. High-ranking variant allele fractions implicated ZNF750, PIK3CA, and EP300 mutations as candidate driver events in HPV-positive cancers. We conclude that virus-host interactions cooperatively shape the unique genetic features of these cancers, distinguishing them from their HPV-negative counterparts.
Subject(s)
Carcinoma, Squamous Cell , Mouth Neoplasms , Neoplasm Proteins , Oncogene Proteins, Viral , Papillomavirus Infections , Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/metabolism , Carcinoma, Squamous Cell/pathology , Carcinoma, Squamous Cell/virology , Female , Humans , Male , Mouth Neoplasms/genetics , Mouth Neoplasms/metabolism , Mouth Neoplasms/pathology , Mouth Neoplasms/virology , Mutation , Neoplasm Proteins/biosynthesis , Neoplasm Proteins/genetics , Oncogene Proteins, Viral/biosynthesis , Oncogene Proteins, Viral/genetics , Papillomaviridae/genetics , Papillomaviridae/metabolism
ABSTRACT
SUMMARY: Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. AVAILABILITY AND IMPLEMENTATION: Freely available at https://CRAN.R-project.org/package=RCytoGPS. The code for the underlying CytoGPS software can be found at https://github.com/i2-wustl/CytoGPS.
Subject(s)
Reading , Software , Humans , Karyotyping , Karyotype
ABSTRACT
SUMMARY: Unsupervised machine learning provides tools for researchers to uncover latent patterns in large-scale data, based on calculated distances between observations. Methods to visualize high-dimensional data based on these distances can elucidate subtypes and interactions within multi-dimensional and high-throughput data. However, researchers can select from a vast number of distance metrics and visualizations, each with its own strengths and weaknesses. The Mercator R package facilitates selection of a biologically meaningful distance from 10 metrics, together appropriate for binary, categorical and continuous data, and visualization with 5 standard and high-dimensional graphics tools. Mercator provides a user-friendly pipeline for informaticians or biologists to perform unsupervised analyses, from exploratory pattern recognition to production of publication-quality graphics. AVAILABILITY AND IMPLEMENTATION: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).
ABSTRACT
We present a novel model of time-series analysis to learn from electronic health record (EHR) data when infection occurred in the intensive care unit (ICU) by translating methods from proteomics and Bayesian statistics. Using 48,536 patients hospitalized in an ICU, we describe each hospital course as an 'alphabet' of 23 physician actions ('events') in temporal order. We analyze these as k-mers of length 3-12 events and apply a Bayesian model of (cumulative) relative risk (RR). The log2-transformed RR (median=0.248, mean=0.226) supported the conclusion that the events selected were individually associated with increased risk of infection. Selecting from all possible cutoffs of maximum gain (MG), MG>0.0244 predicts administration of antibiotics with PPV 82.0%, NPV 44.4%, and AUC 0.706. Our approach holds value for retrospective analysis of other clinical syndromes for which time-of-onset is critical to analysis but poorly marked in EHRs, including delirium and decompensation.
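As a rough illustration of the k-mer representation described above, the sketch below enumerates contiguous runs of 3 to 12 events from a hospital course and computes a log2 relative risk for one k-mer. The relative risk here is a plain frequency ratio with a pseudocount, used as a stand-in; it is not the Bayesian (cumulative) RR model of the paper, and the function names and single-letter event codes are hypothetical.

```python
import math


def event_kmers(events, k_min=3, k_max=12):
    """Yield every contiguous run of k events, for k_min <= k <= k_max."""
    for k in range(k_min, k_max + 1):
        for start in range(len(events) - k + 1):
            yield tuple(events[start:start + k])


def log2_relative_risk(courses, infected, kmer, pseudo=0.5):
    """log2 of P(infection | k-mer present) / P(infection | k-mer absent).

    Assumption: a simple smoothed frequency ratio, not the source's Bayesian
    model. `courses` is a list of event sequences, `infected` a parallel list
    of booleans, and `pseudo` a pseudocount that avoids division by zero.
    """
    k = len(kmer)
    a = b = c = d = 0
    for events, outcome in zip(courses, infected):
        present = tuple(kmer) in set(event_kmers(events, k, k))
        if present and outcome:
            a += 1
        elif present:
            b += 1
        elif outcome:
            c += 1
        else:
            d += 1
    risk_present = (a + pseudo) / (a + b + 2 * pseudo)
    risk_absent = (c + pseudo) / (c + d + 2 * pseudo)
    return math.log2(risk_present / risk_absent)
```

A positive value means the k-mer is seen more often in courses labeled as infected than in the rest, under the smoothing assumption above.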
Subject(s)
Electronic Health Records , Intensive Care Units , Humans , Retrospective Studies , Bayes Theorem
ABSTRACT
BACKGROUND: There have been many recent breakthroughs in processing and analyzing large-scale data sets in biomedical informatics. For example, the CytoGPS algorithm has enabled the use of text-based karyotypes by transforming them into a binary model. However, such advances are accompanied by new problems of data sparsity, heterogeneity, and noisiness that are magnified by the large-scale multidimensional nature of the data. To address these problems, we developed the Mercator R package, which processes and visualizes binary biomedical data. We use Mercator to address biomedical questions of cytogenetic patterns relating to lymphoid hematologic malignancies, which include a broad set of leukemias and lymphomas. Karyotype data are one of the most common forms of genetic data collected on lymphoid malignancies, because karyotyping is part of the standard of care in these cancers. RESULTS: In this paper we combine the analytic power of CytoGPS and Mercator to perform a large-scale multidimensional pattern recognition study on 22,741 karyotype samples in 47 different hematologic malignancies obtained from the public Mitelman database. CONCLUSION: Our findings indicate that Mercator was able to identify both known and novel cytogenetic patterns across different lymphoid malignancies, furthering our understanding of the genetics of these diseases.
Subject(s)
Hematologic Diseases , Karyotyping , Neoplasms , Chromosome Aberrations , Humans , Karyotype
ABSTRACT
Alterations in global DNA methylation patterns are a major hallmark of cancer and represent attractive biomarkers for personalized risk stratification. Chronic lymphocytic leukemia (CLL) risk stratification studies typically focus on time to first treatment (TTFT), time to progression (TTP) after treatment, and overall survival (OS). Whereas TTFT risk stratification remains similar over time, TTP and OS have changed dramatically with the introduction of targeted therapies, such as the Bruton tyrosine kinase inhibitor ibrutinib. We have shown that genome-wide DNA methylation patterns in CLL are strongly associated with phenotypic differentiation and patient outcomes. Here, we developed a novel assay, termed methylation-iPLEX (Me-iPLEX), for high-throughput quantification of targeted panels of single cytosine guanine dinucleotides from multiple independent loci. Me-iPLEX was used to classify CLL samples into 1 of 3 known epigenetic subtypes (epitypes). We examined the impact of epitype in 1286 CLL patients from 4 independent cohorts representing a comprehensive view of CLL disease course and therapies. We found that epitype significantly predicted TTFT and OS among newly diagnosed CLL patients. Additionally, epitype predicted TTP and OS with 2 common CLL therapies: chemoimmunotherapy and ibrutinib. Epitype retained significance after stratifying by biologically related biomarkers, immunoglobulin heavy chain mutational status, and ZAP70 expression, as well as other common prognostic markers. Furthermore, among several biological traits enriched between epitypes, we found highly biased immunogenetic features, including IGLV3-21 usage in the poorly characterized intermediate-programmed CLL epitype. In summary, Me-iPLEX is an elegant method to assess epigenetic signatures, including robust classification of CLL epitypes that independently stratify patient risk at diagnosis and time of treatment.
Subject(s)
DNA Methylation , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Biomarkers, Tumor/genetics , Disease Progression , Epigenesis, Genetic , Genetic Loci , Genetic Testing , Humans , Leukemia, Lymphocytic, Chronic, B-Cell/diagnosis , Prognosis
ABSTRACT
INTRODUCTION: Clustering analyses in clinical contexts hold promise to improve the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type, clinical data. METHODS: We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms (hierarchical (HC), k-medoids, self-organizing maps (SOM)). We quantitatively and visually validate by Adjusted Rand Index (ARI) and silhouette width (SW). We applied our best methods to two real-world data sets: (1) 21 features collected on 247 patients with chronic lymphocytic leukemia, and (2) 40 features collected on 6000 patients admitted to an intensive care unit. RESULTS: HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for mixed data types for all mixtures except unbalanced mixtures dominated by continuous data. Compared to other methods, DAISY with HC uncovered superior, separable clusters in both real-world data sets. DISCUSSION: Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and get maximum use of their data. Superior metrics for mixed-type data handle multiple data types using multiple, type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
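The Adjusted Rand Index used above for validation can be sketched from its standard contingency-table definition. This is a generic illustration rather than the study's code; the function name is hypothetical, and the handling of degenerate inputs (fewer than two items, or a single cluster on both sides) is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means identical partitions (up to renaming of labels); values near 0
    are what random assignment would give; negative values are possible.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumption: treat trivial inputs as perfect agreement
    # Pairs of items placed together in both, in the true, and in the predicted partition.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0  # assumption: both partitions are trivial
    return (together_both - expected) / (maximum - expected)
```

For example, adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]) is 1.0 because the two partitions are the same apart from label names.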
Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell , Algorithms , Cluster Analysis , Computer Simulation , Humans
ABSTRACT
BACKGROUND: In the intensive care unit (ICU), delirium is a common, acute, confusional state associated with high risk for short- and long-term morbidity and mortality. Machine learning (ML) has promise to address research priorities and improve delirium outcomes. However, due to clinical and billing conventions, delirium is often inconsistently or incompletely labeled in electronic health record (EHR) datasets. Here, we identify clinical actions abstracted from clinical guidelines in EHR data that indicate risk of delirium among ICU patients. We develop a novel prediction model to label patients with delirium based on a large data set and assess model performance. METHODS: EHR data on 48,451 admissions from 2001 to 2012, available through the Medical Information Mart for Intensive Care-III database (MIMIC-III), were used to identify features to develop our prediction models. Five binary ML classification models (Logistic Regression; Classification and Regression Trees; Random Forests; Naïve Bayes; and Support Vector Machines) were fit and ranked by Area Under the Curve (AUC) scores. We compared our best model with two models previously proposed in the literature for goodness of fit, precision, and through biological validation. RESULTS: Our best performing model with threshold reclassification for predicting delirium was based on a multiple logistic regression using the 31 clinical actions (AUC 0.83). Our model outperformed other proposed models by biological validation on clinically meaningful, delirium-associated outcomes. CONCLUSIONS: Hurdles in identifying accurate labels in large-scale datasets limit clinical applications of ML in delirium. We developed a novel labeling model for delirium in the ICU using a large, public data set. 
By using guideline-directed clinical actions independent from risk factors, treatments, and outcomes as model predictors, our classifier could be used as a delirium label for future clinically targeted models.
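Ranking classifiers by AUC, as described in the methods above, can be illustrated with a small sketch. It computes the area under the ROC curve as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties count one half). The function names and the toy model names are hypothetical, and the scores would in practice come from fitted models evaluated on held-out data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


def rank_models(labels, model_scores):
    """Return (auc, model_name) pairs sorted from best to worst AUC.

    `model_scores` maps a model name to its predicted scores for the same cases.
    """
    return sorted(
        ((roc_auc(labels, scores), name) for name, scores in model_scores.items()),
        reverse=True,
    )
```

The pairwise form is quadratic in the number of cases, which is acceptable for a sketch; a rank-based computation would be preferable for tens of thousands of admissions.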
Subject(s)
Delirium , Intensive Care Units , Bayes Theorem , Delirium/diagnosis , Electronic Health Records , Humans , Machine Learning
ABSTRACT
MOTIVATION: Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. RESULTS: We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, distribution of cancer cells among clones, and mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and design of treatment strategies. AVAILABILITY AND IMPLEMENTATION: Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell , Polymorphism, Single Nucleotide , Whole Genome Sequencing , Algorithms , DNA Copy Number Variations , Female , High-Throughput Nucleotide Sequencing , Humans , Male
ABSTRACT
SUMMARY: Karyotype data are the most common form of genetic data that is regularly used clinically. They are collected as part of the standard of care in many diseases, particularly in pediatric and cancer medicine contexts. Karyotypes are represented in a unique text-based format, with a syntax defined by the International System for Human Cytogenetic Nomenclature (ISCN). While human-readable, ISCN is not intrinsically machine-readable. This limitation has prevented the full use of complex karyotype data in discovery science use cases. To enhance the utility and value of karyotype data, we developed a tool named CytoGPS. CytoGPS first parses ISCN karyotypes into a machine-readable format. It then converts the ISCN karyotype into a binary Loss-Gain-Fusion (LGF) model, which represents all cytogenetic abnormalities as combinations of loss, gain, or fusion events, in a format that is analyzable using modern computational methods. Such data are then made available for comprehensive 'downstream' analyses that previously were not feasible. AVAILABILITY AND IMPLEMENTATION: Freely available at http://cytogps.org.
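The idea of a binary Loss-Gain-Fusion representation can be sketched as three 0/1 vectors indexed by chromosome band. This is only an illustration of the data structure, not the CytoGPS parser or its output format: the four-band list, the (event type, band) input tuples, and the function name are all hypothetical, and mapping a translocation to fusion events at its two breakpoint bands is an assumption made for the example.

```python
# Hypothetical, very short band list; a real analysis would use a full cytoband set.
BANDS = ["1p36", "1q21", "9q34", "22q11"]
EVENT_TYPES = ("loss", "gain", "fusion")


def encode_lgf(events, bands=BANDS):
    """Encode (event_type, band) pairs as binary loss, gain and fusion vectors.

    Returns a dict mapping each event type to a list of 0/1 flags, one per band,
    where 1 means that event type was recorded at that band.
    """
    index = {band: i for i, band in enumerate(bands)}
    vectors = {etype: [0] * len(bands) for etype in EVENT_TYPES}
    for etype, band in events:
        if etype not in vectors:
            raise ValueError(f"unknown event type: {etype}")
        if band not in index:
            raise ValueError(f"unknown band: {band}")
        vectors[etype][index[band]] = 1
    return vectors
```

Under these assumptions, a karyotype with a 9q34/22q11 translocation and a gain at 1q21 would be passed in as [("fusion", "9q34"), ("fusion", "22q11"), ("gain", "1q21")], and the resulting vectors can be stacked across samples into a binary matrix for downstream analysis.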
Subject(s)
Chromosome Aberrations , Karyotype , Humans , Karyotyping , Neoplasms , Software
ABSTRACT
BACKGROUND: Fludarabine, cyclophosphamide, and rituximab (FCR) has become a gold-standard chemoimmunotherapy regimen for patients with chronic lymphocytic leukaemia. However, the question remains of how to treat treatment-naive patients with IGHV-unmutated chronic lymphocytic leukaemia. We therefore aimed to develop and validate a gene expression signature to identify which of these patients are likely to achieve durable remissions with FCR chemoimmunotherapy. METHODS: We did a retrospective cohort study in two cohorts of treatment-naive patients (aged ≥18 years) with chronic lymphocytic leukaemia. The discovery and training cohort consisted of peripheral blood samples collected from patients treated at the University of Texas MD Anderson Cancer Center (Houston, TX, USA), who fulfilled the diagnostic criteria of the International Workshop on Chronic Lymphocytic Leukemia, had received at least three cycles of FCR chemoimmunotherapy, and had been treated between Oct 10, 2000, and Oct 26, 2006 (ie, the MDACC cohort). We did transcriptional profiling on samples obtained from the MDACC cohort to identify genes associated with time to progression. We did univariate Cox proportional hazards analyses and used significant genes to cluster IGHV-unmutated samples into two groups (intermediate prognosis and unfavourable prognosis). After using cross-validation to assess robustness, we applied the Lasso method to standardise the gene expression values to find a minimum gene signature. We validated this signature in an external cohort of treatment-naive patients with IGHV-unmutated chronic lymphocytic leukaemia enrolled on the CLL8 trial of the German Chronic Lymphocytic Leukaemia Study Group who were treated between July 21, 2003, and April 4, 2006 (ie, the CLL8 cohort). FINDINGS: The MDACC cohort consisted of 101 patients and the CLL8 cohort consisted of 109 patients. 
Using the MDACC cohort, we identified and developed a 17-gene expression signature that distinguished IGHV-unmutated patients who were likely to achieve a long-term remission following front-line FCR chemoimmunotherapy from those who might benefit from alternative front-line regimens (hazard ratio 3·83, 95% CI 1·94-7·59; p<0·0001). We validated this gene signature in the CLL8 cohort; patients with an unfavourable prognosis versus those with an intermediate prognosis had a cause-specific hazard ratio of 1·90 (95% CI 1·18-3·06; p=0·008). Median time to progression was 39 months (IQR 22-69) for those with an unfavourable prognosis compared with 59 months (28-84) for those with an intermediate prognosis. INTERPRETATION: We have developed a robust, reproducible 17-gene signature that identifies a subset of treatment-naive patients with IGHV-unmutated chronic lymphocytic leukaemia who might substantially benefit from treatment with FCR chemoimmunotherapy. We recommend testing the value of this gene signature in a prospective study that compares FCR treatment with newer alternative therapies as part of a randomised clinical trial. FUNDING: Chronic Lymphocytic Leukaemia Global Research Foundation and the National Institutes of Health/National Cancer Institute.
Subject(s)
Antineoplastic Agents, Immunological/administration & dosage , Antineoplastic Combined Chemotherapy Protocols/administration & dosage , Cyclophosphamide/administration & dosage , Gene Expression Profiling , Leukemia, Lymphocytic, Chronic, B-Cell/drug therapy , Rituximab/administration & dosage , Transcriptome , Vidarabine/analogs & derivatives , Aged , Antineoplastic Agents, Immunological/adverse effects , Antineoplastic Combined Chemotherapy Protocols/adverse effects , Cyclophosphamide/adverse effects , Disease Progression , Female , Germany , Humans , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Leukemia, Lymphocytic, Chronic, B-Cell/immunology , Leukemia, Lymphocytic, Chronic, B-Cell/pathology , Male , Middle Aged , Predictive Value of Tests , Remission Induction , Risk Assessment , Risk Factors , Rituximab/adverse effects , Texas , Time Factors , Treatment Outcome , Vidarabine/administration & dosage , Vidarabine/adverse effects
ABSTRACT
Posttranslational histone tail modifications are known to play a role in leukemogenesis and are therapeutic targets. The levels and patterns of expression of multiple histone-modifying proteins (HMP) in acute myeloid leukemia (AML), and the effect of different expression patterns on outcome and prognosis, have not been analyzed globally in AML patients. Here we analyzed 20 HMP by reverse phase protein array (RPPA) in a cohort of 205 newly diagnosed AML patients. Protein levels were correlated with patient and disease characteristics, including survival and mutational state. We identified different protein clusters characterized by higher (more on) or lower (more off) expression of HMP, relative to normal CD34+ cells. A more on state of HMP was associated with poorer outcome compared to the normal-like and more off states. FLT3-mutated AML patients were significantly overrepresented in the more on state. DNA methylation-related mutations showed no correlation with the different HMP states. In this study, we demonstrate for the first time that HMP form recurrent patterns of expression and that these significantly correlate with survival in newly diagnosed AML patients.
Subject(s)
Gene Expression Regulation, Leukemic , Histone Code , Leukemia, Myeloid, Acute/genetics , Adult , Aged , DNA Methylation , Female , Humans , Leukemia, Myeloid, Acute/diagnosis , Leukemia, Myeloid, Acute/metabolism , Male , Middle Aged , Prognosis , Protein Array Analysis , Protein Interaction Maps , Survival Analysis
ABSTRACT
BACKGROUND: Cluster analysis is the most common unsupervised method for finding hidden groups in data. Clustering presents two main challenges: (1) finding the optimal number of clusters, and (2) removing "outliers" among the objects being clustered. Few clustering algorithms currently deal directly with the outlier problem. Furthermore, existing methods for identifying the number of clusters still have some drawbacks. Thus, there is a need for a better algorithm to tackle both challenges. RESULTS: We present a new approach, implemented in an R package called Thresher, to cluster objects in general datasets. Thresher combines ideas from principal component analysis, outlier filtering, and von Mises-Fisher mixture models in order to select the optimal number of clusters. We performed a large Monte Carlo simulation study to compare Thresher with other methods for detecting outliers and determining the number of clusters. We found that Thresher had good sensitivity and specificity for detecting and removing outliers. We also found that Thresher is the best method for estimating the optimal number of clusters when the number of objects being clustered is smaller than the number of variables used for clustering. Finally, we applied Thresher and eleven other methods to 25 sets of breast cancer data downloaded from the Gene Expression Omnibus; only Thresher consistently estimated the number of clusters to lie in the range of 4-7 that is consistent with the literature. CONCLUSIONS: Thresher is effective at automatically detecting and removing outliers. By thus cleaning the data, it produces better estimates of the optimal number of clusters when there are more variables than objects. When we applied Thresher to a variety of breast cancer datasets, it produced estimates that were both self-consistent and consistent with the literature. We expect Thresher to be useful for studying a wide variety of biological datasets.
Subject(s)
Cluster Analysis , Algorithms , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Female , Humans , Monte Carlo Method , Principal Component Analysis
ABSTRACT
BACKGROUND: Integration of transcriptomic and metabolomic data improves functional interpretation of disease-related metabolomic phenotypes, and facilitates discovery of putative metabolite biomarkers and gene targets. For this reason, these data are increasingly collected in large (>100 participants) cohorts, thereby driving a need for the development of user-friendly and open-source methods/tools for their integration. Of note, clinical/translational studies typically provide snapshot (e.g. one time point) gene and metabolite profiles and, oftentimes, most metabolites measured are not identified. Thus, in these types of studies, pathway/network approaches that take into account the complexity of transcript-metabolite relationships may neither be applicable nor readily uncover novel relationships. With this in mind, we propose a simple linear modeling approach to capture disease- (or other phenotype-) specific gene-metabolite associations, with the assumption that co-regulation patterns reflect functionally related genes and metabolites. RESULTS: The proposed linear model, metabolite ~ gene + phenotype + gene:phenotype, specifically evaluates whether gene-metabolite relationships differ by phenotype, by testing whether the relationship in one phenotype is significantly different from the relationship in another phenotype (via a statistical interaction gene:phenotype p-value). Statistical interaction p-values for all possible gene-metabolite pairs are computed and significant pairs are then clustered by the directionality of associations (e.g. strong positive association in one phenotype, strong negative association in another phenotype). We implemented our approach as an R package, IntLIM, which includes a user-friendly R Shiny web interface, thereby making the integrative analyses accessible to non-computational experts.
We applied IntLIM to two previously published datasets, collected in the NCI-60 cancer cell lines and in human breast tumor and non-tumor tissue, for which transcriptomic and metabolomic data are available. We demonstrate that IntLIM captures relevant tumor-specific gene-metabolite associations involved in known cancer-related pathways, including glutamine metabolism. Using IntLIM, we also uncover biologically relevant novel relationships that could be further tested experimentally. CONCLUSIONS: IntLIM provides a user-friendly, reproducible framework for integrating transcriptomic and metabolomic data, helping to interpret metabolomic data and to uncover novel gene-metabolite relationships. The IntLIM R package is publicly available on GitHub (https://github.com/mathelab/IntLIM) and includes a user-friendly web application, vignettes, sample data and data/code to reproduce results.
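
The interaction model named in the RESULTS above, metabolite ~ gene + phenotype + gene:phenotype, can be illustrated for a single gene-metabolite pair. What follows is a minimal sketch in Python, not the IntLIM implementation (which is an R package). It assumes a two-level phenotype, in which case the gene:phenotype coefficient equals the difference between the per-phenotype regression slopes. The function names and the toy values are hypothetical. The sketch stops at the t statistic; turning it into a p-value would need a t-distribution CDF from a statistics library.

```python
from statistics import mean


def fit_group(x, y):
    """Simple least-squares fit of y on x within one phenotype group.

    Returns (slope, residual sum of squares, sum of squares of x).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, sse, sxx


def interaction_test(gene, metabolite, phenotype):
    """Estimate the gene:phenotype term for a binary phenotype.

    With two phenotype levels, the full interaction model is equivalent
    to one regression line per group with a pooled residual variance.
    Returns (interaction coefficient, t statistic, residual df).
    """
    levels = sorted(set(phenotype))
    if len(levels) != 2:
        raise ValueError("this sketch assumes exactly two phenotype levels")
    fits = []
    for level in levels:
        x = [g for g, p in zip(gene, phenotype) if p == level]
        y = [m for m, p in zip(metabolite, phenotype) if p == level]
        fits.append(fit_group(x, y))
    (b0, sse0, sxx0), (b1, sse1, sxx1) = fits
    df = len(gene) - 4  # four parameters: two intercepts, two slopes
    sigma2 = (sse0 + sse1) / df  # pooled residual variance
    se = (sigma2 * (1 / sxx0 + 1 / sxx1)) ** 0.5
    coef = b1 - b0  # slope in second level minus slope in first level
    return coef, coef / se, df


# Hypothetical toy data: positive association in "normal", negative in "tumor".
gene = [1, 2, 3, 4, 1, 2, 3, 4]
metabolite = [1.1, 1.9, 3.2, 3.8, 4.1, 2.9, 2.1, 0.9]
phenotype = ["normal"] * 4 + ["tumor"] * 4

coef, t_stat, df = interaction_test(gene, metabolite, phenotype)
print(f"interaction coefficient={coef:.2f}, t={t_stat:.1f}, df={df}")
```

A large-magnitude t statistic with opposite-signed slopes corresponds to the kind of pair the abstract describes: a positive association in one phenotype and a negative association in the other.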