The shape of gene expression distributions matter: how incorporating distribution shape improves the interpretation of cancer transcriptomic data.

de Torrenté, Laurence; Zimmerman, Samuel; Suzuki, Masako; Christopeit, Maximilian; Greally, John M; Mar, Jessica C.

BMC Bioinformatics ; 21(Suppl 21): 562, 2020 Dec 28.

Article in English | MEDLINE | ID: mdl-33371881

ABSTRACT

BACKGROUND: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption, or consider how broadly applicable it may be to all genes that are in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from the Cancer Genome Atlas (TCGA).

RESULTS: Surprisingly, the expression of less than 50% of all genes was Normally-distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. Different assumptions based on the shape of the expression profile were used to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes that were identified when the shape of the distribution was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. We showed that when multiple types of distributions were permitted, i.e. the shape of the expression profile was used, the statistical classifiers had greater predictive accuracy for determining the prognosis of a patient versus those that assumed only one type of gene expression distribution.

CONCLUSIONS: Our results highlight the value of studying a gene's distribution shape to model heterogeneity of transcriptomic data and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked when using standard approaches that assume all genes follow the same type of distribution in a patient cohort.
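The idea of assigning a gene to a distribution category can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes only two candidate shapes (Normal and Lognormal), fits each by maximum likelihood, and picks the one with the higher log-likelihood. The function names and the synthetic samples are hypothetical.

```python
import math
from statistics import NormalDist

def normal_loglik(xs):
    """Maximized log-likelihood of xs under a Normal distribution (MLE mean and variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def lognormal_loglik(xs):
    """Maximized log-likelihood of xs under a Lognormal distribution; all xs must be > 0."""
    logs = [math.log(x) for x in xs]
    # Change of variables: density of X = exp(Y) is f_Y(log x) / x.
    return normal_loglik(logs) - sum(logs)

def best_shape(xs):
    """Return the candidate shape with the higher maximized log-likelihood.

    Both candidates have two parameters, so comparing raw log-likelihoods
    is equivalent to comparing AIC or BIC in this restricted setting.
    """
    scores = {"Normal": normal_loglik(xs), "Lognormal": lognormal_loglik(xs)}
    return max(scores, key=scores.get)

# Synthetic, idealized samples built from evenly spaced quantiles (not real expression data).
z = [NormalDist().inv_cdf((i + 0.5) / 400) for i in range(400)]
normal_like = [10 + 2 * v for v in z]
lognormal_like = [math.exp(v) for v in z]
```

A fuller treatment would add the other categories named in the abstract (Gamma, Bimodal, Cauchy), which need numerical fitting and a penalty for differing parameter counts.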

Subject(s)

Data Interpretation, Statistical , Gene Expression Profiling , Neoplasms/genetics , Biomarkers, Tumor/genetics , Genomics , Humans , Male , Middle Aged , Neoplasms/diagnosis , Prognosis

A comparison of survival analysis methods for cancer gene expression RNA-Sequencing data.

Raman, Pichai; Zimmerman, Samuel; Rathi, Komal S; de Torrenté, Laurence; Sarmady, Mahdi; Wu, Chao; Leipzig, Jeremy; Taylor, Deanne M; Tozeren, Aydin; Mar, Jessica C.

Cancer Genet ; 235-236: 1-12, 2019 06.

Article in English | MEDLINE | ID: mdl-31296308

ABSTRACT

Identifying genetic biomarkers of patient survival remains a major goal of large-scale cancer profiling studies. Using gene expression data to predict the outcome of a patient's tumor makes biomarker discovery a compelling tool for improving patient care. As genomic technologies expand, multiple data types may serve as informative biomarkers, and bioinformatic strategies have evolved around these different applications. For categorical variables such as a gene's mutation status, biomarker identification to predict survival time is straightforward. However, for continuous variables like gene expression, the available methods generate highly-variable results, and studies on best practices are lacking. We investigated the performance of eight methods that deal specifically with continuous data. K-means, Cox regression, concordance index, D-index, 25th-75th percentile split, median-split, distribution-based splitting, and KaplanScan were applied to four RNA-sequencing (RNA-seq) datasets from the Cancer Genome Atlas. The reliability of the eight methods was assessed by splitting each dataset into two groups and comparing the overlap of the results. Gene sets that had been identified from the literature for a specific tumor type served as positive controls to assess the accuracy of each biomarker using receiver operating characteristic (ROC) curves. Artificial RNA-Seq data were generated to test the robustness of these methods under fixed levels of gene expression noise. Our results show that methods based on dichotomizing tend to have consistently poor performance while C-index, D-index, and k-means perform well in most settings. Overall, the Cox regression method had the strongest performance based on tests of accuracy, reliability, and robustness.
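Two of the approaches named above, the concordance index and median-split dichotomization, can be sketched in a few lines. This is an illustrative implementation under simple assumptions (Harrell's definition of the C-index, ties in score counted as half-concordant, ties in time ignored), not the code evaluated in the study; the function names are hypothetical.

```python
import statistics

def concordance_index(times, events, scores):
    """Harrell's C-index.

    times:  follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    scores: predicted risk, where a higher score means a worse expected outcome
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def median_split(expression):
    """Dichotomize expression: 1 if above the cohort median, otherwise 0."""
    cutoff = statistics.median(expression)
    return [1 if x > cutoff else 0 for x in expression]
```

The contrast is that the C-index uses the full ordering of a continuous expression value, while the median split discards everything except which side of the cutoff a patient falls on.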

Subject(s)

Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic/genetics , Neoplasms/genetics , Neoplasms/mortality , Base Sequence , Biomarkers, Tumor/genetics , Data Interpretation, Statistical , Humans , Kaplan-Meier Estimate , Prognosis , Proportional Hazards Models , ROC Curve , Sequence Analysis, RNA/methods , Survival Analysis

pathVar: a new method for pathway-based interpretation of gene expression variability.

de Torrente, Laurence; Zimmerman, Samuel; Taylor, Deanne; Hasegawa, Yu; Wells, Christine A; Mar, Jessica C.

PeerJ ; 5: e3334, 2017.

Article in English | MEDLINE | ID: mdl-28560097

ABSTRACT

Identifying the pathways that control a cellular phenotype is the first step to building a mechanistic model. Recent examples in developmental biology, cancer genomics, and neurological disease have demonstrated how changes in the variability of gene expression can highlight important genes that are under different degrees of regulatory control. Simple statistical tests exist to identify differentially-variable genes; however, methods for investigating changes in gene expression variability in the context of pathways and gene sets are under-explored. Here we present pathVar, a new method that provides functional interpretation of gene expression variability changes at the level of pathways and gene sets. pathVar is based on a multinomial exact test, or an asymptotic Chi-squared test as a more computationally-efficient alternative. The method can be used for gene expression studies from any technology platform in all biological settings, either with a single phenotypic group or with two-group comparisons. To demonstrate its utility, we applied the method to a diverse set of diseases, species and samples. Results from pathVar are benchmarked against analyses based on average expression and two methods of GSEA, and demonstrate that analyses using both statistics are useful for understanding transcriptional regulation. We also provide recommendations for the choice of variability statistic that have been informed through analyses on simulations and real data. Based on the datasets selected, we show how pathVar can be used to gain insight into expression variability of single cell versus bulk samples, different stem cell populations, and cancer versus normal tissue comparisons.
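The asymptotic Chi-squared alternative mentioned above can be illustrated with a short sketch. It is not the pathVar package's API (pathVar is distributed separately and its interface is not described here). The sketch assumes that genes have already been binned into variability categories, and it compares a pathway's category counts with reference proportions using Pearson's statistic. All names are hypothetical.

```python
def chi_squared_statistic(observed_counts, reference_props):
    """Pearson chi-squared statistic for one pathway.

    observed_counts: number of pathway genes in each variability category
    reference_props: proportion of all genes in each category (must sum to 1)
    Returns (statistic, degrees_of_freedom).
    """
    if len(observed_counts) != len(reference_props):
        raise ValueError("counts and proportions must have the same length")
    total = sum(observed_counts)
    stat = 0.0
    for observed, prop in zip(observed_counts, reference_props):
        expected = total * prop
        stat += (observed - expected) ** 2 / expected
    return stat, len(observed_counts) - 1

# Hypothetical pathway: 40 genes across low / medium / high variability bins,
# against a reference in which half of all genes are low-variability.
stat, dof = chi_squared_statistic([30, 5, 5], [0.5, 0.25, 0.25])
```

A large statistic relative to a chi-squared distribution with the returned degrees of freedom suggests the pathway's variability profile differs from the reference; when expected counts are small, an exact multinomial test is the safer choice.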

Variability of Gene Expression Identifies Transcriptional Regulators of Early Human Embryonic Development.

Hasegawa, Yu; Taylor, Deanne; Ovchinnikov, Dmitry A; Wolvetang, Ernst J; de Torrenté, Laurence; Mar, Jessica C.

PLoS Genet ; 11(8): e1005428, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26288249

ABSTRACT

An analysis of gene expression variability can provide an insightful window into how regulatory control is distributed across the transcriptome. In a single cell analysis, the inter-cellular variability of gene expression measures the consistency of transcript copy numbers observed between cells in the same population. Application of these ideas to the study of early human embryonic development may reveal important insights into the transcriptional programs controlling this process, based on which components are most tightly regulated. Using a published single cell RNA-seq data set of human embryos collected at four-cell, eight-cell, morula and blastocyst stages, we identified genes with the most stable, invariant expression across all four developmental stages. Stably-expressed genes were found to be enriched for those sharing indispensable features, including essentiality, haploinsufficiency, and ubiquitous expression. The stable genes were less likely to be associated with loss-of-function variant genes or human recessive disease genes affected by a DNA copy number variant deletion, suggesting that stable genes have a functional impact on the regulation of some of the basic cellular processes. Genes with low expression variability at early stages of development are involved in regulation of DNA methylation, responses to hypoxia and telomerase activity, whereas by the blastocyst stage, low-variability genes are enriched for metabolic processes as well as telomerase signaling. Based on changes in expression variability, we identified a putative set of gene expression markers of morulae and blastocyst stages. Experimental validation of a blastocyst-expressed variability marker demonstrated that HDDC2 plays a role in the maintenance of pluripotency in human ES and iPS cells. 
Collectively our analyses identified new regulators involved in human embryonic development that would have otherwise been missed using methods that focus on assessment of the average expression levels; in doing so, we highlight the value of studying expression variability for single cell RNA-seq data.
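Ranking genes by inter-cellular variability can be sketched as follows. The abstract does not state which variability statistic was used, so the coefficient of variation here is an assumption, and the gene labels and values are invented for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; infinite when the mean is zero."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")
    return statistics.pstdev(values) / mean

def rank_by_stability(expression_by_gene):
    """Return gene names ordered from most stable (lowest CV) to most variable."""
    return sorted(
        expression_by_gene,
        key=lambda gene: coefficient_of_variation(expression_by_gene[gene]),
    )

# Hypothetical per-cell expression values for three genes.
cells = {
    "gene_a": [10, 10, 10],
    "gene_b": [1, 10, 19],
    "gene_c": [9, 10, 11],
}
ranking = rank_by_stability(cells)
```

All three hypothetical genes share the same mean expression, so an analysis of averages alone would not separate them, while the variability ranking does.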

Subject(s)

Gene Expression Regulation, Developmental , Cells, Cultured , Embryonic Development , Humans , Transcriptome
