
1.

Molecular subtypes of high-grade serous ovarian cancer across racial groups and gene expression platforms.

Davidson, Natalie R; Barnard, Mollie E; Hippen, Ariel A; Campbell, Amy; Johnson, Courtney E; Way, Gregory P; Dalley, Brian K; Berchuck, Andrew; Salas, Lucas A; Peres, Lauren C; Marks, Jeffrey R; Schildkraut, Joellen M; Greene, Casey S; Doherty, Jennifer A.

Cancer Epidemiol Biomarkers Prev ; 2024 May 23.

Article En | MEDLINE | ID: mdl-38780898

BACKGROUND: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by self-identified race may contribute to poorer HGSC survival among Black versus White individuals. METHODS: We included newly generated RNA-Seq data from Black and White individuals, and array-based genotyping data from four existing studies of White and Japanese individuals. We used K-means clustering, a method with no predefined number of clusters or dataset-specific features, to assign subtypes. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. Following mapping to The Cancer Genome Atlas (TCGA)-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. RESULTS: Cluster-specific gene expression was similar across gene expression platforms and racial groups. Comparing the Black population to the White and Japanese populations, the immunoreactive subtype was more common (39% versus 23%-28%) and the differentiated subtype less common (7% versus 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA-Seq data; compared to mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases (Black population HR=0.79 [0.55, 1.13], White population HR=0.86 [0.62, 1.19]). CONCLUSIONS: While the prevalence of HGSC subtypes varied by race, subtype-specific survival was similar. IMPACT: HGSC subtypes can be consistently assigned across platforms and self-identified racial groups.
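The cross-dataset comparison described in this abstract (correlating per-cluster vectors of moderated t-scores) can be illustrated with a minimal Python sketch. The cluster names, the score values, and the helper names `pearson` and `match_clusters` are hypothetical and are not taken from the study; the sketch only shows the general idea of pairing clusters by their most correlated counterpart.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def match_clusters(scores_a, scores_b):
    """For each cluster in dataset A, pick the most correlated cluster in dataset B."""
    return {
        name_a: max(scores_b, key=lambda name_b: pearson(vec_a, scores_b[name_b]))
        for name_a, vec_a in scores_a.items()
    }


# Hypothetical per-gene summary scores (same gene order in both datasets).
rnaseq_scores = {
    "c1": [2.1, -0.5, 1.8, -1.2, 0.3],
    "c2": [-1.5, 2.2, -0.7, 0.9, -0.4],
}
array_scores = {
    "k1": [-1.2, 1.9, -0.9, 1.1, -0.2],
    "k2": [1.7, -0.8, 2.0, -0.9, 0.1],
}

matches = match_clusters(rnaseq_scores, array_scores)
for cluster, partner in matches.items():
    r = pearson(rnaseq_scores[cluster], array_scores[partner])
    print(f"{cluster} best matches {partner} (r = {r:.2f})")
```

A high correlation between matched clusters is what one would expect if the same expression pattern is being recovered on both platforms.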

2.

Molecular subtypes of high-grade serous ovarian cancer across racial groups and gene expression platforms.

Davidson, Natalie R; Barnard, Mollie E; Hippen, Ariel A; Campbell, Amy; Johnson, Courtney E; Way, Gregory P; Dalley, Brian K; Berchuck, Andrew; Salas, Lucas A; Peres, Lauren C; Marks, Jeffrey R; Schildkraut, Joellen M; Greene, Casey S; Doherty, Jennifer A.

bioRxiv ; 2023 Dec 02.

Article En | MEDLINE | ID: mdl-37961178

Introduction: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by race may contribute to poorer HGSC survival among Black versus non-Hispanic White individuals. Methods: We included newly generated RNA-Seq data from Black and White individuals, and array-based genotyping data from four existing studies of White and Japanese individuals. We assigned subtypes using K-means clustering. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. Following mapping to The Cancer Genome Atlas (TCGA)-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. Results: Cluster-specific gene expression was similar across gene expression platforms. Comparing the Black study population to the White and Japanese study populations, the immunoreactive subtype was more common (39% versus 23%-28%) and the differentiated subtype less common (7% versus 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA-Seq data; compared to mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases (Black population HR=0.79 [0.55, 1.13], White population HR=0.86 [0.62, 1.19]). Conclusions: A single, platform-agnostic pipeline can be used to assign HGSC gene expression subtypes. While the observed prevalence of HGSC subtypes varied by race, subtype-specific survival was similar.

3.

Performance of computational algorithms to deconvolve heterogeneous bulk ovarian tumor tissue depends on experimental factors.

Hippen, Ariel A; Omran, Dalia K; Weber, Lukas M; Jung, Euihye; Drapkin, Ronny; Doherty, Jennifer A; Hicks, Stephanie C; Greene, Casey S.

Genome Biol ; 24(1): 239, 2023 10 20.

Article En | MEDLINE | ID: mdl-37864274

BACKGROUND: Single-cell gene expression profiling provides unique opportunities to understand tumor heterogeneity and the tumor microenvironment. Because of cost and feasibility, profiling bulk tumors remains the primary population-scale analytical strategy. Many algorithms can deconvolve these tumors using single-cell profiles to infer their composition. While experimental choices do not change the true underlying composition of the tumor, they can affect the measurements produced by the assay. RESULTS: We generated a dataset of high-grade serous ovarian tumors with paired expression profiles produced using multiple strategies, in order to examine the extent to which experimental factors impact the results of downstream tumor deconvolution methods. We find that pooling samples for single-cell sequencing and subsequent demultiplexing has a minimal effect. We identify dissociation-induced differences that affect cell composition, leading to changes that may compromise the assumptions underlying some deconvolution algorithms. We also observe differences across mRNA enrichment methods that introduce additional discrepancies between the two data types. We also find that experimental factors change cell composition estimates and that the impact differs by method. CONCLUSIONS: Previous benchmarks of deconvolution methods have largely ignored experimental factors. We find that methods vary in their robustness to experimental factors. We provide recommendations for methods developers seeking to produce the next generation of deconvolution approaches and for scientists designing experiments using deconvolution to study tumor heterogeneity.

Gene Expression Profiling , Ovarian Neoplasms , Humans , Female , Gene Expression Profiling/methods , Algorithms , Sequence Analysis, RNA/methods , Ovarian Neoplasms/genetics , Transcriptome , Tumor Microenvironment

4.

wenda_gpu: fast domain adaptation for genomic data.

Hippen, Ariel A; Crawford, Jake; Gardner, Jacob R; Greene, Casey S.

Bioinformatics ; 38(22): 5129-5130, 2022 11 15.

Article En | MEDLINE | ID: mdl-36193991

MOTIVATION: Domain adaptation allows for the development of predictive models even in cases with limited sample data. Weighted elastic net domain adaptation specifically leverages features of genomic data to maximize transferability, but the method is too computationally demanding to apply to many genome-sized datasets. RESULTS: We developed wenda_gpu, which uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. We show that wenda_gpu returns comparable results to the original wenda implementation, and that it predicts cancer mutation status on small sample sizes better than regular elastic net. AVAILABILITY AND IMPLEMENTATION: wenda_gpu is available on GitHub at https://github.com/greenelab/wenda_gpu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Neoplasms , Software , Humans , Genomics/methods , Neoplasms/genetics , Sample Size

5.

Computational audits combat disparities in recognition.

Hippen, Ariel A; Davidson, Natalie R; Greene, Casey S.

Nat Hum Behav ; 6(4): 473-474, 2022 04.

Article En | MEDLINE | ID: mdl-35039653

Recognition, Psychology , Humans

6.

Genetic demultiplexing of pooled single-cell RNA-sequencing samples in cancer facilitates effective experimental design.

Weber, Lukas M; Hippen, Ariel A; Hickey, Peter F; Berrett, Kristofer C; Gertz, Jason; Doherty, Jennifer Anne; Greene, Casey S; Hicks, Stephanie C.

Gigascience ; 10(9), 2021 09 22.

Article En | MEDLINE | ID: mdl-34553212

BACKGROUND: Pooling cells from multiple biological samples prior to library preparation within the same single-cell RNA sequencing experiment provides several advantages, including lower library preparation costs and reduced unwanted technological variation, such as batch effects. Computational demultiplexing tools based on natural genetic variation between individuals provide a simple approach to demultiplex samples, which does not require complex additional experimental procedures. However, to our knowledge these tools have not been evaluated in cancer, where somatic variants, which could differ between cells from the same sample, may obscure the signal in natural genetic variation. RESULTS: Here, we performed in silico benchmark evaluations by combining raw sequencing reads from multiple single-cell samples in high-grade serous ovarian cancer, which has a high copy number burden, and lung adenocarcinoma, which has a high tumor mutational burden. Our results confirm that genetic demultiplexing tools can be effectively deployed on cancer tissue using a pooled experimental design, although high proportions of ambient RNA from cell debris reduce performance. CONCLUSIONS: This strategy provides significant cost savings through pooled library preparation. To facilitate similar analyses at the experimental design phase, we provide freely accessible code and a reproducible Snakemake workflow built around the best-performing tools found in our in silico benchmark evaluations, available at https://github.com/lmweber/snp-dmx-cancer.

Neoplasms , Research Design , Gene Library , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/genetics , RNA , Software

7.

Analysis of scientific society honors reveals disparities.

Le, Trang T; Himmelstein, Daniel S; Hippen, Ariel A; Gazzara, Matthew R; Greene, Casey S.

Cell Syst ; 12(9): 900-906.e5, 2021 09 22.

Article En | MEDLINE | ID: mdl-34555325

Delivering a keynote talk at a conference organized by a scientific society or being named as a fellow by such a society indicates that a scientist is held in high regard by their colleagues. To explore if the distribution of such indicators of esteem in the field of bioinformatics reflects the composition of this field, we compared the gender, name origin, and country of affiliation of 412 honorees from the "International Society for Computational Biology" (75 fellows and 337 keynote speakers) with over 170,000 last authorships on computational biology papers between 1993 and 2019. The proportion of honors bestowed on women was similar to that of the field's overall last authorship rate. However, names of East Asian origin have been persistently underrepresented among honorees. Moreover, there were roughly twice as many honors bestowed on scientists with an affiliation in the United States as expected based on literature authorship. A record of this paper's transparent peer review process is included in the supplemental information.

Computational Biology , Societies, Scientific , Female , Humans , United States

8.

miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data.

Hippen, Ariel A; Falco, Matias M; Weber, Lukas M; Erkan, Erdogan Pekcan; Zhang, Kaiyang; Doherty, Jennifer Anne; Vähärautio, Anna; Greene, Casey S; Hicks, Stephanie C.

PLoS Comput Biol ; 17(8): e1009290, 2021 08.

Article En | MEDLINE | ID: mdl-34428202

Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a 'low-quality' cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA (mtDNA) encoded genes and (ii) if a small number of genes are detected. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on certain types of tissues, such as archived tumor tissues, or tissues associated with mitochondrial function, such as kidney tissue [1]. We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Our software package is available at https://bioconductor.org/packages/miQC.
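The idea of replacing a fixed threshold with a posterior probability from a mixture model can be sketched in Python as follows. This is a deliberately simplified stand-in, not the miQC implementation: it fits a two-component Gaussian mixture to the mitochondrial fraction alone (the abstract describes jointly modeling it with the number of detected genes), the data are simulated, and the function names and the 0.75 cutoff are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_compromised(x, weights, means, sds):
    """Posterior probability that x belongs to the second (high-mean) component."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return dens[1] / total if total > 0 else 0.5


def fit_mixture(values, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture, returned sorted by mean."""
    n = len(values)
    overall_mean = sum(values) / n
    spread = math.sqrt(sum((v - overall_mean) ** 2 for v in values) / n)
    weights, means, sds = [0.5, 0.5], [min(values), max(values)], [spread, spread]
    for _ in range(n_iter):
        # E-step: responsibility of the second component for each value.
        resp_hi = [posterior_compromised(v, weights, means, sds) for v in values]
        resp = ([1.0 - r for r in resp_hi], resp_hi)
        # M-step: update weight, mean, and standard deviation of each component.
        for k in range(2):
            total = sum(resp[k])
            weights[k] = total / n
            means[k] = sum(r * v for r, v in zip(resp[k], values)) / total
            var = sum(r * (v - means[k]) ** 2 for r, v in zip(resp[k], values)) / total
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    order = sorted(range(2), key=lambda k: means[k])
    return ([weights[k] for k in order], [means[k] for k in order],
            [sds[k] for k in order])


# Simulated mitochondrial fractions: 400 intact cells, 100 compromised cells.
random.seed(0)
mito = [min(max(random.gauss(0.05, 0.02), 0.0), 1.0) for _ in range(400)]
mito += [min(max(random.gauss(0.40, 0.10), 0.0), 1.0) for _ in range(100)]

weights, means, sds = fit_mixture(mito)
cutoff = 0.75  # assumed posterior cutoff for this illustration
kept = [v for v in mito if posterior_compromised(v, weights, means, sds) <= cutoff]
print(f"component means: {means[0]:.3f}, {means[1]:.3f}; kept {len(kept)} of {len(mito)} cells")
```

Because the cutoff is applied to a posterior probability rather than to the raw mitochondrial fraction, the effective threshold shifts with the distribution observed in each dataset.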

High-Throughput Nucleotide Sequencing/methods , Probability , Quality Control , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , DNA, Mitochondrial/genetics , Humans

9.

Expanding and Remixing the Metadata Landscape.

Hippen, Ariel A; Greene, Casey S.

Trends Cancer ; 7(4): 276-278, 2021 04.

Article En | MEDLINE | ID: mdl-33229213

Genomic data sharing accelerates research. Data are most valuable when they are accompanied by detailed metadata. To date, metadata are often human-annotated descriptions of samples and their handling. We discuss how machine learning-derived elements complement such descriptions to enhance the research ecosystem around genomic data.

Genomics , Metadata , Humans , Machine Learning , Neoplasms/genetics

10.

Missing something? Codon aversion as a new character system in phylogenetics.

Miller, Justin B; Hippen, Ariel A; Belyeu, Jonathon R; Whiting, Michael F; Ridge, Perry G.

Cladistics ; 33(5): 545-556, 2017 Oct.

Article En | MEDLINE | ID: mdl-34706488

Although many studies have documented codon usage bias in different species, the importance of codon usage in a phylogenetic framework remains largely unknown. We demonstrate that a phylogenetic signal is present in the codon usage and non-usage biases of 17 717 orthologues evaluated across 72 tetrapod species using a simple parsimony analysis of a binary matrix of codon characters. Phylogenies estimated using stop codons were more congruent with previous hypotheses than phylogenies based on any other single codon or a combination of codons. Although each codon is present in every species, specific genes have different codon preferences and may or may not use every possible codon. This observation allowed us to map the pattern of codon usage and non-usage across the topology. These results suggest that codon usage is phylogenetically conserved across shallow and deep levels within tetrapods.

11.

Presenilin E318G variant and Alzheimer's disease risk: the Cache County study.

Hippen, Ariel A; Ebbert, Mark T W; Norton, Maria C; Tschanz, JoAnn T; Munger, Ronald G; Corcoran, Christopher D; Kauwe, John S K.

BMC Genomics ; 17 Suppl 3: 438, 2016 06 29.

Article En | MEDLINE | ID: mdl-27357204

BACKGROUND: Alzheimer's disease is the leading cause of dementia in the elderly and the third most common cause of death in the United States. A vast number of genes regulate Alzheimer's disease, including Presenilin 1 (PSEN1). Multiple studies have attempted to locate novel variants in the PSEN1 gene that affect Alzheimer's disease status. A recent study suggested that one of these variants, PSEN1 E318G (rs17125721), significantly affects Alzheimer's disease status in a large case-control dataset, particularly in connection with the APOE ε4 allele. METHODS: Our study looks at the same variant in the Cache County Study on Memory and Aging, a large population-based dataset. We tested for association between E318G genotype and Alzheimer's disease status by running a series of Fisher's exact tests. We also performed logistic regression to test for an additive effect of E318G genotype on Alzheimer's disease status and for the existence of an interaction between E318G and APOE ε4. RESULTS: In our Fisher's exact test, it appeared that APOE ε4 carriers with an E318G allele have slightly higher risk for AD than those without the allele (3.3 vs. 3.8); however, the 95% confidence intervals of those estimates overlapped completely, indicating non-significance. Our logistic regression model found a positive but non-significant main effect for E318G (p = 0.895). The interaction term between E318G and APOE ε4 was also non-significant (p = 0.689). CONCLUSIONS: Our findings do not provide significant support for E318G as a risk factor for AD in APOE ε4 carriers. Our calculations indicated that the overall sample used in the logistic regression models was adequately powered to detect the sort of effect sizes observed previously. However, the power analyses of our Fisher's exact tests indicate that our partitioned data was underpowered, particularly in regards to the low number of E318G carriers, both AD cases and controls, in the Cache County dataset. Thus, the differences in types of datasets used may help to explain the difference in effect magnitudes seen. Analyses in additional case-control datasets will be required to understand fully the effect of E318G on Alzheimer's disease status.
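For readers unfamiliar with the test named in this abstract, a two-sided Fisher's exact test on a 2x2 genotype-by-status table can be sketched in Python as below. The counts are hypothetical and are not the Cache County data; the function names are illustrative, and the two-sided p-value follows the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Hypothetical counts: rows = variant carriers / non-carriers,
# columns = cases / controls.
table = (8, 2, 1, 5)
p_value = fisher_exact_two_sided(*table)
print(f"odds ratio = {odds_ratio(*table):.1f}, two-sided p = {p_value:.4f}")
```

With few carriers in a cell, as the abstract notes for its partitioned data, the set of possible tables is small and the test has limited power to detect modest effects.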

Alzheimer Disease/genetics , Genetic Predisposition to Disease/genetics , Mutation, Missense , Polymorphism, Single Nucleotide , Presenilin-1/genetics , Aged , Aged, 80 and over , Alleles , Apolipoprotein E4/genetics , Case-Control Studies , Epistasis, Genetic , Gene Frequency , Genotype , Humans , Logistic Models , Odds Ratio , Risk Factors , Utah