Toward a gold standard for benchmarking gene set enrichment analysis.

Geistlinger, Ludwig; Csaba, Gergely; Santarelli, Mara; Ramos, Marcel; Schiffer, Lucas; Turaga, Nitesh; Law, Charity; Davis, Sean; Carey, Vincent; Morgan, Martin; Zimmer, Ralf; Waldron, Levi.

Brief Bioinform ; 22(1): 545-556, 2021 01 18.

Article in English | MEDLINE | ID: mdl-32026945

ABSTRACT

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.

Subject(s)

Gene Expression Profiling/methods , Genomics/methods , RNA-Seq/methods , Animals , Benchmarking , Databases, Genetic/standards , Gene Expression Profiling/standards , Genomics/standards , Humans , RNA-Seq/standards , Software

HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.

Schiffer, Lucas; Azhar, Rimsha; Shepherd, Lori; Ramos, Marcel; Geistlinger, Ludwig; Huttenhower, Curtis; Dowd, Jennifer B; Segata, Nicola; Waldron, Levi.

Am J Epidemiol ; 188(6): 1023-1026, 2019 06 01.

Article in English | MEDLINE | ID: mdl-30649166

ABSTRACT

Phase 1 of the Human Microbiome Project (HMP) investigated 18 body subsites of 242 healthy American adults to produce the first comprehensive reference for the composition and variation of the "healthy" human microbiome. Publicly available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis. The HMP16SData package provides count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. By removing bioinformatic hurdles of data access and management, HMP16SData enables epidemiologists with only basic R skills to quickly analyze HMP data.

Subject(s)

Databases, Genetic/statistics & numerical data , Microbiota/physiology , RNA, Ribosomal, 16S/metabolism , Adolescent , Adult , Computational Biology , Female , Humans , Male , Young Adult

Accessible, curated metagenomic data through ExperimentHub.

Pasolli, Edoardo; Schiffer, Lucas; Manghi, Paolo; Renson, Audrey; Obenchain, Valerie; Truong, Duy Tin; Beghini, Francesco; Malik, Faizan; Ramos, Marcel; Dowd, Jennifer B; Huttenhower, Curtis; Morgan, Martin; Segata, Nicola; Waldron, Levi.

Nat Methods ; 14(11): 1023-1024, 2017 10 31.

Article in English | MEDLINE | ID: mdl-29088129

Subject(s)

Computational Biology/methods , Metagenomics/methods , Microbiota/genetics , Software , Gastrointestinal Microbiome/genetics , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Genome, Fungal/genetics , Genome, Human/genetics , Humans , Species Specificity

Waldron et al. Reply to "Commentary on the HMP16SData Bioconductor Package".

Waldron, Levi; Schiffer, Lucas; Azhar, Rimsha; Ramos, Marcel; Geistlinger, Ludwig; Segata, Nicola.

Am J Epidemiol ; 188(6): 1031-1032, 2019 06 01.

Article in English | MEDLINE | ID: mdl-30689687

Subject(s)

Microbiota , Software , Genomics , Humans

Multiomic Integration of Public Oncology Databases in Bioconductor.

Ramos, Marcel; Geistlinger, Ludwig; Oh, Sehyun; Schiffer, Lucas; Azhar, Rimsha; Kodali, Hanish; de Bruijn, Ino; Gao, Jianjiong; Carey, Vincent J; Morgan, Martin; Waldron, Levi.

JCO Clin Cancer Inform ; 4: 958-971, 2020 10.

Article in English | MEDLINE | ID: mdl-33119407

ABSTRACT

PURPOSE: Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS: We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS: We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION: These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.

Subject(s)

Computational Biology , Data Management , Databases, Genetic , Genomics , Humans , Software

Multiomic Analysis of Subtype Evolution and Heterogeneity in High-Grade Serous Ovarian Carcinoma.

Geistlinger, Ludwig; Oh, Sehyun; Ramos, Marcel; Schiffer, Lucas; LaRue, Rebecca S; Henzler, Christine M; Munro, Sarah A; Daughters, Claire; Nelson, Andrew C; Winterhoff, Boris J; Chang, Zenas; Talukdar, Shobhana; Shetty, Mihir; Mullany, Sally A; Morgan, Martin; Parmigiani, Giovanni; Birrer, Michael J; Qin, Li-Xuan; Riester, Markus; Starr, Timothy K; Waldron, Levi.

Cancer Res ; 80(20): 4335-4345, 2020 10 15.

Article in English | MEDLINE | ID: mdl-32747365

ABSTRACT

Multiple studies have identified transcriptome subtypes of high-grade serous ovarian carcinoma (HGSOC), but their interpretation and translation are complicated by tumor evolution and polyclonality accompanied by extensive accumulation of somatic aberrations, varying cell type admixtures, and different tissues of origin. In this study, we examined the chronology of HGSOC subtype evolution in the context of these factors using a novel integrative analysis of absolute copy-number analysis and gene expression in The Cancer Genome Atlas complemented by single-cell analysis of six independent tumors. Tumor purity, ploidy, and subclonality were reliably inferred from different genomic platforms, and these characteristics displayed marked differences between subtypes. Genomic lesions associated with HGSOC subtypes tended to be subclonal, implying subtype divergence at later stages of tumor evolution. Subclonality of recurrent HGSOC alterations was evident for proliferative tumors, characterized by extreme genomic instability, absence of immune infiltration, and greater patient age. In contrast, differentiated tumors were characterized by largely intact genome integrity, high immune infiltration, and younger patient age. Single-cell sequencing of 42,000 tumor cells revealed widespread heterogeneity in tumor cell type composition that drove bulk subtypes but demonstrated a lack of intrinsic subtypes among tumor epithelial cells. Our findings prompt the dismissal of discrete transcriptome subtypes for HGSOC and replacement by a more realistic model of continuous tumor development that includes mixtures of subclones, accumulation of somatic aberrations, infiltration of immune and stromal cells in proportions correlated with tumor stage and tissue of origin, and evolution between properties previously associated with discrete subtypes. 
SIGNIFICANCE: This study infers whether transcriptome-based groupings of tumors differentiate early in carcinogenesis and are, therefore, appropriate targets for therapy and demonstrates that this is not the case for HGSOC.

Subject(s)

Cystadenocarcinoma, Serous/genetics , Cystadenocarcinoma, Serous/pathology , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Female , Gene Expression Profiling , Genomic Instability , Humans , Ploidies , Single-Cell Analysis

Software for the Integration of Multiomics Experiments in Bioconductor.

Ramos, Marcel; Schiffer, Lucas; Re, Angela; Azhar, Rimsha; Basunia, Azfar; Rodriguez, Carmen; Chan, Tiffany; Chapman, Phil; Davis, Sean R; Gomez-Cabrero, David; Culhane, Aedin C; Haibe-Kains, Benjamin; Hansen, Kasper D; Kodali, Hanish; Louis, Marie S; Mer, Arvind S; Riester, Markus; Morgan, Martin; Carey, Vince; Waldron, Levi.

Cancer Res ; 77(21): e39-e42, 2017 11 01.

Article in English | MEDLINE | ID: mdl-29092936

ABSTRACT

Multiomics experiments are increasingly commonplace in biomedical research and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multiomics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide the unrestricted multiple 'omics data for each cancer tissue in The Cancer Genome Atlas as ready-to-analyze MultiAssayExperiment objects and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets.

Subject(s)

Genomics , Neoplasms/genetics , Software , Computational Biology , Datasets as Topic , Genome, Human , Humans
