Search | VHL Regional Portal

1.

The tidyomics ecosystem: enhancing omic data analyses.

Hutchison, William J; Keyes, Timothy J; Crowell, Helena L; Serizay, Jacques; Soneson, Charlotte; Davis, Eric S; Sato, Noriaki; Moses, Lambda; Tarlinton, Boyd; Nahid, Abdullah A; Kosmac, Miha; Clayssen, Quentin; Yuan, Victor; Mu, Wancen; Park, Ji-Eun; Mamede, Izabela; Ryu, Min Hyung; Axisa, Pierre-Paul; Paiz, Paulina; Poon, Chi-Lam; Tang, Ming; Gottardo, Raphael; Morgan, Martin; Lee, Stuart; Lawrence, Michael; Hicks, Stephanie C; Nolan, Garry P; Davis, Kara L; Papenfuss, Anthony T; Love, Michael I; Mangiola, Stefano.

Nat Methods ; 2024 Jun 14.

Article in English | MEDLINE | ID: mdl-38877315

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.

2.

The tidyomics ecosystem: Enhancing omic data analyses.

Hutchison, William J; Keyes, Timothy J; Crowell, Helena L; Serizay, Jacques; Soneson, Charlotte; Davis, Eric S; Sato, Noriaki; Moses, Lambda; Tarlinton, Boyd; Nahid, Abdullah A; Kosmac, Miha; Clayssen, Quentin; Yuan, Victor; Mu, Wancen; Park, Ji-Eun; Mamede, Izabela; Ryu, Min Hyung; Axisa, Pierre-Paul; Paiz, Paulina; Poon, Chi-Lam; Tang, Ming; Gottardo, Raphael; Morgan, Martin; Lee, Stuart; Lawrence, Michael; Hicks, Stephanie C; Nolan, Garry P; Davis, Kara L; Papenfuss, Anthony T; Love, Michael I; Mangiola, Stefano.

bioRxiv ; 2024 May 22.

Article in English | MEDLINE | ID: mdl-38826347

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor1 provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming2 offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas3, spanning six data frameworks and ten analysis tools.

3.

AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines.

Oh, Sehyun; Gravel-Pucillo, Kai; Ramos, Marcel; Davis, Sean; Carey, Vince; Morgan, Martin; Waldron, Levi.

Res Sq ; 2024 May 15.

Article in English | MEDLINE | ID: mdl-38798429

ABSTRACT

Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. Here we present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflowsimplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflowfor three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and accessibility to scalable computing resources without direct management overhead. While some limitations exist around workflow customization, AnVILWorkflowlowers the barrier to taking advantage of AnVIL's resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. This package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available through GitHub (https://github.com/shbrief/AnVILWorkflow).

4.

ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management.

Liu, Qian; Hu, Qiang; Liu, Song; Hutson, Alan; Morgan, Martin.

BMC Bioinformatics ; 25(1): 8, 2024 Jan 03.

Article in English | MEDLINE | ID: mdl-38172657

ABSTRACT

BACKGROUND: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets, however, the practical utility of such tools becomes especially beneficial for users who seek to work with specific types of data or are technically inclined toward a particular programming language. Currently, there exists a gap in the availability of an R-specific solution for efficient data management and versatile data reuse. RESULTS: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, enabling them to be easily portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks. CONCLUSIONS: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor ( https://bioconductor.org/packages/ReUseData/ ) with additional information on the project website ( https://rcwl.org/dataRecipes/ ).

Subject(s)

Data Management , Genomics , Software , Programming Languages , Workflow

5.

Curated single cell multimodal landmark datasets for R/Bioconductor.

Eckenrode, Kelly B; Righelli, Dario; Ramos, Marcel; Argelaguet, Ricard; Vanderaa, Christophe; Geistlinger, Ludwig; Culhane, Aedin C; Gatto, Laurent; Carey, Vincent; Morgan, Martin; Risso, Davide; Waldron, Levi.

PLoS Comput Biol ; 19(8): e1011324, 2023 08.

Article in English | MEDLINE | ID: mdl-37624866

ABSTRACT

BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.

Subject(s)

Ecosystem , Proteomics , Cell Differentiation , Computational Biology , Epigenomics

6.

Evaluation of software impact designed for biomedical research: Are we measuring what's meaningful?

Afiaz, Awan; Ivanov, Andrey A; Chamberlin, John; Hanauer, David; Savonen, Candace L; Goldman, Mary J; Morgan, Martin; Reich, Michael; Getka, Alexander; Holmes, Aaron; Pati, Sarthak; Knight, Dan; Boutros, Paul C; Bakas, Spyridon; Caporaso, J Gregory; Del Fiol, Guilherme; Hochheiser, Harry; Haas, Brian; Schloss, Patrick D; Eddy, James A; Albrecht, Jake; Fedorov, Andrey; Waldron, Levi; Hoffman, Ava M; Bradshaw, Richard L; Leek, Jeffrey T; Wright, Carrie.

ArXiv ; 2023 Jun 05.

Article in English | MEDLINE | ID: mdl-37332562

ABSTRACT

Software is vital for the advancement of biology and medicine. Through analysis of usage and impact metrics of software, developers can help determine user and community engagement. These metrics can be used to justify additional funding, encourage additional use, and identify unanticipated use cases. Such analyses can help define improvement areas and assist with managing project resources. However, there are challenges associated with assessing usage and impact, many of which vary widely depending on the type of software being evaluated. These challenges involve issues of distorted, exaggerated, understated, or misleading metrics, as well as ethical and security concerns. More attention to the nuances, challenges, and considerations involved in capturing impact across the diverse spectrum of biological software is needed. Furthermore, some tools may be especially beneficial to a small audience, yet may not have comparatively compelling metrics of high usage. Although some principles are generally applicable, there is not a single perfect metric or approach to effectively evaluate a software tool's impact, as this depends on aspects unique to each tool, how it is used, and how one wishes to evaluate engagement. We propose more broadly applicable guidelines (such as infrastructure that supports the usage of software and the collection of metrics about usage), as well as strategies for various types of software and resources. We also highlight outstanding issues in the field regarding how communities measure or evaluate software impact. To gain a deeper understanding of the issues hindering software evaluations, as well as to determine what appears to be helpful, we performed a survey of participants involved with scientific software projects for the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We also investigated software among this scientific community and others to assess how often infrastructure that supports such evaluations is implemented and how this impacts rates of papers describing usage of the software. We find that although developers recognize the utility of analyzing data related to the impact or usage of their software, they struggle to find the time or funding to support such analyses. We also find that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seem to be associated with increased usage rates. Our findings can help scientific software developers make the most out of the evaluations of their software so that they can more fully benefit from such assessments.

7.

RaggedExperiment: the missing link between genomic ranges and matrices in Bioconductor.

Ramos, Marcel; Morgan, Martin; Geistlinger, Ludwig; Carey, Vincent J; Waldron, Levi.

Bioinformatics ; 39(6)2023 06 01.

Article in English | MEDLINE | ID: mdl-37208161

ABSTRACT

SUMMARY: The RaggedExperiment R / Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, in conjunction with efficient and flexible calculations of rectangular-shaped summaries for downstream analysis. Applications include statistical analysis of somatic mutations, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts. MOTIVATION AND RESULTS: Measurement of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produce "ragged" genomic ranges data: i.e. across different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.

Subject(s)

Genomics , Neoplasms , Humans , Genome , Software , Mutation , Neoplasms/genetics

8.

Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space.

Schatz, Michael C; Philippakis, Anthony A; Afgan, Enis; Banks, Eric; Carey, Vincent J; Carroll, Robert J; Culotti, Alessandro; Ellrott, Kyle; Goecks, Jeremy; Grossman, Robert L; Hall, Ira M; Hansen, Kasper D; Lawson, Jonathan; Leek, Jeffrey T; Luria, Anne O'Donnell; Mosher, Stephen; Morgan, Martin; Nekrutenko, Anton; O'Connor, Brian D; Osborn, Kevin; Paten, Benedict; Patterson, Candace; Tan, Frederick J; Taylor, Casey Overby; Vessio, Jennifer; Waldron, Levi; Wang, Ting; Wuichet, Kristin.

Cell Genom ; 2(1)2022 Jan 12.

Article in English | MEDLINE | ID: mdl-35199087

ABSTRACT

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.

9.

Open-source Software Sustainability Models: Initial White Paper From the Informatics Technology for Cancer Research Sustainability and Industry Partnership Working Group.

Ye, Ye; Barapatre, Seemran; Davis, Michael K; Elliston, Keith O; Davatzikos, Christos; Fedorov, Andrey; Fillion-Robin, Jean-Christophe; Foster, Ian; Gilbertson, John R; Lasso, Andras; Miller, James V; Morgan, Martin; Pieper, Steve; Raumann, Brigitte E; Sarachan, Brion D; Savova, Guergana; Silverstein, Jonathan C; Taylor, Donald P; Zelnis, Joyce B; Zhang, Guo-Qiang; Cuticchia, Jamie; Becich, Michael J.

J Med Internet Res ; 23(12): e20028, 2021 12 02.

Article in English | MEDLINE | ID: mdl-34860667

ABSTRACT

BACKGROUND: The National Cancer Institute Informatics Technology for Cancer Research (ITCR) program provides a series of funding mechanisms to create an ecosystem of open-source software (OSS) that serves the needs of cancer research. As the ITCR ecosystem substantially grows, it faces the challenge of the long-term sustainability of the software being developed by ITCR grantees. To address this challenge, the ITCR sustainability and industry partnership working group (SIP-WG) was convened in 2019. OBJECTIVE: The charter of the SIP-WG is to investigate options to enhance the long-term sustainability of the OSS being developed by ITCR, in part by developing a collection of business model archetypes that can serve as sustainability plans for ITCR OSS development initiatives. The working group assembled models from the ITCR program, from other studies, and from the engagement of its extensive network of relationships with other organizations (eg, Chan Zuckerberg Initiative, Open Source Initiative, and Software Sustainability Institute) in support of this objective. METHODS: This paper reviews the existing sustainability models and describes 10 OSS use cases disseminated by the SIP-WG and others, including 3D Slicer, Bioconductor, Cytoscape, Globus, i2b2 (Informatics for Integrating Biology and the Bedside) and tranSMART, Insight Toolkit, Linux, Observational Health Data Sciences and Informatics tools, R, and REDCap (Research Electronic Data Capture), in 10 sustainability aspects: governance, documentation, code quality, support, ecosystem collaboration, security, legal, finance, marketing, and dependency hygiene. RESULTS: Information available to the public reveals that all 10 OSS have effective governance, comprehensive documentation, high code quality, reliable dependency hygiene, strong user and developer support, and active marketing. These OSS include a variety of licensing models (eg, general public license version 2, general public license version 3, Berkeley Software Distribution, and Apache 3) and financial models (eg, federal research funding, industry and membership support, and commercial support). However, detailed information on ecosystem collaboration and security is not publicly provided by most OSS. CONCLUSIONS: We recommend 6 essential attributes for research software: alignment with unmet scientific needs, a dedicated development team, a vibrant user community, a feasible licensing model, a sustainable financial model, and effective product management. We also stress important actions to be considered in future ITCR activities that involve the discussion of the sustainability and licensing models for ITCR OSS, the establishment of a central library, the allocation of consulting resources to code quality control, ecosystem collaboration, security, and dependency hygiene.

Subject(s)

Ecosystem , Neoplasms , Humans , Informatics , Neoplasms/therapy , Research , Software , Technology

10.

Bioconductor toolchain for reproducible bioinformatics pipelines using Rcwl and RcwlPipelines.

Hu, Qiang; Hutson, Alan; Liu, Song; Morgan, Martin; Liu, Qian.

Bioinformatics ; 37(19): 3351-3352, 2021 Oct 11.

Article in English | MEDLINE | ID: mdl-33772584

ABSTRACT

SUMMARY: The Common Workflow Language (CWL) is used to provide portable and reproducible data analysis workflows across different tools and computing environments. We have developed Rcwl, an R interface to CWL, to provide easier development, use and maintenance of CWL pipelines from within R. We have also collected more than 100 pre-built tools and pipelines in RcwlPipelines, ready to be queried and used by researchers in their own analysis. A single-cell RNA sequencing preprocessing pipeline demonstrates use of the software. AVAILABILITY AND IMPLEMENTATION: Project website: https://rcwl.org (Rcwl: https://bioconductor.org/packages/Rcwl; RcwlPipelines: https://bioconductor.org/packages/RcwlPipelines). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.

Toward a gold standard for benchmarking gene set enrichment analysis.

Geistlinger, Ludwig; Csaba, Gergely; Santarelli, Mara; Ramos, Marcel; Schiffer, Lucas; Turaga, Nitesh; Law, Charity; Davis, Sean; Carey, Vincent; Morgan, Martin; Zimmer, Ralf; Waldron, Levi.

Brief Bioinform ; 22(1): 545-556, 2021 01 18.

Article in English | MEDLINE | ID: mdl-32026945

ABSTRACT

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.

Subject(s)

Gene Expression Profiling/methods , Genomics/methods , RNA-Seq/methods , Animals , Benchmarking , Databases, Genetic/standards , Gene Expression Profiling/standards , Genomics/standards , Humans , RNA-Seq/standards , Software

12.

Multiomic Integration of Public Oncology Databases in Bioconductor.

Ramos, Marcel; Geistlinger, Ludwig; Oh, Sehyun; Schiffer, Lucas; Azhar, Rimsha; Kodali, Hanish; de Bruijn, Ino; Gao, Jianjiong; Carey, Vincent J; Morgan, Martin; Waldron, Levi.

JCO Clin Cancer Inform ; 4: 958-971, 2020 10.

Article in English | MEDLINE | ID: mdl-33119407

ABSTRACT

PURPOSE: Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS: We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS: We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION: These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.

Subject(s)

Computational Biology , Data Management , Databases, Genetic , Genomics , Humans , Software

13.

Multiomic Analysis of Subtype Evolution and Heterogeneity in High-Grade Serous Ovarian Carcinoma.

Geistlinger, Ludwig; Oh, Sehyun; Ramos, Marcel; Schiffer, Lucas; LaRue, Rebecca S; Henzler, Christine M; Munro, Sarah A; Daughters, Claire; Nelson, Andrew C; Winterhoff, Boris J; Chang, Zenas; Talukdar, Shobhana; Shetty, Mihir; Mullany, Sally A; Morgan, Martin; Parmigiani, Giovanni; Birrer, Michael J; Qin, Li-Xuan; Riester, Markus; Starr, Timothy K; Waldron, Levi.

Cancer Res ; 80(20): 4335-4345, 2020 10 15.

Article in English | MEDLINE | ID: mdl-32747365

ABSTRACT

Multiple studies have identified transcriptome subtypes of high-grade serous ovarian carcinoma (HGSOC), but their interpretation and translation are complicated by tumor evolution and polyclonality accompanied by extensive accumulation of somatic aberrations, varying cell type admixtures, and different tissues of origin. In this study, we examined the chronology of HGSOC subtype evolution in the context of these factors using a novel integrative analysis of absolute copy-number analysis and gene expression in The Cancer Genome Atlas complemented by single-cell analysis of six independent tumors. Tumor purity, ploidy, and subclonality were reliably inferred from different genomic platforms, and these characteristics displayed marked differences between subtypes. Genomic lesions associated with HGSOC subtypes tended to be subclonal, implying subtype divergence at later stages of tumor evolution. Subclonality of recurrent HGSOC alterations was evident for proliferative tumors, characterized by extreme genomic instability, absence of immune infiltration, and greater patient age. In contrast, differentiated tumors were characterized by largely intact genome integrity, high immune infiltration, and younger patient age. Single-cell sequencing of 42,000 tumor cells revealed widespread heterogeneity in tumor cell type composition that drove bulk subtypes but demonstrated a lack of intrinsic subtypes among tumor epithelial cells. Our findings prompt the dismissal of discrete transcriptome subtypes for HGSOC and replacement by a more realistic model of continuous tumor development that includes mixtures of subclones, accumulation of somatic aberrations, infiltration of immune and stromal cells in proportions correlated with tumor stage and tissue of origin, and evolution between properties previously associated with discrete subtypes. SIGNIFICANCE: This study infers whether transcriptome-based groupings of tumors differentiate early in carcinogenesis and are, therefore, appropriate targets for therapy and demonstrates that this is not the case for HGSOC.

Subject(s)

Cystadenocarcinoma, Serous/genetics , Cystadenocarcinoma, Serous/pathology , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Female , Gene Expression Profiling , Genomic Instability , Humans , Ploidies , Single-Cell Analysis

14.

Cancer Moonshot Immuno-Oncology Translational Network (IOTN): accelerating the clinical translation of basic discoveries for improving immunotherapy and immunoprevention of cancer.

Annapragada, Ananth; Sikora, Andrew; Bollard, Catherine; Conejo-Garcia, Jose; Cruz, Conrad Russell; Demehri, Shadmehr; Demetriou, Michael; Demirdjian, Levon; Fong, Lawrence; Horowitz, Mary; Hutson, Alan; Kadash-Edmondson, Kathryn; Kufe, Donald; Lipkin, Steven; Liu, Song; McCarthy, Claire; Morgan, Martin; Morris, Zachary; Pan, Yang; Pasquini, Marcelo; Schoenberger, Stephen; Van Allen, Eliezer; Vilar, Eduardo; Xing, Yi; Zha, Wenjuan; Odunsi, Adekunle.

J Immunother Cancer ; 8(1)2020 06.

Article in English | MEDLINE | ID: mdl-32554617

ABSTRACT

Despite regulatory approval of several immune-based treatments for cancer in the past decade, a number of barriers remain to be addressed in order to fully harness the therapeutic potential of the immune system and provide benefits for patients with cancer. As part of the Cancer Moonshot initiative, the Immuno-Oncology Translational Network (IOTN) was established to accelerate the translation of basic discoveries to improve immunotherapy outcomes across the spectrum of adult cancers and to develop immune-based approaches that prevent cancers before they occur. The IOTN currently consists of 32 academic institutions in the USA. By leveraging cutting-edge preclinical research in immunotherapy and immunoprevention, open data and resource sharing, and fostering highly collaborative team science across the immuno-oncology ecosystem, the IOTN is designed to accelerate the generation of novel mechanism-driven immune-based cancer prevention and therapies, and the development of safe and effective personalized immuno-oncology approaches.

Subject(s)

Immunotherapy/methods , Medical Oncology/organization & administration , Neoplasms/drug therapy , Neoplasms/immunology , Humans

15.

Global Alliance for Genomics and Health Meets Bioconductor: Toward Reproducible and Agile Cancer Genomics at Cloud Scale.

Carey, Vincent J; Ramos, Marcel; Stubbs, Benjamin J; Gopaulakrishnan, Shweta; Oh, Sehyun; Turaga, Nitesh; Waldron, Levi; Morgan, Martin.

JCO Clin Cancer Inform ; 4: 472-479, 2020 05.

Article in English | MEDLINE | ID: mdl-32453635

ABSTRACT

PURPOSE: Institutional efforts toward the democratization of cloud-scale data and analysis methods for cancer genomics are proceeding rapidly. As part of this effort, we bridge two major bioinformatic initiatives: the Global Alliance for Genomics and Health (GA4GH) and Bioconductor. METHODS: We describe in detail a use case in pancancer transcriptomics conducted by blending implementations of the GA4GH Workflow Execution Services and Tool Registry Service concepts with the Bioconductor curatedTCGAData and BiocOncoTK packages. RESULTS: We carried out the analysis with a formally archived workflow and container at dockstore.org and a workspace and notebook at app.terra.bio. The analysis identified relationships between microsatellite instability and biomarkers of immune dysregulation at a finer level of granularity than previously reported. Our use of standard approaches to containerization and workflow programming allows this analysis to be replicated and extended. CONCLUSION: Experimental use of dockstore.org and app.terra.bio in concert with Bioconductor enabled novel statistical analysis of large genomic projects without the need for local supercomputing resources but involved challenges related to container design, script archiving, and unit testing. Best practices and cost/benefit metrics for the management and analysis of globally federated genomic data and annotation are evolving. The creation and execution of use cases like the one reported here will be helpful in the development and comparison of approaches to federated data/analysis systems in cancer genomics.

Subject(s)

Neoplasms , Software , Computational Biology , Genomics , Humans , Neoplasms/genetics , Workflow

16.

Reliable Analysis of Clinical Tumor-Only Whole-Exome Sequencing Data.

Oh, Sehyun; Geistlinger, Ludwig; Ramos, Marcel; Morgan, Martin; Waldron, Levi; Riester, Markus.

JCO Clin Cancer Inform ; 4: 321-335, 2020 04.

Article in English | MEDLINE | ID: mdl-32282230

ABSTRACT

PURPOSE: Allele-specific copy number alteration (CNA) analysis is essential to study the functional impact of single-nucleotide variants (SNVs) and the process of tumorigenesis. However, controversy over whether it can be performed with sufficient accuracy in data without matched normal profiles and a lack of open-source implementations have limited its application in clinical research and diagnosis. METHODS: We benchmark allele-specific CNA analysis performance of whole-exome sequencing (WES) data against gold standard whole-genome SNP6 microarray data and against WES data sets with matched normal samples. We provide a workflow based on the open-source PureCN R/Bioconductor package in conjunction with widely used variant-calling and copy number segmentation algorithms for allele-specific CNA analysis from WES without matched normals. This workflow further classifies SNVs by somatic status and then uses this information to infer somatic mutational signatures and tumor mutational burden (TMB). RESULTS: Application of our workflow to tumor-only WES data produces tumor purity and ploidy estimates that are highly concordant with estimates from SNP6 microarray data and matched normal WES data. The presence of cancer type-specific somatic mutational signatures was inferred with high accuracy. We also demonstrate high concordance of TMB between our tumor-only workflow and matched normal pipelines. CONCLUSION: The proposed workflow provides, to our knowledge, the only open-source option with demonstrated high accuracy for comprehensive allele-specific CNA analysis and SNV classification of tumor-only WES. An implementation of the workflow is available on the Terra Cloud platform of the Broad Institute (Cambridge, MA).

Subject(s)

Algorithms , Biomarkers, Tumor/genetics , DNA Copy Number Variations , Exome Sequencing/methods , Exome , Mutation , Neoplasms/genetics , Gene Expression Regulation, Neoplastic , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/pathology , Neoplasms/therapy

17.

Tximeta: Reference sequence checksums for provenance identification in RNA-seq.

Love, Michael I; Soneson, Charlotte; Hickey, Peter F; Johnson, Lisa K; Pierce, N Tessa; Shepherd, Lori; Morgan, Martin; Patro, Rob.

PLoS Comput Biol ; 16(2): e1007664, 2020 02.

Article in English | MEDLINE | ID: mdl-32097405

ABSTRACT

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.

Subject(s)

Computational Biology/methods , Gene Expression Profiling , RNA-Seq , Algorithms , Animals , Drosophila melanogaster , Genomics , Humans , Mice , Models, Statistical , Pattern Recognition, Automated , Programming Languages , Reproducibility of Results , Software , Transcriptome

18.

Orchestrating single-cell analysis with Bioconductor.

Amezquita, Robert A; Lun, Aaron T L; Becht, Etienne; Carey, Vince J; Carpp, Lindsay N; Geistlinger, Ludwig; Marini, Federico; Rue-Albrecht, Kevin; Risso, Davide; Soneson, Charlotte; Waldron, Levi; Pagès, Hervé; Smith, Mike L; Huber, Wolfgang; Morgan, Martin; Gottardo, Raphael; Hicks, Stephanie C.

Nat Methods ; 17(2): 137-145, 2020 02.

Article in English | MEDLINE | ID: mdl-31792435

ABSTRACT

Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.

Subject(s)

Single-Cell Analysis/methods , Gene Expression Profiling , Genome , High-Throughput Nucleotide Sequencing , Software

19.

Publisher Correction: Orchestrating single-cell analysis with Bioconductor.

Amezquita, Robert A; Lun, Aaron T L; Becht, Etienne; Carey, Vince J; Carpp, Lindsay N; Geistlinger, Ludwig; Marini, Federico; Rue-Albrecht, Kevin; Risso, Davide; Soneson, Charlotte; Waldron, Levi; Pagès, Hervé; Smith, Mike L; Huber, Wolfgang; Morgan, Martin; Gottardo, Raphael; Hicks, Stephanie C.

Nat Methods ; 17(2): 242, 2020 Feb.

Article in English | MEDLINE | ID: mdl-31827272

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

20.

BiocPkgTools: Toolkit for mining the Bioconductor package ecosystem.

Su, Shian; Carey, Vincent J; Shepherd, Lori; Ritchie, Matthew; Morgan, Martin T; Davis, Sean.

F1000Res ; 8: 752, 2019.

Article in English | MEDLINE | ID: mdl-31249680

ABSTRACT

Motivation: The Bioconductor project, a large collection of open source software for the comprehension of large-scale biological data, continues to grow with new packages added each week, motivating the development of software tools focused on exposing package metadata to developers and users. The resulting BiocPkgTools package facilitates access to extensive metadata in computable form covering the Bioconductor package ecosystem, facilitating downstream applications such as custom reporting, data and text mining of Bioconductor package text descriptions, graph analytics over package dependencies, and custom search approaches. Results: The BiocPkgTools package has been incorporated into the Bioconductor project, installs using standard procedures, and runs on any system supporting R. It provides functions to load detailed package metadata, longitudinal package download statistics, package dependencies, and Bioconductor build reports, all in "tidy data" form. BiocPkgTools can convert from tidy data structures to graph structures, enabling graph-based analytics and visualization. An end-user-friendly graphical package explorer aids in task-centric package discovery. Full documentation and example use cases are included. Availability: The BiocPkgTools software and complete documentation are available from Bioconductor ( https://bioconductor.org/packages/BiocPkgTools).

Subject(s)

Data Mining , Software , Metadata

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL