
1.

AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines.

Oh, Sehyun; Gravel-Pucillo, Kai; Ramos, Marcel; Davis, Sean; Carey, Vince; Morgan, Martin; Waldron, Levi.

Res Sq ; 2024 May 15.

Article En | MEDLINE | ID: mdl-38798429

Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. Here we present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflow simplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflow for three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and accessibility to scalable computing resources without direct management overhead. While some limitations exist around workflow customization, AnVILWorkflow lowers the barrier to taking advantage of AnVIL's resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. This package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available through GitHub (https://github.com/shbrief/AnVILWorkflow).

2.

bamSliceR: cross-cohort variant and allelic bias analysis for rare variants and rare diseases.

Huang, Yizhou Peter; Harmon, Lauren; Gardner, Eve; Ma, Xiaotu; Harsh, Josiah; Xue, Zhaoyu; Wen, Hong; Ramos, Marcel; Davis, Sean; Triche, Timothy J.

bioRxiv ; 2023 Sep 17.

Article En | MEDLINE | ID: mdl-37745420

Rare diseases and conditions create unique challenges for genetic epidemiologists precisely because cases and samples are scarce. In recent years, whole-genome and whole-transcriptome sequencing (WGS/WTS) have eased the study of rare genetic variants. Paired WGS and WTS data are ideal, but logistical and financial constraints often preclude generating paired WGS and WTS data. Thus, many databases contain a patchwork of specimens with either WGS or WTS data, but only a minority of samples have both. The NCI Genomic Data Commons facilitates controlled access to genomic and transcriptomic data for thousands of subjects, many with unpaired sequencing results. Local reanalysis of expressed variants across whole transcriptomes requires significant data storage, compute, and expertise. We developed the bamSliceR package to facilitate swift transition from aligned sequence reads to expressed variant characterization. bamSliceR leverages the NCI Genomic Data Commons API to query genomic sub-regions of aligned sequence reads from specimens identified through the robust Bioconductor ecosystem. We demonstrate how population-scale targeted genomic analysis can be completed using orders of magnitude fewer resources in this fashion, with minimal compute burden. We demonstrate pilot results from bamSliceR for the TARGET pediatric AML and BEAT-AML projects, where identification of rare but recurrent somatic variants directly yields biologically testable hypotheses. bamSliceR and its documentation are freely available on GitHub at https://github.com/trichelab/bamSliceR.
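
The abstract above describes querying genomic sub-regions of aligned reads through the NCI Genomic Data Commons API. As a rough illustration of what such a request might look like, the sketch below only composes a slicing URL. It assumes the GDC BAM slicing endpoint (`/slicing/view/<file UUID>` with repeated `region` parameters) as described in the GDC API documentation; the helper name, the placeholder UUID, and the coordinates are hypothetical, and this is not bamSliceR's own code. Actually downloading a slice would additionally require a controlled-access token.

```python
from urllib.parse import urlencode

# Assumed from the GDC API documentation; not taken from the abstract.
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_url(file_uuid, regions):
    """Compose a BAM-slicing URL for one file and a list of regions.

    `regions` are strings such as "chr17:7668402-7687550".
    Hypothetical helper, for illustration only.
    """
    if not regions:
        raise ValueError("at least one region is required")
    # One `region` query parameter per requested interval; keep ':' readable.
    query = urlencode([("region", r) for r in regions], safe=":")
    return f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"


# Placeholder UUID and illustrative coordinates.
url = build_slice_url(
    "00000000-0000-0000-0000-000000000000",
    ["chr17:7668402-7687550", "chr13:28003274-28100592"],
)
print(url)
```

Requesting only the regions of interest, rather than whole BAM files, is what keeps the storage and compute footprint small in the approach the abstract describes.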

3.

Curated single cell multimodal landmark datasets for R/Bioconductor.

Eckenrode, Kelly B; Righelli, Dario; Ramos, Marcel; Argelaguet, Ricard; Vanderaa, Christophe; Geistlinger, Ludwig; Culhane, Aedin C; Gatto, Laurent; Carey, Vincent; Morgan, Martin; Risso, Davide; Waldron, Levi.

PLoS Comput Biol ; 19(8): e1011324, 2023 08.

Article En | MEDLINE | ID: mdl-37624866

BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.

Ecosystem , Proteomics , Cell Differentiation , Computational Biology , Epigenomics

4.

RaggedExperiment: the missing link between genomic ranges and matrices in Bioconductor.

Ramos, Marcel; Morgan, Martin; Geistlinger, Ludwig; Carey, Vincent J; Waldron, Levi.

Bioinformatics ; 39(6), 2023 06 01.

Article En | MEDLINE | ID: mdl-37208161

SUMMARY: The RaggedExperiment R/Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, in conjunction with efficient and flexible calculations of rectangular-shaped summaries for downstream analysis. Applications include statistical analysis of somatic mutations, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts. MOTIVATION AND RESULTS: Measurement of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produces "ragged" genomic ranges data: i.e. across different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.

Genomics , Neoplasms , Humans , Genome , Software , Mutation , Neoplasms/genetics
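
To make the "ragged to rectangular" idea in the RaggedExperiment entry above concrete, here is a minimal sketch in plain Python. It is not the RaggedExperiment API: the function names, the single-chromosome simplification, the closed integer intervals, and the overlap-weighted mean used as the summary are all assumptions chosen for illustration.

```python
import math


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed integer intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def summarize(segments_by_sample, regions):
    """Return (samples, matrix), where matrix is regions x samples.

    Each cell is the overlap-weighted mean of the segment values that
    fall in the region, or NaN when the sample has no data there.
    """
    samples = sorted(segments_by_sample)
    matrix = []
    for r_start, r_end in regions:
        row = []
        for sample in samples:
            total = 0.0
            weight = 0.0
            for seg_start, seg_end, value in segments_by_sample[sample]:
                w = overlap(r_start, r_end, seg_start, seg_end)
                total += w * value
                weight += w
            row.append(total / weight if weight else math.nan)
        matrix.append(row)
    return samples, matrix


# Synthetic "ragged" input: each sample has its own segment boundaries.
segments = {
    "sampleA": [(1, 100, 0.5), (101, 200, -1.0)],
    "sampleB": [(50, 150, 0.2)],
}
regions = [(1, 100), (101, 200), (201, 300)]
samples, matrix = summarize(segments, regions)
```

The input keeps every sample's original ranges, while the output is an ordinary matrix that standard statistical tools can consume; other summaries (for example, the value of the largest overlapping segment) could be substituted for the weighted mean.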

5.

GenomicSuperSignature facilitates interpretation of RNA-seq experiments through robust, efficient comparison to public databases.

Oh, Sehyun; Geistlinger, Ludwig; Ramos, Marcel; Blankenberg, Daniel; van den Beek, Marius; Taroni, Jaclyn N; Carey, Vincent J; Greene, Casey S; Waldron, Levi; Davis, Sean.

Nat Commun ; 13(1): 3695, 2022 06 27.

Article En | MEDLINE | ID: mdl-35760813

Millions of transcriptomic profiles have been deposited in public archives, yet remain underused for the interpretation of new experiments. We present a method for interpreting new transcriptomic datasets through instant comparison to public datasets without high-performance computing requirements. We apply Principal Component Analysis on 536 studies comprising 44,890 human RNA sequencing profiles and aggregate sufficiently similar loading vectors to form Replicable Axes of Variation (RAV). RAVs are annotated with metadata of originating studies and by gene set enrichment analysis. Functionality to associate new datasets with RAVs, extract interpretable annotations, and provide intuitive visualization are implemented as the GenomicSuperSignature R/Bioconductor package. We demonstrate the efficient and coherent database search, robustness to batch effects and heterogeneous training data, and transfer learning capacity of our method using TCGA and rare diseases datasets. GenomicSuperSignature aids in analyzing new gene expression data in the context of existing databases using minimal computing resources.

Databases, Genetic , Software , Humans , RNA-Seq , Transcriptome/genetics
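
The GenomicSuperSignature entry above mentions aggregating "sufficiently similar loading vectors" from many per-study principal component analyses. The sketch below illustrates only that aggregation step on toy numbers. It uses a simple greedy grouping by Pearson correlation, which may differ from the clustering the authors used; the threshold, the function names, and the assumption that the vectors are already sign-aligned are all hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def group_loadings(loadings, threshold=0.9):
    """Greedily assign each vector to the first group whose seed vector
    it correlates with at or above `threshold`; otherwise start a group."""
    groups = []
    for idx, vec in enumerate(loadings):
        for group in groups:
            if pearson(loadings[group[0]], vec) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def average(loadings, group):
    """Element-wise mean of the vectors in one group."""
    return [mean(col) for col in zip(*(loadings[i] for i in group))]


# Toy loading vectors over four genes (synthetic values).
loadings = [
    [0.9, 0.1, -0.4, 0.2],   # study 1, PC1
    [0.8, 0.2, -0.5, 0.1],   # study 2, PC1 (similar to the first)
    [-0.1, 0.7, 0.6, -0.3],  # study 3, PC1 (a different axis)
]
groups = group_loadings(loadings)
averaged = [average(loadings, g) for g in groups]
```

Each averaged vector plays the role of one aggregated axis; in this toy case the first two vectors merge and the third stays on its own.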

6.

Toward a gold standard for benchmarking gene set enrichment analysis.

Geistlinger, Ludwig; Csaba, Gergely; Santarelli, Mara; Ramos, Marcel; Schiffer, Lucas; Turaga, Nitesh; Law, Charity; Davis, Sean; Carey, Vincent; Morgan, Martin; Zimmer, Ralf; Waldron, Levi.

Brief Bioinform ; 22(1): 545-556, 2021 01 18.

Article En | MEDLINE | ID: mdl-32026945

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.

Gene Expression Profiling/methods , Genomics/methods , RNA-Seq/methods , Animals , Benchmarking , Databases, Genetic/standards , Gene Expression Profiling/standards , Genomics/standards , Humans , RNA-Seq/standards , Software

7.

Multiomic Integration of Public Oncology Databases in Bioconductor.

Ramos, Marcel; Geistlinger, Ludwig; Oh, Sehyun; Schiffer, Lucas; Azhar, Rimsha; Kodali, Hanish; de Bruijn, Ino; Gao, Jianjiong; Carey, Vincent J; Morgan, Martin; Waldron, Levi.

JCO Clin Cancer Inform ; 4: 958-971, 2020 10.

Article En | MEDLINE | ID: mdl-33119407

PURPOSE: Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS: We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS: We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION: These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.

Computational Biology , Data Management , Databases, Genetic , Genomics , Humans , Software

8.

Multiomic Analysis of Subtype Evolution and Heterogeneity in High-Grade Serous Ovarian Carcinoma.

Geistlinger, Ludwig; Oh, Sehyun; Ramos, Marcel; Schiffer, Lucas; LaRue, Rebecca S; Henzler, Christine M; Munro, Sarah A; Daughters, Claire; Nelson, Andrew C; Winterhoff, Boris J; Chang, Zenas; Talukdar, Shobhana; Shetty, Mihir; Mullany, Sally A; Morgan, Martin; Parmigiani, Giovanni; Birrer, Michael J; Qin, Li-Xuan; Riester, Markus; Starr, Timothy K; Waldron, Levi.

Cancer Res ; 80(20): 4335-4345, 2020 10 15.

Article En | MEDLINE | ID: mdl-32747365

Multiple studies have identified transcriptome subtypes of high-grade serous ovarian carcinoma (HGSOC), but their interpretation and translation are complicated by tumor evolution and polyclonality accompanied by extensive accumulation of somatic aberrations, varying cell type admixtures, and different tissues of origin. In this study, we examined the chronology of HGSOC subtype evolution in the context of these factors using a novel integrative analysis of absolute copy-number analysis and gene expression in The Cancer Genome Atlas complemented by single-cell analysis of six independent tumors. Tumor purity, ploidy, and subclonality were reliably inferred from different genomic platforms, and these characteristics displayed marked differences between subtypes. Genomic lesions associated with HGSOC subtypes tended to be subclonal, implying subtype divergence at later stages of tumor evolution. Subclonality of recurrent HGSOC alterations was evident for proliferative tumors, characterized by extreme genomic instability, absence of immune infiltration, and greater patient age. In contrast, differentiated tumors were characterized by largely intact genome integrity, high immune infiltration, and younger patient age. Single-cell sequencing of 42,000 tumor cells revealed widespread heterogeneity in tumor cell type composition that drove bulk subtypes but demonstrated a lack of intrinsic subtypes among tumor epithelial cells. Our findings prompt the dismissal of discrete transcriptome subtypes for HGSOC and replacement by a more realistic model of continuous tumor development that includes mixtures of subclones, accumulation of somatic aberrations, infiltration of immune and stromal cells in proportions correlated with tumor stage and tissue of origin, and evolution between properties previously associated with discrete subtypes. 
SIGNIFICANCE: This study infers whether transcriptome-based groupings of tumors differentiate early in carcinogenesis and are, therefore, appropriate targets for therapy and demonstrates that this is not the case for HGSOC.

Cystadenocarcinoma, Serous/genetics , Cystadenocarcinoma, Serous/pathology , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Female , Gene Expression Profiling , Genomic Instability , Humans , Ploidies , Single-Cell Analysis

9.

Global Alliance for Genomics and Health Meets Bioconductor: Toward Reproducible and Agile Cancer Genomics at Cloud Scale.

Carey, Vincent J; Ramos, Marcel; Stubbs, Benjamin J; Gopaulakrishnan, Shweta; Oh, Sehyun; Turaga, Nitesh; Waldron, Levi; Morgan, Martin.

JCO Clin Cancer Inform ; 4: 472-479, 2020 05.

Article En | MEDLINE | ID: mdl-32453635

PURPOSE: Institutional efforts toward the democratization of cloud-scale data and analysis methods for cancer genomics are proceeding rapidly. As part of this effort, we bridge two major bioinformatic initiatives: the Global Alliance for Genomics and Health (GA4GH) and Bioconductor. METHODS: We describe in detail a use case in pancancer transcriptomics conducted by blending implementations of the GA4GH Workflow Execution Services and Tool Registry Service concepts with the Bioconductor curatedTCGAData and BiocOncoTK packages. RESULTS: We carried out the analysis with a formally archived workflow and container at dockstore.org and a workspace and notebook at app.terra.bio. The analysis identified relationships between microsatellite instability and biomarkers of immune dysregulation at a finer level of granularity than previously reported. Our use of standard approaches to containerization and workflow programming allows this analysis to be replicated and extended. CONCLUSION: Experimental use of dockstore.org and app.terra.bio in concert with Bioconductor enabled novel statistical analysis of large genomic projects without the need for local supercomputing resources but involved challenges related to container design, script archiving, and unit testing. Best practices and cost/benefit metrics for the management and analysis of globally federated genomic data and annotation are evolving. The creation and execution of use cases like the one reported here will be helpful in the development and comparison of approaches to federated data/analysis systems in cancer genomics.

Neoplasms , Software , Computational Biology , Genomics , Humans , Neoplasms/genetics , Workflow

10.

Reliable Analysis of Clinical Tumor-Only Whole-Exome Sequencing Data.

Oh, Sehyun; Geistlinger, Ludwig; Ramos, Marcel; Morgan, Martin; Waldron, Levi; Riester, Markus.

JCO Clin Cancer Inform ; 4: 321-335, 2020 04.

Article En | MEDLINE | ID: mdl-32282230

PURPOSE: Allele-specific copy number alteration (CNA) analysis is essential to study the functional impact of single-nucleotide variants (SNVs) and the process of tumorigenesis. However, controversy over whether it can be performed with sufficient accuracy in data without matched normal profiles and a lack of open-source implementations have limited its application in clinical research and diagnosis. METHODS: We benchmark allele-specific CNA analysis performance of whole-exome sequencing (WES) data against gold standard whole-genome SNP6 microarray data and against WES data sets with matched normal samples. We provide a workflow based on the open-source PureCN R/Bioconductor package in conjunction with widely used variant-calling and copy number segmentation algorithms for allele-specific CNA analysis from WES without matched normals. This workflow further classifies SNVs by somatic status and then uses this information to infer somatic mutational signatures and tumor mutational burden (TMB). RESULTS: Application of our workflow to tumor-only WES data produces tumor purity and ploidy estimates that are highly concordant with estimates from SNP6 microarray data and matched normal WES data. The presence of cancer type-specific somatic mutational signatures was inferred with high accuracy. We also demonstrate high concordance of TMB between our tumor-only workflow and matched normal pipelines. CONCLUSION: The proposed workflow provides, to our knowledge, the only open-source option with demonstrated high accuracy for comprehensive allele-specific CNA analysis and SNV classification of tumor-only WES. An implementation of the workflow is available on the Terra Cloud platform of the Broad Institute (Cambridge, MA).

Algorithms , Biomarkers, Tumor/genetics , DNA Copy Number Variations , Exome Sequencing/methods , Exome , Mutation , Neoplasms/genetics , Gene Expression Regulation, Neoplastic , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/pathology , Neoplasms/therapy

11.

HGNChelper: identification and correction of invalid gene symbols for human and mouse.

Oh, Sehyun; Abdelnabi, Jasmine; Al-Dulaimi, Ragheed; Aggarwal, Ayush; Ramos, Marcel; Davis, Sean; Riester, Markus; Waldron, Levi.

F1000Res ; 9: 1493, 2020.

Article En | MEDLINE | ID: mdl-33564398

Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion by spreadsheets to date format. Official gene symbol resources such as HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and correction of symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN.
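
As an illustration of the spreadsheet problem the HGNChelper abstract describes, the sketch below maps date-like strings such as "1-Sep" back to candidate legacy gene symbols. It is not HGNChelper's implementation or interface: the tiny lookup table, the regular expression, and the function name are assumptions made for this example, and a real correction table would be derived from the HGNC and MGI resources.

```python
import re

# Illustrative lookup only: month abbreviations a spreadsheet may produce,
# mapped to legacy symbol prefixes that could have been converted.
MONTH_TO_PREFIXES = {
    "Mar": ["MARCH", "MARC"],
    "Sep": ["SEPT"],
    "Dec": ["DEC"],
}

# Matches strings like "1-Sep" or "10-Mar".
DATE_LIKE = re.compile(r"^(\d{1,2})-([A-Z][a-z]{2})$")


def candidate_symbols(value):
    """Return possible original symbols for a date-like string.

    An empty list means the value does not look like a converted symbol.
    More than one candidate means the conversion is ambiguous.
    """
    match = DATE_LIKE.match(value)
    if not match:
        return []
    day, month = match.groups()
    return [f"{prefix}{int(day)}" for prefix in MONTH_TO_PREFIXES.get(month, [])]


print(candidate_symbols("1-Sep"))  # one candidate
print(candidate_symbols("1-Mar"))  # ambiguous: two candidates
print(candidate_symbols("TP53"))   # not date-like
```

Returning every candidate, rather than picking one, reflects that some conversions cannot be undone unambiguously, which is why the abstract says corrections are provided "where possible".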

12.

CNVRanger: association analysis of CNVs with gene expression and quantitative phenotypes.

da Silva, Vinicius; Ramos, Marcel; Groenen, Martien; Crooijmans, Richard; Johansson, Anna; Regitano, Luciana; Coutinho, Luiz; Zimmer, Ralf; Waldron, Levi; Geistlinger, Ludwig.

Bioinformatics ; 36(3): 972-973, 2020 02 01.

Article En | MEDLINE | ID: mdl-31392308

SUMMARY: Copy number variation (CNV) is a major type of structural genomic variation that is increasingly studied across different species for association with diseases and production traits. Established protocols for experimental detection and computational inference of CNVs from SNP array and next-generation sequencing data are available. We present the CNVRanger R/Bioconductor package which implements a comprehensive toolbox for structured downstream analysis of CNVs. This includes functionality for summarizing individual CNV calls across a population, assessing overlap with functional genomic regions, and genome-wide association analysis with gene expression and quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION: http://bioconductor.org/packages/CNVRanger.

DNA Copy Number Variations , Genome-Wide Association Study , Computational Biology , Phenotype , Polymorphism, Single Nucleotide

13.

In search for the sources of plastic marine litter that contaminates the Easter Island Ecoregion.

Gennip, Simon Jan van; Dewitte, Boris; Garçon, Véronique; Thiel, Martin; Popova, Ekaterina; Drillet, Yann; Ramos, Marcel; Yannicelli, Beatriz; Bravo, Luis; Ory, Nicolas; Luna-Jorquera, Guillermo; Gaymer, Carlos F.

Sci Rep ; 9(1): 19662, 2019 12 23.

Article En | MEDLINE | ID: mdl-31873122

Subtropical gyres are the oceanic regions where plastic litter accumulates over long timescales, exposing surrounding oceanic islands to plastic contamination, with potentially severe consequences on marine life. Islands' exposure to such contaminants, littered over long distances in marine or terrestrial habitats, is due to the ocean currents that can transport plastic over long ranges. Here, this issue is addressed for the Easter Island ecoregion (EIE). High-resolution ocean circulation models are used with a Lagrangian particle-tracking tool to identify the connectivity patterns of the EIE with industrial fishing areas and coastline regions of the Pacific basin. Connectivity patterns for "virtual" particles either floating (such as buoyant macroplastics) or neutrally-buoyant (smaller microplastics) are investigated. We find that the South American shoreline between 20°S and 40°S, and the fishing zone within international waters off Peru (20°S, 80°W) are associated with the highest probability for debris to reach the EIE, with transit times under 2 years. These regions coincide with the most-densely populated coastal region of Chile and the most-intensely fished region in the South Pacific. The findings offer potential for mitigating plastic contamination reaching the EIE through better upstream waste management. Results also highlight the need for international action plans on this important issue.

14.

A longitudinal analysis of albendazole treatment effect on neurocysticercosis cyst evolution using multistate models.

Montgomery, Michelle A; Ramos, Marcel; Kelvin, Elizabeth A; Carpio, Arturo; Jaramillo, Alexander; Hauser, W Allen; Zhang, Hongbin.

Trans R Soc Trop Med Hyg ; 113(12): 781-788, 2019 12 01.

Article En | MEDLINE | ID: mdl-31433058

BACKGROUND: In neurocysticercosis, the larval form of the pork tapeworm Taenia solium appears to evolve through three phases (active, degenerative and sometimes calcification) before disappearance. The antihelmintic drug, albendazole, has been shown to hasten the resolution of active cysts in neurocysticercosis. Little is known about the time cysts take to progress through each phase, with or without treatment. METHODS: We reconfigured brain imaging data from patient level to cyst level for 117 patients in a randomized clinical trial of albendazole in which images were taken at baseline, 1, 6, 12 and 24 mo. Applying a multistate model, we modelled the hazard of a cyst evolving to subsequent cyst phases before the next imaging (vs no change). We examined the impact of albendazole treatment overall and by patient and cyst characteristics on the hazard. RESULTS: Albendazole accelerated the evolution from the active to degenerative phase (HR=2.7, 95% CI 1.3 to 6.5) and from the degenerative phase to disappearance (HR=1.9, 95% CI 1.1 to 3.9). Albendazole's impact was stronger for patients who were male, did not have calcified cysts at baseline and who had multiple cysts in different locations. CONCLUSIONS: This research provides a better understanding of where in the cyst trajectory albendazole has the greatest impact.

Albendazole/therapeutic use , Anticestodal Agents/therapeutic use , Neurocysticercosis/drug therapy , Taenia solium/drug effects , Adult , Animals , Disease Progression , Female , Humans , Longitudinal Studies , Male , Models, Statistical , Neurocysticercosis/diagnostic imaging , Neurocysticercosis/pathology , Neuroimaging , Time Factors
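
The methods in the entry above reshape imaging data to one record per cyst and then model transitions between phases at consecutive imaging visits. The sketch below shows only the first, descriptive step on synthetic data: tallying observed phase-to-phase transitions. It does not fit the multistate hazard model, and the identifiers, phase labels, and histories are invented for illustration.

```python
from collections import Counter


def count_transitions(cyst_histories):
    """Tally (phase at visit k, phase at visit k+1) pairs over all cysts."""
    counts = Counter()
    for phases in cyst_histories.values():
        for current, following in zip(phases, phases[1:]):
            counts[(current, following)] += 1
    return counts


# Synthetic cyst-level histories: one phase per imaging visit
# (five visits, mirroring baseline, 1, 6, 12 and 24 months).
histories = {
    "patient1_cyst1": ["active", "active", "degenerative", "disappeared", "disappeared"],
    "patient1_cyst2": ["active", "degenerative", "calcified", "calcified", "calcified"],
    "patient2_cyst1": ["degenerative", "degenerative", "disappeared", "disappeared", "disappeared"],
}
counts = count_transitions(histories)

for (current, following), n in sorted(counts.items()):
    print(f"{current} -> {following}: {n}")
```

A table like this is a common sanity check before fitting a multistate model, since transitions that are never observed cannot be estimated.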

15.

Waldron et al. Reply to "Commentary on the HMP16SData Bioconductor Package".

Waldron, Levi; Schiffer, Lucas; Azhar, Rimsha; Ramos, Marcel; Geistlinger, Ludwig; Segata, Nicola.

Am J Epidemiol ; 188(6): 1031-1032, 2019 06 01.

Article En | MEDLINE | ID: mdl-30689687

Microbiota , Software , Genomics , Humans

16.

HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.

Schiffer, Lucas; Azhar, Rimsha; Shepherd, Lori; Ramos, Marcel; Geistlinger, Ludwig; Huttenhower, Curtis; Dowd, Jennifer B; Segata, Nicola; Waldron, Levi.

Am J Epidemiol ; 188(6): 1023-1026, 2019 06 01.

Article En | MEDLINE | ID: mdl-30649166

Phase 1 of the Human Microbiome Project (HMP) investigated 18 body subsites of 242 healthy American adults to produce the first comprehensive reference for the composition and variation of the "healthy" human microbiome. Publicly available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis. The HMP16SData package provides count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. By removing bioinformatic hurdles of data access and management, HMP16SData enables epidemiologists with only basic R skills to quickly analyze HMP data.

Databases, Genetic/statistics & numerical data , Microbiota/physiology , RNA, Ribosomal, 16S/metabolism , Adolescent , Adult , Computational Biology , Female , Humans , Male , Young Adult

17.

Orchestrating a community-developed computational workshop and accompanying training materials.

Davis, Sean; Ramos, Marcel; Shepherd, Lori; Turaga, Nitesh; Geistlinger, Ludwig; Morgan, Martin T; Haibe-Kains, Benjamin; Waldron, Levi.

F1000Res ; 7: 1656, 2018.

Article En | MEDLINE | ID: mdl-30473781

The importance of bioinformatics, computational biology, and data science in biomedical research continues to grow, driving a need for effective instruction and education. A workshop setting, with lectures and guided hands-on tutorials, is a common approach to teaching practical computational and analytical methods. Here, we detail the process we used to produce high-quality, community-authored educational materials that are available for public consumption and reuse. The coordinated efforts of 17 authors over 10 weeks resulted in 15 workshops available as a website and as a 388-page electronic book. We describe how we utilized cloud infrastructure, GitHub, and a literate programming approach to robustly deliver hands-on tutorials to participants of the annual Bioconductor conference. The scripts, raw and published workshop materials, and cloud machine image are all openly available. Our approach uses free services and software and can be adapted by workshop organizers and authors in other contexts with appropriate technical backgrounds.

Computational Biology , Education

18.

Accessible, curated metagenomic data through ExperimentHub.

Pasolli, Edoardo; Schiffer, Lucas; Manghi, Paolo; Renson, Audrey; Obenchain, Valerie; Truong, Duy Tin; Beghini, Francesco; Malik, Faizan; Ramos, Marcel; Dowd, Jennifer B; Huttenhower, Curtis; Morgan, Martin; Segata, Nicola; Waldron, Levi.

Nat Methods ; 14(11): 1023-1024, 2017 10 31.

Article En | MEDLINE | ID: mdl-29088129

Computational Biology/methods , Metagenomics/methods , Microbiota/genetics , Software , Gastrointestinal Microbiome/genetics , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Genome, Fungal/genetics , Genome, Human/genetics , Humans , Species Specificity

19.

Software for the Integration of Multiomics Experiments in Bioconductor.

Ramos, Marcel; Schiffer, Lucas; Re, Angela; Azhar, Rimsha; Basunia, Azfar; Rodriguez, Carmen; Chan, Tiffany; Chapman, Phil; Davis, Sean R; Gomez-Cabrero, David; Culhane, Aedin C; Haibe-Kains, Benjamin; Hansen, Kasper D; Kodali, Hanish; Louis, Marie S; Mer, Arvind S; Riester, Markus; Morgan, Martin; Carey, Vince; Waldron, Levi.

Cancer Res ; 77(21): e39-e42, 2017 11 01.

Article En | MEDLINE | ID: mdl-29092936

Multiomics experiments are increasingly commonplace in biomedical research and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multiomics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide the unrestricted multiple 'omics data for each cancer tissue in The Cancer Genome Atlas as ready-to-analyze MultiAssayExperiment objects and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets.

Genomics , Neoplasms/genetics , Software , Computational Biology , Datasets as Topic , Genome, Human , Humans

20.

Racial and Ethnic Subgroup Disparities in Hypertension Prevalence, New York City Health and Nutrition Examination Survey, 2013-2014.

Fei, Kezhen; Rodriguez-Lopez, Jesica S; Ramos, Marcel; Islam, Nadia; Trinh-Shevrin, Chau; Yi, Stella S; Chernov, Claudia; Perlman, Sharon E; Thorpe, Lorna E.

Prev Chronic Dis ; 14: E33, 2017 04 20.

Article En | MEDLINE | ID: mdl-28427484

INTRODUCTION: Racial/ethnic minority adults have higher rates of hypertension than non-Hispanic white adults. We examined the prevalence of hypertension among Hispanic and Asian subgroups in New York City. METHODS: Data from the 2013-2014 New York City Health and Nutrition Examination Survey were used to assess hypertension prevalence among adults (aged ≥20) in New York City (n = 1,476). Hypertension was measured (systolic blood pressure ≥140 mm Hg or diastolic blood pressure ≥90 mm Hg or self-reported hypertension and use of blood pressure medication). Participants self-reported race/ethnicity and country of origin. Multivariable logistic regression models assessed differences in prevalence by race/ethnicity and sociodemographic and health-related characteristics. RESULTS: Overall hypertension prevalence among adults in New York City was 33.9% (43.5% for non-Hispanic blacks, 38.0% for Asians, 33.0% for Hispanics, and 27.5% for non-Hispanic whites). Among Hispanic adults, prevalence was 39.4% for Dominican, 34.2% for Puerto Rican, and 27.5% for Central/South American adults. Among Asian adults, prevalence was 43.0% for South Asian and 39.9% for East/Southeast Asian adults. Adjusting for age, sex, education, and body mass index, 2 major racial/ethnic minority groups had higher odds of hypertension than non-Hispanic whites: non-Hispanic black (AOR [adjusted odds ratio], 2.6; 95% confidence interval [CI], 1.7-3.9) and Asian (AOR, 2.0; 95% CI, 1.2-3.4) adults. Two subgroups had greater odds of hypertension than the non-Hispanic white group: East/Southeast Asian adults (AOR, 2.8; 95% CI, 1.6-4.9) and Dominican adults (AOR, 1.9; 95% CI, 1.1-3.5). CONCLUSION: Racial/ethnic minority subgroups vary in hypertension prevalence, suggesting the need for targeted interventions.

Ethnicity , Hypertension/ethnology , Hypertension/epidemiology , Racial Groups , Adult , Body Mass Index , Female , Humans , Male , Middle Aged , New York City/epidemiology , Prevalence , Risk Factors
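
The hypertension definition in the entry above can be written as a small predicate. The sketch assumes one reading of the abstract's wording: elevated systolic pressure, or elevated diastolic pressure, or the combination of self-reported hypertension with current use of blood pressure medication. That grouping, the function name, and the parameter names are assumptions for illustration, not the survey's analysis code.

```python
def has_hypertension(systolic, diastolic, self_reported, on_medication):
    """Classify one participant under the assumed reading of the definition.

    systolic, diastolic: measured blood pressure in mm Hg.
    self_reported: participant reports a hypertension diagnosis.
    on_medication: participant reports using blood pressure medication.
    """
    measured_high = systolic >= 140 or diastolic >= 90
    treated = self_reported and on_medication
    return bool(measured_high or treated)


# Synthetic participants.
print(has_hypertension(150, 80, False, False))  # elevated systolic
print(has_hypertension(120, 95, False, False))  # elevated diastolic
print(has_hypertension(125, 80, True, True))    # controlled on medication
print(has_hypertension(125, 80, True, False))   # self-report alone
```

Counting treated participants with normal measured pressure avoids classifying well-controlled hypertension as absent, which would understate prevalence.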