Results 1 - 20 of 62
1.
Nat Methods ; 21(7): 1166-1170, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38877315

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.


Subject(s)
Software , Humans , Computational Biology/methods , Leukocytes, Mononuclear/metabolism , Leukocytes, Mononuclear/cytology , Genomics/methods , Data Analysis
2.
Bioinformatics ; 40(8)2024 08 02.
Article in English | MEDLINE | ID: mdl-39067017

ABSTRACT

MOTIVATION: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation. RESULTS: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We found that although developers realize the value of more extensive metric collection, they are hindered by a lack of funding and time. We also investigated software among this community for how often infrastructure that supports more nontraditional metrics was implemented and how this impacted rates of papers describing usage of the software. We found that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seemed to be associated with increased mention rates. Analysing more diverse metrics can enable developers to better understand user engagement, justify continued funding, identify novel use cases, pinpoint improvement areas, and ultimately amplify their software's impact. Associated challenges include distorted or misleading metrics, as well as ethical and security concerns. More attention to nuances involved in capturing impact across the spectrum of biomedical software is needed. For funders and developers, we outline guidance based on experience from our community. By considering how we evaluate software, we can empower developers to create tools that more effectively accelerate biological and medical research progress. AVAILABILITY AND IMPLEMENTATION: More information about the analysis, as well as access to data and code, is available at https://github.com/fhdsl/ITCR_Metrics_manuscript_website.


Subject(s)
Biomedical Research , Software , Biomedical Research/methods , Humans , United States , Computational Biology/methods
3.
BMC Bioinformatics ; 25(1): 8, 2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38172657

ABSTRACT

BACKGROUND: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets; however, such tools are most useful to users who work with specific types of data or who are technically inclined toward a particular programming language. Currently, there is no R-specific solution for efficient data management and versatile data reuse. RESULTS: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, making them portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks. CONCLUSIONS: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor (https://bioconductor.org/packages/ReUseData/) with additional information on the project website (https://rcwl.org/dataRecipes/).


Subject(s)
Data Management , Genomics , Software , Programming Languages , Workflow
4.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37208161

ABSTRACT

SUMMARY: The RaggedExperiment R/Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, in conjunction with efficient and flexible calculations of rectangular-shaped summaries for downstream analysis. Applications include statistical analysis of somatic mutations, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts. MOTIVATION AND RESULTS: Measurements of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produce "ragged" genomic ranges data, i.e. data that span different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.


Subject(s)
Genomics , Neoplasms , Humans , Genome , Software , Mutation , Neoplasms/genetics
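
The reshaping idea in the abstract above, turning per-sample "ragged" ranges into a rectangular table, can be illustrated with a minimal Python sketch. This is not the RaggedExperiment API; the function name `summarize`, the single-chromosome assumption, the closed-interval convention and the choice of `max` as the combining function are all assumptions made for illustration.

```python
def summarize(ragged, regions, combine=max):
    """Build a regions-by-samples table from ragged per-sample ranges.

    ragged:  dict mapping sample name -> list of (start, end, value),
             closed intervals on one chromosome (an assumption of this sketch).
    regions: list of (start, end) query intervals, one table row each.
    Cells with no overlapping range are None.
    """
    samples = sorted(ragged)
    table = []
    for r_start, r_end in regions:
        row = []
        for sample in samples:
            # Collect values of all ranges in this sample that overlap the region.
            values = [v for (s, e, v) in ragged[sample]
                      if s <= r_end and e >= r_start]
            row.append(combine(values) if values else None)
        table.append(row)
    return samples, table


ragged = {
    "s1": [(1, 100, 2.0), (150, 200, 3.0)],
    "s2": [(50, 120, 1.0)],
}
samples, table = summarize(ragged, [(90, 160), (180, 190)])
```

The ragged input is kept unchanged, so the original per-sample ranges are not lost; the table is only a derived view whose shape depends on the chosen query regions and combining function.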
5.
PLoS Comput Biol ; 19(8): e1011324, 2023 08.
Article in English | MEDLINE | ID: mdl-37624866

ABSTRACT

BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.


Subject(s)
Ecosystem , Proteomics , Cell Differentiation , Computational Biology , Epigenomics
7.
Nat Methods ; 17(2): 137-145, 2020 02.
Article in English | MEDLINE | ID: mdl-31792435

ABSTRACT

Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.


Subject(s)
Single-Cell Analysis/methods , Gene Expression Profiling , Genome , High-Throughput Nucleotide Sequencing , Software
8.
Brief Bioinform ; 22(1): 545-556, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32026945

ABSTRACT

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.


Subject(s)
Gene Expression Profiling/methods , Genomics/methods , RNA-Seq/methods , Animals , Benchmarking , Databases, Genetic/standards , Gene Expression Profiling/standards , Genomics/standards , Humans , RNA-Seq/standards , Software
9.
Bioinformatics ; 37(19): 3351-3352, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33772584

ABSTRACT

SUMMARY: The Common Workflow Language (CWL) is used to provide portable and reproducible data analysis workflows across different tools and computing environments. We have developed Rcwl, an R interface to CWL, to provide easier development, use and maintenance of CWL pipelines from within R. We have also collected more than 100 pre-built tools and pipelines in RcwlPipelines, ready to be queried and used by researchers in their own analysis. A single-cell RNA sequencing preprocessing pipeline demonstrates use of the software. AVAILABILITY AND IMPLEMENTATION: Project website: https://rcwl.org (Rcwl: https://bioconductor.org/packages/Rcwl; RcwlPipelines: https://bioconductor.org/packages/RcwlPipelines). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

10.
PLoS Comput Biol ; 16(2): e1007664, 2020 02.
Article in English | MEDLINE | ID: mdl-32097405

ABSTRACT

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.


Subject(s)
Computational Biology/methods , Gene Expression Profiling , RNA-Seq , Algorithms , Animals , Drosophila melanogaster , Genomics , Humans , Mice , Models, Statistical , Pattern Recognition, Automated , Programming Languages , Reproducibility of Results , Software , Transcriptome
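
The checksum-based lookup described in the abstract above can be sketched in a few lines of Python. This is a hedged illustration, not tximeta's implementation: the abstract does not state the hash function or what exactly is hashed, so SHA-256 over the ordered transcript sequences, the names `sequence_digest`, `build_registry` and `identify_reference`, and the metadata fields are all assumptions.

```python
import hashlib


def sequence_digest(sequences):
    """Return a SHA-256 hex digest over an ordered list of sequences.

    Assumption: sequences are hashed in order as plain ASCII. Sequence
    boundaries are not encoded, which a real scheme may handle differently.
    """
    h = hashlib.sha256()
    for seq in sequences:
        h.update(seq.encode("ascii"))
    return h.hexdigest()


def build_registry(references):
    """Map digest -> metadata for known references.

    references: iterable of (metadata_dict, sequences) pairs (hypothetical).
    """
    return {sequence_digest(seqs): meta for meta, seqs in references}


def identify_reference(digest, registry):
    """Return the metadata for a digest, or None if the reference is unknown."""
    return registry.get(digest)


known = [({"source": "ExampleDB", "release": "1"}, ["ACGT", "GGCC"])]
registry = build_registry(known)
match = identify_reference(sequence_digest(["ACGT", "GGCC"]), registry)
miss = identify_reference(sequence_digest(["ACGT"]), registry)
```

Because the digest is derived from the reference sequences themselves, a quantification file that carries it can be matched to its annotation without relying on file names or user-supplied labels.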
11.
J Med Internet Res ; 23(12): e20028, 2021 12 02.
Article in English | MEDLINE | ID: mdl-34860667

ABSTRACT

BACKGROUND: The National Cancer Institute Informatics Technology for Cancer Research (ITCR) program provides a series of funding mechanisms to create an ecosystem of open-source software (OSS) that serves the needs of cancer research. As the ITCR ecosystem substantially grows, it faces the challenge of the long-term sustainability of the software being developed by ITCR grantees. To address this challenge, the ITCR sustainability and industry partnership working group (SIP-WG) was convened in 2019. OBJECTIVE: The charter of the SIP-WG is to investigate options to enhance the long-term sustainability of the OSS being developed by ITCR, in part by developing a collection of business model archetypes that can serve as sustainability plans for ITCR OSS development initiatives. The working group assembled models from the ITCR program, from other studies, and from the engagement of its extensive network of relationships with other organizations (eg, Chan Zuckerberg Initiative, Open Source Initiative, and Software Sustainability Institute) in support of this objective. METHODS: This paper reviews the existing sustainability models and describes 10 OSS use cases disseminated by the SIP-WG and others, including 3D Slicer, Bioconductor, Cytoscape, Globus, i2b2 (Informatics for Integrating Biology and the Bedside) and tranSMART, Insight Toolkit, Linux, Observational Health Data Sciences and Informatics tools, R, and REDCap (Research Electronic Data Capture), in 10 sustainability aspects: governance, documentation, code quality, support, ecosystem collaboration, security, legal, finance, marketing, and dependency hygiene. RESULTS: Information available to the public reveals that all 10 OSS have effective governance, comprehensive documentation, high code quality, reliable dependency hygiene, strong user and developer support, and active marketing. 
These OSS include a variety of licensing models (eg, general public license version 2, general public license version 3, Berkeley Software Distribution, and Apache 3) and financial models (eg, federal research funding, industry and membership support, and commercial support). However, detailed information on ecosystem collaboration and security is not publicly provided by most OSS. CONCLUSIONS: We recommend 6 essential attributes for research software: alignment with unmet scientific needs, a dedicated development team, a vibrant user community, a feasible licensing model, a sustainable financial model, and effective product management. We also stress important actions to be considered in future ITCR activities that involve the discussion of the sustainability and licensing models for ITCR OSS, the establishment of a central library, the allocation of consulting resources to code quality control, ecosystem collaboration, security, and dependency hygiene.


Subject(s)
Ecosystem , Neoplasms , Humans , Informatics , Neoplasms/therapy , Research , Software , Technology
12.
Bioinformatics ; 35(11): 1968-1970, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30395168

ABSTRACT

SUMMARY: To address the limited software options for performing survival analyses with millions of SNPs, we developed gwasurvivr, an R/Bioconductor package with a simple interface for conducting genome-wide survival analyses using VCF (outputted from Michigan or Sanger imputation servers), IMPUTE2 or PLINK files. To decrease the number of iterations needed for convergence when optimizing the parameter estimates in the Cox model, we modified the R package survival; covariates in the model are first fit without the SNP, and those parameter estimates are used as initial points. We benchmarked gwasurvivr with other software capable of conducting genome-wide survival analysis (genipe, SurvivalGWAS_SV and GWASTools). gwasurvivr is significantly faster and shows better scalability as sample size, number of SNPs and number of covariates increase. AVAILABILITY AND IMPLEMENTATION: gwasurvivr, including source code, documentation and vignette, is available at: http://bioconductor.org/packages/gwasurvivr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Software , Polymorphism, Single Nucleotide , Survival Analysis
13.
Nat Methods ; 12(2): 115-21, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25633503

ABSTRACT

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.


Subject(s)
Computational Biology , Gene Expression Profiling , Genomics/methods , High-Throughput Screening Assays/methods , Software , Programming Languages , User-Computer Interface
14.
Brief Bioinform ; 17(4): 603-15, 2016 07.
Article in English | MEDLINE | ID: mdl-26463000

ABSTRACT

Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays, has the potential to provide a systems level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in the areas of clinical oncology, pharmacogenomics and other perturbation experiments, population genomics and regulatory genomics and other areas, and tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.


Subject(s)
Genomics , Computational Biology , MicroRNAs , Sequence Analysis, DNA
16.
Bioinformatics ; 30(14): 2076-8, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24681907

ABSTRACT

SUMMARY: VariantAnnotation is an R/Bioconductor package for the exploration and annotation of genetic variants. Capabilities exist for reading, writing and filtering variant call format (VCF) files. VariantAnnotation allows ready access to additional R/Bioconductor facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources. AVAILABILITY AND IMPLEMENTATION: This package is implemented in R and available for download at the Bioconductor Web site (http://bioconductor.org/packages/2.13/bioc/html/VariantAnnotation.html). The package contains extensive help pages for individual functions and a 'vignette' outlining typical workflows; it is made available under the open source 'Artistic-2.0' license. Version 1.9.38 was used in this article.


Subject(s)
Genetic Variation , Molecular Sequence Annotation , Software , Genomics
17.
Sex Transm Dis ; 42(9): 475-481, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26267872

ABSTRACT

BACKGROUND: Cervicitis is an inflammatory condition of the cervix associated with upper genital tract infection and reproductive complications. Although cervicitis can be caused by several known pathogens, the etiology frequently remains obscure. Here we investigate vaginal bacteria associated with bacterial vaginosis as potential causes of cervicitis. METHODS: Associations between vaginal bacteria and cervicitis were assessed in a retrospective case-control study of women attending a Seattle sexually transmitted disease clinic. Individual bacterial species were detected using 2 molecular methods: quantitative polymerase chain reaction (qPCR) and broad-range 16S rRNA gene PCR with pyrosequencing. The primary finding from this initial study was evaluated using qPCR in a second cohort of Kenyan women. RESULTS: The presence of Mageeibacillus indolicus, formerly BVAB3, in the cervix was associated with cervicitis, whereas the presence of Lactobacillus jensenii was inversely associated. Quantities of these bacteria did not differ between cervicitis cases and controls, although in a model inclusive of presence and abundance, M. indolicus remained significantly associated with cervicitis after adjustment for other cervicitis-causing pathogens. M. indolicus was not associated with cervicitis in our study of Kenyan women, possibly due to differences in the clinical definition of cervicitis. CONCLUSIONS: Colonization of the endocervix with M. indolicus may contribute to the clinical manifestations of cervicitis, but further study is needed to determine whether this finding is repeatable and applicable to diverse groups of women. Colonization of the cervix with L. jensenii could be a marker of health, perhaps reducing inflammation or inhibiting pathogenic infection.


Subject(s)
Cervix Uteri/microbiology , Microbiota , Uterine Cervicitis/microbiology , Vagina/microbiology , Adolescent , Adult , Case-Control Studies , Female , Humans , Lactobacillus/isolation & purification , Middle Aged , Real-Time Polymerase Chain Reaction , Retrospective Studies , Young Adult
18.
Stat Sci ; 29(2): 214-226, 2014 May.
Article in English | MEDLINE | ID: mdl-28018047

ABSTRACT

This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.

19.
PLoS Comput Biol ; 9(8): e1003118, 2013.
Article in English | MEDLINE | ID: mdl-23950696

ABSTRACT

We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.


Subject(s)
Databases, Genetic , Genomics/methods , Software , Algorithms , Animals , Genomics/standards , Humans , Mice , Sequence Alignment , Sequence Analysis, DNA
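
Overlap detection between two sets of genomic ranges, one of the range operations named in the abstract above, can be illustrated with a small Python sketch. It is not the IRanges or GenomicRanges algorithm: the function name `find_overlaps`, the closed-interval convention and the sort-plus-binary-search strategy are assumptions chosen for brevity, and production implementations rely on more specialised index structures.

```python
from bisect import bisect_right


def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs of overlapping closed intervals.

    query, subject: lists of (start, end) on the same sequence.
    """
    # Sort subject indices by start so a binary search can discard
    # every subject that begins after the query ends.
    order = sorted(range(len(subject)), key=lambda i: subject[i][0])
    starts = [subject[i][0] for i in order]
    hits = []
    for qi, (q_start, q_end) in enumerate(query):
        upper = bisect_right(starts, q_end)
        for k in range(upper):
            si = order[k]
            # Remaining candidates start at or before q_end; they overlap
            # only if they also end at or after q_start.
            if subject[si][1] >= q_start:
                hits.append((qi, si))
    return hits


query = [(1, 5), (10, 12)]
subject = [(4, 8), (6, 9), (12, 15)]
hits = find_overlaps(query, subject)
```

The binary search prunes subjects on one side only, so the inner loop can still scan many candidates when ranges are long; that trade-off is the reason dedicated interval indexes are used for genome-scale data.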
20.
bioRxiv ; 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39386647

ABSTRACT

Summary: The Human BioMolecular Atlas Program (HuBMAP) provides a globally available platform for studying the human body at the cellular level. The HuBMAP Data Portal encompasses a wide range of data resources measured with emerging experimental technologies at spatial resolution. To broaden access to the HuBMAP Data Portal, we introduce an R client called HuBMAPR, available on Bioconductor. It provides an efficient, programmatic interface that enables researchers to discover and retrieve HuBMAP data more easily and quickly. Availability: This package is available on GitHub (https://github.com/christinehou11/HuBMAPR) and has been submitted to Bioconductor.
