Results 1 - 20 of 59
1.
Nat Methods ; 2024 Jun 14.
Article in English | MEDLINE | ID: mdl-38877315

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.

2.
BMC Bioinformatics ; 25(1): 8, 2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38172657

ABSTRACT

BACKGROUND: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets, however, the practical utility of such tools becomes especially beneficial for users who seek to work with specific types of data or are technically inclined toward a particular programming language. Currently, there exists a gap in the availability of an R-specific solution for efficient data management and versatile data reuse. RESULTS: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, enabling them to be easily portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks. CONCLUSIONS: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor ( https://bioconductor.org/packages/ReUseData/ ) with additional information on the project website ( https://rcwl.org/dataRecipes/ ).


Subject(s)
Data Management; Genomics; Software; Programming Languages; Workflow
3.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37208161

ABSTRACT

SUMMARY: The RaggedExperiment R / Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, in conjunction with efficient and flexible calculations of rectangular-shaped summaries for downstream analysis. Applications include statistical analysis of somatic mutations, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts. MOTIVATION AND RESULTS: Measurement of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produce "ragged" genomic ranges data: i.e. across different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.


Subject(s)
Genomics; Neoplasms; Humans; Genome; Software; Mutation; Neoplasms/genetics
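
The reshaping idea in the RaggedExperiment abstract above (ragged per-sample ranges reduced to a rectangular table) can be illustrated with a small sketch. This is not the package's API: the function names, the tuple layout, and the choice of an overlap-weighted mean as the summary are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (chromosome, start, end, value); coordinates are 1-based and inclusive.
# This layout is a hypothetical stand-in for ragged genomic ranges.
Range = Tuple[str, int, int, float]
Query = Tuple[str, int, int]


def overlap_width(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def summarize(samples: Dict[str, List[Range]],
              queries: List[Query]) -> List[List[Optional[float]]]:
    """Return a queries x samples matrix of overlap-weighted mean values.

    Columns follow the sorted sample identifiers. A cell is None when the
    sample has no range overlapping the query.
    """
    sample_ids = sorted(samples)
    matrix = []
    for q_chrom, q_start, q_end in queries:
        row = []
        for sid in sample_ids:
            total, width = 0.0, 0
            for chrom, start, end, value in samples[sid]:
                if chrom != q_chrom:
                    continue
                w = overlap_width(start, end, q_start, q_end)
                total += w * value
                width += w
            row.append(total / width if width else None)
        matrix.append(row)
    return matrix


# Each sample has its own, differently placed ranges: the "ragged" input.
ragged = {
    "sampleA": [("chr1", 1, 100, 2.0), ("chr1", 101, 200, 4.0)],
    "sampleB": [("chr1", 50, 150, 1.0)],
    "sampleC": [("chr2", 1, 50, 3.0)],
}
genes = [("chr1", 51, 150), ("chr2", 1, 10)]
matrix = summarize(ragged, genes)
```

The ragged input stays untouched, so other summaries (for example, "any overlap" or "maximum value") could be computed from the same data without loss.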
4.
PLoS Comput Biol ; 19(8): e1011324, 2023 08.
Article in English | MEDLINE | ID: mdl-37624866

ABSTRACT

BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.


Subject(s)
Ecosystem; Proteomics; Cell Differentiation; Computational Biology; Epigenomics
5.
Nat Methods ; 17(2): 137-145, 2020 02.
Article in English | MEDLINE | ID: mdl-31792435

ABSTRACT

Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.


Subject(s)
Single-Cell Analysis/methods; Gene Expression Profiling; Genome; High-Throughput Nucleotide Sequencing; Software
7.
Brief Bioinform ; 22(1): 545-556, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32026945

ABSTRACT

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.


Subject(s)
Gene Expression Profiling/methods; Genomics/methods; RNA-Seq/methods; Animals; Benchmarking; Databases, Genetic/standards; Gene Expression Profiling/standards; Genomics/standards; Humans; RNA-Seq/standards; Software
8.
Bioinformatics ; 37(19): 3351-3352, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33772584

ABSTRACT

SUMMARY: The Common Workflow Language (CWL) is used to provide portable and reproducible data analysis workflows across different tools and computing environments. We have developed Rcwl, an R interface to CWL, to provide easier development, use and maintenance of CWL pipelines from within R. We have also collected more than 100 pre-built tools and pipelines in RcwlPipelines, ready to be queried and used by researchers in their own analysis. A single-cell RNA sequencing preprocessing pipeline demonstrates use of the software. AVAILABILITY AND IMPLEMENTATION: Project website: https://rcwl.org (Rcwl: https://bioconductor.org/packages/Rcwl; RcwlPipelines: https://bioconductor.org/packages/RcwlPipelines). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

9.
PLoS Comput Biol ; 16(2): e1007664, 2020 02.
Article in English | MEDLINE | ID: mdl-32097405

ABSTRACT

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.


Subject(s)
Computational Biology/methods; Gene Expression Profiling; RNA-Seq; Algorithms; Animals; Drosophila melanogaster; Genomics; Humans; Mice; Models, Statistical; Pattern Recognition, Automated; Programming Languages; Reproducibility of Results; Software; Transcriptome
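
The checksum-based identification described in the tximeta abstract above can be sketched in a few lines. Everything here is an assumption for illustration: the digest recipe, the `index_digest` key, and the registry contents are hypothetical and do not reproduce the package's actual metadata format.

```python
import hashlib
from typing import Dict, Iterable, Optional


def transcriptome_digest(sequences: Iterable[str]) -> str:
    """SHA-256 over the transcript sequences, in the order given."""
    digest = hashlib.sha256()
    for seq in sequences:
        # Newline separator so ["AC", "GT"] and ["ACGT"] hash differently.
        digest.update(seq.upper().encode("ascii") + b"\n")
    return digest.hexdigest()


def identify_reference(quant_metadata: Dict[str, str],
                       registry: Dict[str, Dict[str, str]]
                       ) -> Optional[Dict[str, str]]:
    """Look up annotation metadata by the checksum stored with the output.

    Returns None when the checksum is not in the registry, in which case
    the reference transcriptome cannot be identified automatically.
    """
    return registry.get(quant_metadata.get("index_digest", ""))


# Hypothetical registry mapping known checksums to provenance metadata.
toy_transcripts = ["ACGTACGT", "GGGTTTAA"]
registry = {
    transcriptome_digest(toy_transcripts): {
        "source": "ExampleDB",
        "organism": "toy organism",
        "release": "1",
    }
}

# The quantification step is assumed to have recorded the digest of the
# reference it was run against.
quant_metadata = {"index_digest": transcriptome_digest(toy_transcripts)}
match = identify_reference(quant_metadata, registry)
unknown = identify_reference({"index_digest": "0" * 64}, registry)
```

Because the lookup key is derived from the sequences themselves, a mislabeled or renamed reference file would still resolve to the same metadata.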
10.
J Med Internet Res ; 23(12): e20028, 2021 12 02.
Article in English | MEDLINE | ID: mdl-34860667

ABSTRACT

BACKGROUND: The National Cancer Institute Informatics Technology for Cancer Research (ITCR) program provides a series of funding mechanisms to create an ecosystem of open-source software (OSS) that serves the needs of cancer research. As the ITCR ecosystem substantially grows, it faces the challenge of the long-term sustainability of the software being developed by ITCR grantees. To address this challenge, the ITCR sustainability and industry partnership working group (SIP-WG) was convened in 2019. OBJECTIVE: The charter of the SIP-WG is to investigate options to enhance the long-term sustainability of the OSS being developed by ITCR, in part by developing a collection of business model archetypes that can serve as sustainability plans for ITCR OSS development initiatives. The working group assembled models from the ITCR program, from other studies, and from the engagement of its extensive network of relationships with other organizations (eg, Chan Zuckerberg Initiative, Open Source Initiative, and Software Sustainability Institute) in support of this objective. METHODS: This paper reviews the existing sustainability models and describes 10 OSS use cases disseminated by the SIP-WG and others, including 3D Slicer, Bioconductor, Cytoscape, Globus, i2b2 (Informatics for Integrating Biology and the Bedside) and tranSMART, Insight Toolkit, Linux, Observational Health Data Sciences and Informatics tools, R, and REDCap (Research Electronic Data Capture), in 10 sustainability aspects: governance, documentation, code quality, support, ecosystem collaboration, security, legal, finance, marketing, and dependency hygiene. RESULTS: Information available to the public reveals that all 10 OSS have effective governance, comprehensive documentation, high code quality, reliable dependency hygiene, strong user and developer support, and active marketing. These OSS include a variety of licensing models (eg, general public license version 2, general public license version 3, Berkeley Software Distribution, and Apache 3) and financial models (eg, federal research funding, industry and membership support, and commercial support). However, detailed information on ecosystem collaboration and security is not publicly provided by most OSS. CONCLUSIONS: We recommend 6 essential attributes for research software: alignment with unmet scientific needs, a dedicated development team, a vibrant user community, a feasible licensing model, a sustainable financial model, and effective product management. We also stress important actions to be considered in future ITCR activities that involve the discussion of the sustainability and licensing models for ITCR OSS, the establishment of a central library, the allocation of consulting resources to code quality control, ecosystem collaboration, security, and dependency hygiene.


Subject(s)
Ecosystem; Neoplasms; Humans; Informatics; Neoplasms/therapy; Research; Software; Technology
11.
Bioinformatics ; 35(11): 1968-1970, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30395168

ABSTRACT

SUMMARY: To address the limited software options for performing survival analyses with millions of SNPs, we developed gwasurvivr, an R/Bioconductor package with a simple interface for conducting genome-wide survival analyses using VCF (outputted from Michigan or Sanger imputation servers), IMPUTE2 or PLINK files. To decrease the number of iterations needed for convergence when optimizing the parameter estimates in the Cox model, we modified the R package survival; covariates in the model are first fit without the SNP, and those parameter estimates are used as initial points. We benchmarked gwasurvivr with other software capable of conducting genome-wide survival analysis (genipe, SurvivalGWAS_SV and GWASTools). gwasurvivr is significantly faster and shows better scalability as sample size, number of SNPs and number of covariates increases. AVAILABILITY AND IMPLEMENTATION: gwasurvivr, including source code, documentation and vignette are available at: http://bioconductor.org/packages/gwasurvivr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome; Software; Polymorphism, Single Nucleotide; Survival Analysis
12.
Nat Methods ; 12(2): 115-21, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25633503

ABSTRACT

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.


Subject(s)
Computational Biology; Gene Expression Profiling; Genomics/methods; High-Throughput Screening Assays/methods; Software; Programming Languages; User-Computer Interface
13.
Brief Bioinform ; 17(4): 603-15, 2016 07.
Article in English | MEDLINE | ID: mdl-26463000

ABSTRACT

Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays, has the potential to provide a systems level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in the areas of clinical oncology, pharmacogenomics and other perturbation experiments, population genomics and regulatory genomics and other areas, and tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.


Subject(s)
Genomics; Computational Biology; MicroRNAs; Sequence Analysis, DNA
15.
Bioinformatics ; 30(14): 2076-8, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24681907

ABSTRACT

UNLABELLED: VariantAnnotation is an R / Bioconductor package for the exploration and annotation of genetic variants. Capabilities exist for reading, writing and filtering variant call format (VCF) files. VariantAnnotation allows ready access to additional R / Bioconductor facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources. AVAILABILITY AND IMPLEMENTATION: This package is implemented in R and available for download at the Bioconductor Web site (http://bioconductor.org/packages/2.13/bioc/html/VariantAnnotation.html). The package contains extensive help pages for individual functions and a 'vignette' outlining typical work flows; it is made available under the open source 'Artistic-2.0' license. Version 1.9.38 was used in this article.


Subject(s)
Genetic Variation; Molecular Sequence Annotation; Software; Genomics
16.
Sex Transm Dis ; 42(9): 475-481, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26267872

ABSTRACT

BACKGROUND: Cervicitis is an inflammatory condition of the cervix associated with upper genital tract infection and reproductive complications. Although cervicitis can be caused by several known pathogens, the etiology frequently remains obscure. Here we investigate vaginal bacteria associated with bacterial vaginosis as potential causes of cervicitis. METHODS: Associations between vaginal bacteria and cervicitis were assessed in a retrospective case-control study of women attending a Seattle sexually transmitted disease clinic. Individual bacterial species were detected using 2 molecular methods: quantitative polymerase chain reaction (qPCR) and broad-range 16S rRNA gene PCR with pyrosequencing. The primary finding from this initial study was evaluated using qPCR in a second cohort of Kenyan women. RESULTS: The presence of Mageeibacillus indolicus, formerly BVAB3, in the cervix was associated with cervicitis, whereas the presence of Lactobacillus jensenii was inversely associated. Quantities of these bacteria did not differ between cervicitis cases and controls, although in a model inclusive of presence and abundance, M. indolicus remained significantly associated with cervicitis after adjustment for other cervicitis-causing pathogens. M. indolicus was not associated with cervicitis in our study of Kenyan women, possibly due to differences in the clinical definition of cervicitis. CONCLUSIONS: Colonization of the endocervix with M. indolicus may contribute to the clinical manifestations of cervicitis, but further study is needed to determine whether this finding is repeatable and applicable to diverse groups of women. Colonization of the cervix with L. jensenii could be a marker of health, perhaps reducing inflammation or inhibiting pathogenic infection.


Subject(s)
Cervix Uteri/microbiology; Microbiota; Uterine Cervicitis/microbiology; Vagina/microbiology; Adolescent; Adult; Case-Control Studies; Female; Humans; Lactobacillus/isolation & purification; Middle Aged; Real-Time Polymerase Chain Reaction; Retrospective Studies; Young Adult
17.
Stat Sci ; 29(2): 214-226, 2014 May.
Article in English | MEDLINE | ID: mdl-28018047

ABSTRACT

This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.
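
Two of the strategies named in the abstract above, restrictive queries and iteration, can be sketched together: records are streamed in fixed-size chunks, each chunk is filtered to a region of interest, and only a small running summary is kept in memory. The record layout, function names, and toy data below are hypothetical; this is a sketch of the general idea, not the paper's implementation.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# (chromosome, position, variant type): a hypothetical minimal variant record.
Variant = Tuple[str, int, str]


def chunks(records: Iterable[Variant], size: int) -> Iterator[List[Variant]]:
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_types(records: Iterable[Variant], chrom: str, start: int, end: int,
                chunk_size: int = 2) -> Counter:
    """Count variant types inside one region, holding one chunk at a time."""
    totals: Counter = Counter()
    for chunk in chunks(records, chunk_size):
        # Restrictive query: discard everything outside the region early.
        selected = [v for v in chunk if v[0] == chrom and start <= v[1] <= end]
        totals.update(vtype for _, _, vtype in selected)
    return totals


def toy_variants() -> Iterator[Variant]:
    """Stand-in for a reader that streams records from a large file."""
    yield ("chr1", 100, "SNV")
    yield ("chr1", 250, "indel")
    yield ("chr2", 120, "SNV")
    yield ("chr1", 300, "SNV")
    yield ("chr1", 900, "SNV")


totals = count_types(toy_variants(), "chr1", 100, 500)
```

Because each chunk is summarized independently, the per-chunk step is also a natural unit to hand to parallel workers, with the counters merged at the end.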

18.
PLoS Comput Biol ; 9(8): e1003118, 2013.
Article in English | MEDLINE | ID: mdl-23950696

ABSTRACT

We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.


Subject(s)
Databases, Genetic; Genomics/methods; Software; Algorithms; Animals; Genomics/standards; Humans; Mice; Sequence Alignment; Sequence Analysis, DNA
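
Two of the range operations named in the abstract above, overlap detection and coverage calculation, can be sketched on plain integer intervals. This is a simplified illustration on a single sequence with hypothetical function names. It does not reproduce the data structures or algorithms of IRanges or GenomicRanges, and the overlap search shown is a simple sorted scan rather than an optimized index.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on one sequence


def find_overlaps(queries: List[Interval],
                  subjects: List[Interval]) -> List[Tuple[int, int]]:
    """Return (query index, subject index) pairs for all overlapping intervals."""
    # Sort subjects by start so a binary search can skip those that begin
    # after the query ends.
    order = sorted(range(len(subjects)), key=lambda i: subjects[i][0])
    starts = [subjects[i][0] for i in order]
    hits = []
    for qi, (q_start, q_end) in enumerate(queries):
        for k in range(bisect_right(starts, q_end)):
            si = order[k]
            if subjects[si][1] >= q_start:
                hits.append((qi, si))
    return hits


def coverage(intervals: List[Interval], length: int) -> List[int]:
    """Depth at positions 1..length, assuming all intervals lie within them."""
    # Difference array: +1 where an interval starts, -1 just past its end.
    diff = [0] * (length + 2)
    for start, end in intervals:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        depth.append(running)
    return depth


reads = [(1, 4), (3, 6), (8, 9)]
genes = [(4, 5), (7, 7)]
hits = find_overlaps(genes, reads)
depth = coverage(reads, 10)
```

Here the first gene overlaps the first two reads and the second gene overlaps none, while `depth` records how many reads cover each of the ten positions.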
19.
Res Sq ; 2024 May 15.
Article in English | MEDLINE | ID: mdl-38798429

ABSTRACT

Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. Here we present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflow simplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflow for three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and accessibility to scalable computing resources without direct management overhead. While some limitations exist around workflow customization, AnVILWorkflow lowers the barrier to taking advantage of AnVIL's resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. This package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available through GitHub (https://github.com/shbrief/AnVILWorkflow).

20.
bioRxiv ; 2024 May 22.
Article in English | MEDLINE | ID: mdl-38826347

ABSTRACT

The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.
