ABSTRACT
MOTIVATION: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation. RESULTS: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We found that although developers realize the value of more extensive metric collection, they are hindered by a lack of funding and time. We also investigated software among this community for how often infrastructure that supports more nontraditional metrics was implemented and how this impacted rates of papers describing usage of the software. We found that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seemed to be associated with increased mention rates. Analyzing more diverse metrics can enable developers to better understand user engagement, justify continued funding, identify novel use cases, pinpoint improvement areas, and ultimately amplify their software's impact. Associated challenges include distorted or misleading metrics, as well as ethical and security concerns. More attention to nuances involved in capturing impact across the spectrum of biomedical software is needed. For funders and developers, we outline guidance based on experience from our community. By considering how we evaluate software, we can empower developers to create tools that more effectively accelerate biological and medical research progress. AVAILABILITY AND IMPLEMENTATION: More information about the analysis, as well as access to data and code, is available at https://github.com/fhdsl/ITCR_Metrics_manuscript_website.
Subjects
Biomedical Research , Software , Biomedical Research/methods , Humans , United States , Computational Biology/methods
ABSTRACT
Microbial biochemistry is central to the pathophysiology of inflammatory bowel diseases (IBD). Improved knowledge of microbial metabolites and their immunomodulatory roles is thus necessary for diagnosis and management. Here, we systematically analyzed the chemical, ecological, and epidemiological properties of ~82k metabolic features in 546 Integrative Human Microbiome Project (iHMP/HMP2) metabolomes, using a newly developed methodology for bioactive compound prioritization from microbial communities. This suggested >1000 metabolic features as potentially bioactive in IBD and associated ~43% of prevalent, unannotated features with at least one well-characterized metabolite, thereby providing initial information for further characterization of a significant portion of the fecal metabolome. Prioritized features included known IBD-linked chemical families such as bile acids and short-chain fatty acids, and less-explored bilirubin, polyamine, and vitamin derivatives, and other microbial products. One of these, nicotinamide riboside, reduced colitis scores in DSS-treated mice. The method, MACARRoN, is generalizable with the potential to improve microbial community characterization and provide therapeutic candidates.
Subjects
Colitis , Inflammatory Bowel Diseases , Humans , Animals , Mice , Inflammatory Bowel Diseases/drug therapy , Inflammatory Bowel Diseases/metabolism , Metabolome , Bile Acids and Salts
ABSTRACT
SUMMARY: The RaggedExperiment R/Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, in conjunction with efficient and flexible calculations of rectangular-shaped summaries for downstream analysis. Applications include statistical analysis of somatic mutations, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts. MOTIVATION AND RESULTS: Measurement of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produces "ragged" genomic ranges data, i.e. data that span different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, presenting informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.
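The reshaping idea described above, going from ragged per-sample ranges to a rectangular table, can be sketched in a few lines. The sketch is written in Python purely for illustration and is not the RaggedExperiment API. The names `overlap_width` and `summarize`, the toy segments, and the choice of an overlap-weighted mean as the summary are all assumptions made here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record types for this sketch (inclusive, 1-based coordinates).
Segment = Tuple[str, int, int, float]  # (chromosome, start, end, value)
Region = Tuple[str, int, int]          # (chromosome, start, end)


def overlap_width(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def summarize(
    samples: Dict[str, List[Segment]], regions: List[Region]
) -> List[List[Optional[float]]]:
    """Build a regions x samples matrix of overlap-weighted mean values.

    Columns follow the sorted sample names. A cell is None when the sample
    has no segment overlapping the region.
    """
    matrix = []
    for chrom, r_start, r_end in regions:
        row: List[Optional[float]] = []
        for name in sorted(samples):
            total_width = 0
            weighted_sum = 0.0
            for s_chrom, s_start, s_end, value in samples[name]:
                if s_chrom != chrom:
                    continue
                width = overlap_width(s_start, s_end, r_start, r_end)
                total_width += width
                weighted_sum += width * value
            row.append(weighted_sum / total_width if total_width else None)
        matrix.append(row)
    return matrix


# Toy ragged data: each sample has segments at different coordinates.
samples = {
    "A": [("chr1", 1, 100, 2.0), ("chr1", 101, 200, 4.0)],
    "B": [("chr1", 51, 150, 1.0)],
}
regions = [("chr1", 1, 200), ("chr2", 1, 50)]
matrix = summarize(samples, regions)
print(matrix)
```

The per-sample lists stay untouched (the "lossless" part), and the rectangular matrix is derived on demand for whichever query regions an analysis needs.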
Subjects
Genomics , Neoplasms , Humans , Genome , Software , Mutation , Neoplasms/genetics
ABSTRACT
BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.
Subjects
Ecosystem , Proteomics , Cell Differentiation , Computational Biology , Epigenomics
ABSTRACT
BACKGROUND: Prospective cohort studies of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) incidence complement case-based surveillance and cross-sectional seroprevalence surveys. METHODS: We estimated the incidence of SARS-CoV-2 infection in a national cohort of 6738 US adults, enrolled in March-August 2020. Using Poisson models, we examined the association of social distancing and a composite epidemiologic risk score with seroconversion. The risk score was created using least absolute shrinkage selection operator (LASSO) regression to identify factors predictive of seroconversion. The selected factors were household crowding, confirmed case in household, indoor dining, gathering with groups of ≥10, and no masking in gyms or salons. RESULTS: Among 4510 individuals with ≥1 serologic test, 323 (7.3% [95% confidence interval (CI), 6.5%-8.1%]) seroconverted by January 2021. Among 3422 participants seronegative in May-September 2020 and retested from November 2020 to January 2021, 161 seroconverted over 1646 person-years of follow-up (9.8 per 100 person-years [95% CI, 8.3-11.5]). The seroincidence rate was lower among women compared with men (incidence rate ratio [IRR], 0.69 [95% CI, .50-.94]) and higher among Hispanic (2.09 [1.41-3.05]) than white non-Hispanic participants. In adjusted models, participants who reported social distancing with people they did not know (IRR for always vs never social distancing, 0.42 [95% CI, .20-1.0]) and with people they knew (IRR for always vs never, 0.64 [.39-1.06]; IRR for sometimes vs never, 0.60 [.38-.96]) had lower seroconversion risk. Seroconversion risk increased with epidemiologic risk score (IRR for medium vs low score, 1.68 [95% CI, 1.03-2.81]; IRR for high vs low score, 3.49 [2.26-5.58]). Only 29% of those who seroconverted reported isolating, and only 19% were asked about contacts. 
CONCLUSIONS: Modifiable risk factors and poor reach of public health strategies drove SARS-CoV-2 transmission across the United States.
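As a worked check of the seroincidence figure reported above (161 seroconversions over 1646 person-years), the rate per 100 person-years can be recomputed directly. The confidence interval below uses a normal approximation on the log rate scale, which is an assumption of this sketch. The abstract does not state how its interval was obtained, so the approximate bounds need not match the published ones exactly. The function name `incidence_rate_per_100py` is hypothetical.

```python
import math


def incidence_rate_per_100py(events: int, person_years: float, z: float = 1.96):
    """Return (rate, lower, upper) per 100 person-years.

    Assumed interval method: normal approximation on the log rate scale,
    with standard error 1 / sqrt(events).
    """
    rate = events / person_years
    half_width = z / math.sqrt(events)
    lower = rate * math.exp(-half_width)
    upper = rate * math.exp(half_width)
    return 100 * rate, 100 * lower, 100 * upper


# Counts taken from the abstract.
rate, lower, upper = incidence_rate_per_100py(events=161, person_years=1646)
print(f"{rate:.1f} per 100 person-years (approximate 95% CI {lower:.1f}-{upper:.1f})")
```

The point estimate rounds to the 9.8 per 100 person-years given in the abstract. An exact Poisson interval would be slightly wider than this approximation.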
Subjects
COVID-19 , HIV Seropositivity , Male , Humans , Adult , Female , United States/epidemiology , SARS-CoV-2 , COVID-19/epidemiology , Incidence , Prospective Studies , Cross-Sectional Studies , Crowding , Seroepidemiologic Studies , Family Characteristics , Risk Factors
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
ABSTRACT
Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights. The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages. Featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools, we present an overview and online book (https://osca.bioconductor.org) of single-cell methods for prospective users.
Subjects
Single-Cell Analysis/methods , Gene Expression Profiling , Genome , High-Throughput Nucleotide Sequencing , Software
ABSTRACT
MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT: ludwig.geistlinger@sph.cuny.edu.
Subjects
Gene Expression Profiling/methods , Genomics/methods , RNA-Seq/methods , Animals , Benchmarking , Databases, Genetic/standards , Gene Expression Profiling/standards , Genomics/standards , Humans , RNA-Seq/standards , Software
ABSTRACT
MOTIVATION: Modern biological screens yield enormous numbers of measurements, and identifying and interpreting statistically significant associations among features are essential. In experiments featuring multiple high-dimensional datasets collected from the same set of samples, it is useful to identify groups of associated features between the datasets in a way that provides high statistical power and false discovery rate (FDR) control. RESULTS: Here, we present a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for structured association discovery between paired high-dimensional datasets. HAllA efficiently integrates hierarchical hypothesis testing with FDR correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data. We optimized and evaluated HAllA using heterogeneous synthetic datasets of known association structure, where HAllA outperformed all-against-all and other block-testing approaches across a range of common similarity measures. We then applied HAllA to a series of real-world multiomics datasets, revealing new associations between gene expression and host immune activity, the microbiome and host transcriptome, metabolomic profiling and human health phenotypes. AVAILABILITY AND IMPLEMENTATION: An open-source implementation of HAllA is freely available at http://huttenhower.sph.harvard.edu/halla along with documentation, demo datasets and a user group. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
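For readers unfamiliar with FDR control, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a flat list of p-values. It is not HAllA's hierarchical procedure, which the abstract describes only at a high level. The function name `benjamini_hochberg` and the example p-values are assumptions chosen for illustration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per p-value, in the original input order.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= alpha * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected


# Two of four tests pass at FDR 0.05.
flags_a = benjamini_hochberg([0.205, 0.001, 0.041, 0.008])
# Step-up behaviour: 0.03 fails its own threshold (0.025) but is still
# rejected because larger-ranked p-values pass theirs.
flags_b = benjamini_hochberg([0.03, 0.001, 0.04, 0.035])
print(flags_a, flags_b)
```

An all-against-all analysis would apply a correction of this kind to every pairwise test. A hierarchical approach instead tests blocks of related features, which reduces the number of hypotheses tested.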
Subjects
Microbiota , Transcriptome
ABSTRACT
It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or compositional measurements. Here we introduce an optimized combination of novel and established methodology to assess multivariable association of microbial community features with complex metadata in population-scale observational studies. Our approach, MaAsLin 2 (Microbiome Multivariable Associations with Linear Models), uses generalized linear and mixed models to accommodate a wide variety of modern epidemiological studies, including cross-sectional and longitudinal designs, as well as a variety of data types (e.g., counts and relative abundances) with or without covariates and repeated measurements. To construct this method, we conducted a large-scale evaluation of a broad range of scenarios under which straightforward identification of meta-omics associations can be challenging. These simulation studies reveal that MaAsLin 2's linear model preserves statistical power in the presence of repeated measures and multiple covariates, while accounting for the nuances of meta-omics features and controlling false discovery. We also applied MaAsLin 2 to a microbial multi-omics dataset from the Integrative Human Microbiome (HMP2) project which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel diseases (IBD) across multiple time points and omics profiles.
Subjects
Computational Biology , Gastrointestinal Microbiome , Multivariate Analysis , Computer Simulation , Humans , Inflammatory Bowel Diseases/genetics , Inflammatory Bowel Diseases/metabolism , Inflammatory Bowel Diseases/pathology
ABSTRACT
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
Subjects
Biostatistics/methods , Genetic Research , Genomics/methods , Models, Biological , Models, Statistical , Genomics/standards , Humans , Metagenome/genetics , Microarray Analysis/methods , Microarray Analysis/standards , Microbiota/genetics , Sequence Analysis, RNA/methods
ABSTRACT
SUMMARY: Copy number variation (CNV) is a major type of structural genomic variation that is increasingly studied across different species for association with diseases and production traits. Established protocols for experimental detection and computational inference of CNVs from SNP array and next-generation sequencing data are available. We present the CNVRanger R/Bioconductor package which implements a comprehensive toolbox for structured downstream analysis of CNVs. This includes functionality for summarizing individual CNV calls across a population, assessing overlap with functional genomic regions, and genome-wide association analysis with gene expression and quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION: http://bioconductor.org/packages/CNVRanger.
Subjects
DNA Copy Number Variations , Genome-Wide Association Study , Computational Biology , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
Recent data suggest that frequent endoscopy and biopsy without evidence of graft dysfunction do not appear to confer a survival advantage after intestinal transplantation. After we abandoned protocol surveillance, the frequency of endoscopic examination decreased significantly at our center. These observations led us to question the need for stoma creation in intestinal transplantation. Herein, we report clinical outcomes of intestinal transplantation without stoma, compared to conventional transplant with stoma. Data analysis was limited to adult intestinal transplantation without liver allograft between 2015 and 2018. We compared patient and graft survival, frequency of endoscopic evaluation, episodes of acute rejection, nutritional therapy, and renal function between "Control group (with stoma)," n = 18 grafts in 16 patients and "Study group (without stoma)," n = 16 grafts in 15 patients. Overall outcome was similar between the 2 groups with respect to graft and patient survival, episodes of acute rejection, and its response to treatment. Nutritional outcomes were similar in both groups. Fewer antidiarrheal medications were required in the study group, but this did not translate into demonstrable gains in preservation of renal function, despite an apparent trend to improvement. Intestinal transplantation without stoma appears to be an acceptable practice model without obvious adverse impact on outcome.
Subjects
Graft Rejection , Organ Transplantation , Adult , Graft Rejection/etiology , Graft Survival , Humans , Immunosuppressive Agents , Intestines
ABSTRACT
Phase 1 of the Human Microbiome Project (HMP) investigated 18 body subsites of 242 healthy American adults to produce the first comprehensive reference for the composition and variation of the "healthy" human microbiome. Publicly available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis. The HMP16SData package provides count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. By removing bioinformatic hurdles of data access and management, HMP16SData enables epidemiologists with only basic R skills to quickly analyze HMP data.
Subjects
Databases, Genetic/statistics & numerical data , Microbiota/physiology , RNA, Ribosomal, 16S/metabolism , Adolescent , Adult , Computational Biology , Female , Humans , Male , Young Adult
ABSTRACT
Summary: bioBakery is a meta'omic analysis environment and collection of individual software tools with the capacity to process raw shotgun sequencing data into actionable microbial community feature profiles, summary reports, and publication-ready figures. It includes a collection of pre-configured analysis modules also joined into workflows for reproducibility. Availability and implementation: bioBakery (http://huttenhower.sph.harvard.edu/biobakery) is publicly available for local installation as individual modules and as a virtual machine image. Each individual module has been developed to perform a particular task (e.g. quantitative taxonomic profiling or statistical analysis), and they are provided with source code, tutorials, demonstration data, and validation results; the bioBakery virtual image includes the entire suite of modules and their dependencies pre-installed. Images are available for both Amazon EC2 and Google Compute Engine. All software is open source under the MIT license. bioBakery is actively maintained with a support group at biobakery-users@googlegroups.com and new tools being added upon their release. Contact: chuttenh@hsph.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Subjects
Metagenomics/methods , Microbiota/genetics , Software , Reproducibility of Results , Workflow
ABSTRACT
Many nonrandomized interventions rely upon a pre-post design to evaluate effectiveness. Such designs cannot account for events external to the intervention that may produce the outcome. We describe a method to construct a surveillance registry-based comparison group, which allows for estimating the effectiveness of the intervention while controlling for secular trends in the outcome of interest. Using data from the population-based, human immunodeficiency virus Surveillance Registry in New York City, we created a contemporaneous comparison group for persons enrolled in the New York City human immunodeficiency virus Care Coordination Program (CCP) from December 2009 to March 2013. Inclusion in the Registry-based (non-CCP) comparison group required meeting CCP eligibility criteria. To control for secular trends in the outcome, we randomly assigned persons in the non-CCP, Registry-based comparison group a pseudoenrollment date such that the distribution of pseudoenrollment dates matched the distribution of enrollment dates among CCP enrollees. We then matched CCP to non-CCP persons on propensity for enrollment in the CCP, enrollment dates, and baseline viral load. Registry-based comparison group estimates were attenuated relative to pre-post estimates of program effectiveness. These methods have broad applicability for observational intervention effectiveness studies and programmatic evaluations for conditions with surveillance registries.
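The pseudoenrollment step described above can be illustrated with a short sketch. It draws each comparison-group person's pseudoenrollment date at random, with replacement, from the observed CCP enrollment dates, so the two date distributions agree in expectation. This is one plausible way to do it and an assumption of this sketch, since the abstract does not give the exact assignment procedure. The function name, identifiers, and dates are all hypothetical.

```python
import random
from datetime import date
from typing import Dict, List


def assign_pseudoenrollment_dates(
    enrollment_dates: List[date], comparison_ids: List[str], seed: int = 0
) -> Dict[str, date]:
    """Give each comparison person a date sampled from the program's dates.

    Sampling with replacement from the empirical enrollment dates is an
    assumed mechanism for matching the two distributions.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    sampled = rng.choices(enrollment_dates, k=len(comparison_ids))
    return dict(zip(comparison_ids, sampled))


# Made-up enrollment dates within the December 2009 to March 2013 window.
ccp_enrollment_dates = [
    date(2009, 12, 15),
    date(2010, 6, 1),
    date(2011, 3, 20),
    date(2013, 3, 5),
]
non_ccp_ids = ["p1", "p2", "p3", "p4", "p5"]
pseudo_dates = assign_pseudoenrollment_dates(ccp_enrollment_dates, non_ccp_ids)
for person_id, pseudo_date in pseudo_dates.items():
    print(person_id, pseudo_date.isoformat())
```

Baseline covariates such as viral load would then be measured relative to each person's pseudoenrollment date before the propensity matching step.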
Subjects
Population Surveillance , Program Evaluation/methods , Registries , Female , HIV Infections/therapy , Humans , Male , Middle Aged , New York City , Patient Care Management
ABSTRACT
Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.
Subjects
Computational Biology , Gene Expression Profiling , Genomics/methods , High-Throughput Screening Assays/methods , Software , Programming Languages , User-Computer Interface
ABSTRACT
Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays has the potential to provide a systems-level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in the areas of clinical oncology, pharmacogenomics and other perturbation experiments, population genomics, regulatory genomics and other areas, as well as tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.
Subjects
Genomics , Computational Biology , MicroRNAs , Sequence Analysis, DNA
ABSTRACT
Transcript levels do not faithfully predict protein levels, due to post-transcriptional regulation of gene expression mediated by RNA binding proteins (RBPs) and non-coding RNAs. We developed a multivariate linear regression model integrating RBP levels and predicted RBP-mRNA regulatory interactions from matched transcript and protein datasets. RBPs significantly improved the accuracy in predicting protein abundance of a portion of the total modeled mRNAs in three panels of tissues and cells and for different methods employed in the detection of mRNA and protein. The presence of upstream translation initiation sites (uTISs) at the mRNA 5' untranslated regions was strongly associated with improvement in predictive accuracy. On the basis of these observations, we propose that the recently discovered widespread uTISs in the human genome can be a previously unappreciated substrate of translational control mediated by RBPs.