ABSTRACT
The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.
Subject(s)
Software , Humans , Computational Biology/methods , Mononuclear Leukocytes/metabolism , Mononuclear Leukocytes/cytology , Genomics/methods , Data Analysis
ABSTRACT
SUMMARY: SpatialExperiment is a new data infrastructure for storing and accessing spatially-resolved transcriptomics data, implemented within the R/Bioconductor framework, which provides advantages of modularity, interoperability, standardized operations and comprehensive documentation. Here, we demonstrate the structure and user interface with examples from the 10x Genomics Visium and seqFISH platforms, and provide access to example datasets and visualization tools in the STexampleData, TENxVisiumData and ggspavis packages. AVAILABILITY AND IMPLEMENTATION: The SpatialExperiment, STexampleData, TENxVisiumData and ggspavis packages are available from Bioconductor. The package versions described in this manuscript are available in Bioconductor version 3.15 onwards. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Software , Transcriptome , Genomics
ABSTRACT
Haematopoietic stem cell dynamics regulate healthy blood cell production and are disrupted during leukaemia. Competition models of cellular species help to elucidate stem cell dynamics in the bone marrow microenvironment (or niche), and to determine how these dynamics impact leukaemia progression. Here we develop two models that target acute myeloid leukaemia with particular focus on the mechanisms that control proliferation via feedback signalling. It is within regions of parameter space permissive of coexistence that the effects of competition are most subtle and the clinical outcome least certain. Steady state and linear stability analyses identify parameter regions that allow for coexistence to occur, and allow us to characterise behaviour near critical points. Where analytical expressions are no longer informative, we proceed statistically and sample parameter space over a coexistence region. We find that the rates of proliferation and differentiation of healthy progenitors exert key control over coexistence. We also show that inclusion of a regulatory feedback onto progenitor cells promotes healthy haematopoiesis at the expense of leukaemia, and that - somewhat paradoxically - within the coexistence region feedback increases the sensitivity of the system to dominance by one lineage over another.
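The steady-state and linear stability analysis described above can be illustrated with a generic two-species competition model. This is a minimal sketch under stated assumptions, not the authors' model: the Lotka-Volterra form, the parameter values, and the names `rhs`, `coexistence_state`, and `jacobian` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical parameters (not taken from the paper): growth rates r,
# carrying capacities K, and competition coefficients a12, a21.
r = np.array([1.0, 0.8])
K = np.array([1.0, 1.0])
a12, a21 = 0.5, 0.6  # effect of species 2 on species 1, and of 1 on 2


def rhs(x):
    """Right-hand side of a two-species Lotka-Volterra competition model."""
    x1, x2 = x
    return np.array([
        r[0] * x1 * (1 - (x1 + a12 * x2) / K[0]),
        r[1] * x2 * (1 - (x2 + a21 * x1) / K[1]),
    ])


def coexistence_state():
    """Interior steady state: solve x1 + a12*x2 = K1 and a21*x1 + x2 = K2."""
    A = np.array([[1.0, a12], [a21, 1.0]])
    return np.linalg.solve(A, K)


def jacobian(x):
    """Jacobian of rhs evaluated at the point x."""
    x1, x2 = x
    return np.array([
        [r[0] * (1 - (2 * x1 + a12 * x2) / K[0]), -r[0] * a12 * x1 / K[0]],
        [-r[1] * a21 * x2 / K[1], r[1] * (1 - (2 * x2 + a21 * x1) / K[1])],
    ])


x_star = coexistence_state()
eigenvalues = np.linalg.eigvals(jacobian(x_star))
# The steady state is linearly stable when all eigenvalues have negative real part.
is_stable = bool(np.all(eigenvalues.real < 0))
```

With these illustrative values both populations are positive at the steady state and both eigenvalues have negative real part, so the two lineages coexist stably. Sampling `a12`, `a21`, and `r` over a range and recording `is_stable` would be one simple way to map a coexistence region numerically.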
Subject(s)
Physiological Feedback/physiology , Hematopoietic Stem Cells/cytology , Acute Myeloid Leukemia/pathology , Biological Models , Bone Marrow Cells , Cell Differentiation , Cell Lineage/physiology , Cell Proliferation , Humans , Kinetics , Stem Cell Niche
ABSTRACT
The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor [1] provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming [2] offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas [3], spanning six data frameworks and ten analysis tools.
ABSTRACT
BACKGROUND: With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, both on their own and in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard to make results credible and transferable to real data. RESULTS: Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch and cluster level. Second, we investigated the effect of simulators on clustering and batch correction method comparisons and, third, which quality control summaries can capture reference-simulation similarity, and to what extent. CONCLUSIONS: Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, that they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and that it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Subject(s)
Benchmarking , Single-Cell Analysis , Single-Cell Analysis/methods , Computer Simulation , Cluster Analysis , RNA Sequence Analysis/methods , Gene Expression Profiling/methods
ABSTRACT
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Subject(s)
Benchmarking , Computational Biology , Computational Biology/methods , Workflow
ABSTRACT
Ulcerative colitis and Crohn's disease are chronic inflammatory intestinal diseases with perplexing heterogeneity in disease manifestation and response to treatment. While the molecular basis for this heterogeneity remains uncharacterized, single-cell technologies allow us to explore the transcriptional states within tissues at an unprecedented resolution which could further understanding of these complex diseases. Here, we apply single-cell RNA-sequencing to human inflamed intestine and show that the largest differences among patients are present within the myeloid compartment including macrophages and neutrophils. Using spatial transcriptomics in human tissue at single-cell resolution (CosMx Spatial Molecular Imaging) we spatially localize each of the macrophage and neutrophil subsets identified by single-cell RNA-sequencing and unravel further macrophage diversity based on their tissue localization. Finally, single-cell RNA-sequencing combined with single-cell spatial analysis reveals a strong communication network involving macrophages and inflammatory fibroblasts. Our data sheds light on the cellular complexity of these diseases and points towards the myeloid and stromal compartments as important cellular subsets for understanding patient-to-patient heterogeneity.
Subject(s)
Crohn Disease , Inflammatory Bowel Diseases , Humans , Neutrophils , Inflammatory Bowel Diseases/genetics , Crohn Disease/genetics , Macrophages , RNA
ABSTRACT
The mesothelium lines body cavities and surrounds internal organs, widely contributing to homeostasis and regeneration. Mesothelium disruptions cause visceral anomalies and mesothelioma tumors. Nonetheless, the embryonic emergence of mesothelia remains incompletely understood. Here, we track mesothelial origins in the lateral plate mesoderm (LPM) using zebrafish. Single-cell transcriptomics uncovers a post-gastrulation gene expression signature centered on hand2 in distinct LPM progenitor cells. We map mesothelial progenitors to lateral-most, hand2-expressing LPM and confirm conservation in mouse. Time-lapse imaging of zebrafish hand2 reporter embryos captures mesothelium formation including pericardium, visceral, and parietal peritoneum. We find primordial germ cells migrate with the forming mesothelium as ventral migration boundary. Functionally, hand2 loss disrupts mesothelium formation with reduced progenitor cells and perturbed migration. In mouse and human mesothelioma, we document expression of LPM-associated transcription factors including Hand2, suggesting re-initiation of a developmental program. Our data connects mesothelium development to Hand2, expanding our understanding of mesothelial pathologies.
Subject(s)
Mesothelioma , Zebrafish , Animals , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Epithelium/metabolism , Mesothelioma/genetics , Mice , Transcription Factors/metabolism , Zebrafish Proteins/genetics , Zebrafish Proteins/metabolism
ABSTRACT
A key challenge in single-cell RNA-sequencing (scRNA-seq) data analysis is batch effects that can obscure the biological signal of interest. Although there are various tools and methods to correct for batch effects, their performance can vary. Therefore, it is important to understand how batch effects manifest in order to adjust for them. Here, we systematically explore batch effects across various scRNA-seq datasets according to magnitude, cell type specificity, and complexity. We developed a cell-specific mixing score (cms) that quantifies mixing of cells from multiple batches. By considering distance distributions, the score is able to detect local batch bias as well as differentiate between unbalanced batches and systematic differences between cells of the same cell type. We compare metrics in scRNA-seq data using real and synthetic datasets. Although these metrics target the same question and are used interchangeably, we find differences in scalability, sensitivity, and ability to handle differentially abundant cell types. We find that cell-specific metrics outperform cell type-specific and global metrics and recommend them for both method benchmarks and batch exploration.
Subject(s)
RNA Sequence Analysis/methods , Sequence Analysis/methods , Single-Cell Analysis/methods , Algorithms , Artifacts , Base Sequence/genetics , Data Analysis , Gene Expression Profiling/methods , Humans , RNA-Seq/methods , Software , Exome Sequencing/methods
ABSTRACT
Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes the preprocessing difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor's SingleCellExperiment class and is fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance, while also allowing for batch correction. This pipeline is based on open source R packages and can easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology.
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) has become an empowering technology to profile the transcriptomes of individual cells on a large scale. Early analyses of differential expression have aimed at identifying differences between subpopulations to identify subpopulation markers. More generally, such methods compare expression levels across sets of cells, thus leading to cross-condition analyses. Given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis; however, it is not clear which statistical framework best handles this situation. Here, we surveyed methods to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated pseudobulk data. To evaluate method performance, we developed a flexible simulation that mimics multi-sample scRNA-seq data. We analyzed scRNA-seq data from mouse cortex cells to uncover subpopulation-specific responses to lipopolysaccharide treatment, and provide robust tools for multi-condition analysis within the muscat R package.
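The pseudobulk aggregation mentioned above can be sketched as summing counts over all cells that share a subpopulation and a sample. The function `pseudobulk` and the toy data below are hypothetical and written in Python for illustration only; they are not the muscat API, which is an R package.

```python
import numpy as np


def pseudobulk(counts, sample_ids, cluster_ids):
    """Sum counts over cells within each (cluster, sample) pair.

    counts is a genes x cells array; the result maps (cluster, sample)
    to a vector with one summed count per gene.
    """
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    cluster_ids = np.asarray(cluster_ids)
    out = {}
    for cluster in np.unique(cluster_ids):
        for sample in np.unique(sample_ids):
            mask = (cluster_ids == cluster) & (sample_ids == sample)
            if mask.any():  # skip combinations with no cells
                out[(str(cluster), str(sample))] = counts[:, mask].sum(axis=1)
    return out


# Toy data: 2 genes x 5 cells, 2 samples, 2 clusters (all values made up).
toy_counts = np.array([[1, 2, 3, 4, 5],
                       [0, 1, 0, 1, 0]])
toy_samples = ["s1", "s1", "s2", "s2", "s2"]
toy_clusters = ["A", "B", "A", "A", "B"]
pb = pseudobulk(toy_counts, toy_samples, toy_clusters)
```

Each resulting vector can then be treated like one bulk RNA-seq sample, so that replicate samples, rather than individual cells, serve as the units of a cross-condition comparison within each subpopulation.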
Subject(s)
Gene Expression Profiling/methods , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Transcriptome , Animals , Cerebellar Cortex/drug effects , Cerebellar Cortex/metabolism , Cluster Analysis , Computational Biology , Computer Simulation , Lipopolysaccharides/adverse effects , Male , Mice , Statistical Models , Small Cytoplasmic RNA , Signal Transduction , Software
ABSTRACT
The advent of mass cytometry increased the number of parameters measured at the single-cell level while decreasing the extent of crosstalk between channels relative to dye-based flow cytometry. Although reduced, spillover still exists in mass cytometry data, and minimizing its effect requires considerable expert knowledge and substantial experimental effort. Here, we describe a novel bead-based compensation workflow and R-based software that estimates and corrects for interference between channels. We performed an in-depth characterization of the spillover properties in mass cytometry, including limitations defined by the linear range of the mass cytometer and the reproducibility of the spillover over time and across machines. We demonstrated the utility of our method in suspension and imaging mass cytometry. To conclude, our approach greatly simplifies the development of new antibody panels, increases flexibility for antibody-metal pairing, opens the way to using less pure isotopes, and improves overall data quality, thereby reducing the risk of reporting cell phenotype artifacts.
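The idea of correcting for interference between channels can be illustrated as a linear unmixing problem. The sketch below assumes that the observed signal equals the true signal multiplied by a spillover matrix; the matrix values, the function `compensate`, and the example intensities are hypothetical and do not reproduce the authors' software.

```python
import numpy as np

# Hypothetical 3-channel spillover matrix: row i gives the fraction of the
# signal from channel i that is detected in each channel (diagonal = 1).
spillover = np.array([
    [1.00, 0.03, 0.00],
    [0.01, 1.00, 0.02],
    [0.00, 0.04, 1.00],
])


def compensate(observed, sm):
    """Recover the true signal, assuming observed = true @ sm.

    Solving sm.T @ true.T = observed.T avoids forming the explicit inverse.
    """
    return np.linalg.solve(sm.T, observed.T).T


# Two made-up cells measured in three channels.
true_signal = np.array([[100.0, 0.0, 50.0],
                        [10.0, 200.0, 0.0]])
observed = true_signal @ spillover
recovered = compensate(observed, spillover)
```

In this noise-free toy case the linear solve recovers the true signal exactly. With noisy measurements a plain linear solve can produce negative intensities, so adding a non-negativity constraint is one common alternative.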
Subject(s)
Flow Cytometry/methods , Image Cytometry/methods , Antibodies/immunology , Breast Neoplasms/pathology , Female , Humans , Immunophenotyping/methods , Reproducibility of Results , Signal-to-Noise Ratio , Single-Cell Analysis/methods , Software , Suspensions
ABSTRACT
High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signalling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).