ABSTRACT
A recent paper claimed that t-SNE and UMAP embeddings of single-cell datasets are "specious" and fail to capture true biological structure. The authors argued that such embeddings are as arbitrary and as misleading as forcing the data into an elephant shape. Here we show that this conclusion was based on inadequate and limited metrics of embedding quality. More appropriate metrics quantifying neighborhood and class preservation reveal the elephant in the room: while t-SNE and UMAP embeddings of single-cell data do not preserve high-dimensional distances, they can nevertheless provide biologically relevant information.
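As a rough illustration of what a neighborhood-preservation metric measures, the sketch below computes the mean fraction of each point's k nearest neighbors in the original space that are still among its k nearest neighbors in the embedding. The function names, the toy data and the choice of k are illustrative assumptions, not the metric definitions used in the paper.

```python
import math
import random


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def knn_preservation(high, low, k=10):
    """Mean fraction of high-dimensional k-neighbors that are also low-dimensional k-neighbors."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    return sum(len(a & b) / k for a, b in zip(nn_high, nn_low)) / len(high)


random.seed(0)
# Toy "high-dimensional" data: 60 points in 5 dimensions (hypothetical).
high = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(60)]
# An embedding identical to the input preserves every neighborhood.
identity_embedding = [row[:] for row in high]
# An embedding unrelated to the input should preserve very few neighborhoods.
random_embedding = [[random.random(), random.random()] for _ in high]

score_identity = knn_preservation(high, identity_embedding, k=5)
score_random = knn_preservation(high, random_embedding, k=5)
print(score_identity, score_random)
```

A score near 1 means local neighborhoods survive the embedding even when pairwise distances do not, which is the distinction the abstract draws.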
Subject(s)
Computational Biology; Single-Cell Analysis; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Computational Biology/methods; Algorithms; Humans; Animals
ABSTRACT
Dimension reduction tools preserving similarity and graph structure such as t-SNE and UMAP can capture complex biological patterns in high-dimensional data. However, these tools typically are not designed to separate effects of interest from unwanted effects due to confounders. We introduce the partial embedding (PARE) framework, which enables removal of confounders from any distance-based dimension reduction method. We then develop partial t-SNE and partial UMAP and apply these methods to genomic and neuroimaging data. For lower-dimensional visualization, our results show that the PARE framework can remove batch effects in single-cell sequencing data as well as separate clinical and technical variability in neuroimaging measures. We demonstrate that the PARE framework extends dimension reduction methods to highlight biological patterns of interest while effectively removing confounding effects.
Subject(s)
Algorithms; Computational Biology; Neuroimaging; Humans; Neuroimaging/methods; Computational Biology/methods; Genomics/methods; Genomics/statistics & numerical data; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data
ABSTRACT
Single-cell ATAC-seq (scATAC-seq) data have been widely used to investigate chromatin accessibility at the single-cell level. One important application of scATAC-seq data analysis is differential chromatin accessibility (DA) analysis. However, characteristics of scATAC-seq data, such as excessive zeros and large variability of chromatin accessibility across cells, pose a unique challenge for DA analysis. Existing statistical methods focus on detecting mean differences in chromatin accessible regions while overlooking differences in distribution. Motivated by real-data exploration showing that distribution differences exist among cell types, we introduce a novel composite statistical test named "scaDA", based on the zero-inflated negative binomial (ZINB) model, for performing differential distribution analysis of chromatin accessibility by jointly testing abundance, prevalence and dispersion. Benefiting from both dispersion shrinkage and iterative refinement of mean and prevalence parameter estimates, scaDA demonstrates its superiority to both ZINB-based likelihood ratio tests and published methods by achieving the highest power and best FDR control in a comprehensive simulation study. In addition to demonstrating the highest power in three real sc-multiome data analyses, scaDA successfully identifies differentially accessible regions in microglia from sc-multiome data for an Alzheimer's disease (AD) study that are most enriched in GO terms related to neurogenesis and the clinical phenotype of AD, and in AD-associated GWAS SNPs.
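For readers unfamiliar with the ZINB model, the sketch below evaluates its probability mass function using the common mean/size parameterization of the negative binomial plus a point mass at zero. The parameter names (`mean`, `size`, `zero_prob`) and the example values are assumptions for illustration; the abstract does not specify how scaDA parameterizes abundance, prevalence and dispersion.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log-probability of count k under a negative binomial with given mean and size (inverse dispersion)."""
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )


def zinb_pmf(k, mean, size, zero_prob):
    """Probability of count k under a zero-inflated negative binomial.

    zero_prob is the probability of a structural zero; the remaining mass
    follows the negative binomial component.
    """
    nb = math.exp(nb_log_pmf(k, mean, size))
    if k == 0:
        return zero_prob + (1.0 - zero_prob) * nb
    return (1.0 - zero_prob) * nb


# Hypothetical parameters: many structural zeros, modest mean, strong overdispersion.
params = dict(mean=2.0, size=1.5, zero_prob=0.6)
total = sum(zinb_pmf(k, **params) for k in range(500))
p_zero = zinb_pmf(0, **params)
print(total, p_zero)
```

Two cell types can share the same mean count while differing in `zero_prob` or `size`, which is the kind of distributional difference a mean-only test would miss.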
Subject(s)
Chromatin; Single-Cell Analysis; Chromatin/genetics; Chromatin/metabolism; Chromatin/chemistry; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Humans; Computational Biology/methods; Alzheimer Disease/genetics; Models, Statistical; Chromatin Immunoprecipitation Sequencing/methods; Computer Simulation; Animals; Sequence Analysis, DNA/methods; Algorithms
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool in genomics research, enabling the analysis of gene expression at the individual cell level. However, scRNA-seq data often suffer from a high rate of dropouts, where certain genes fail to be detected in specific cells due to technical limitations. This missing data can introduce biases and hinder downstream analysis. To overcome this challenge, the development of effective imputation methods has become crucial in the field of scRNA-seq data analysis. Here, we propose an imputation method based on robust and non-negative matrix factorization (scRNMF). Unlike other matrix factorization algorithms, scRNMF integrates two loss functions: L2 loss and C-loss. The L2 loss function is highly sensitive to outliers, which can introduce substantial errors. We therefore use the C-loss function when dealing with zero values in the raw data. The primary advantage of the C-loss function is that it imposes a smaller penalty on larger errors, which results in more robust factorization when handling outliers. Various datasets of different sizes and zero rates are used to evaluate the performance of scRNMF against other state-of-the-art methods. Our method demonstrates its power and stability as a tool for imputation of scRNA-seq data.
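The contrast between the two losses can be sketched numerically. The abstract does not give the C-loss formula; the code below assumes the widely used correntropy-induced form 1 - exp(-e^2 / (2 sigma^2)), with `sigma` as a hypothetical bandwidth parameter, so it should be read as an illustration of the idea and not as scRNMF's exact objective.

```python
import math


def l2_loss(error):
    """Squared-error loss: grows without bound as the error grows."""
    return error ** 2


def c_loss(error, sigma=1.0):
    """Assumed correntropy-induced loss: saturates at 1 for large errors."""
    return 1.0 - math.exp(-(error ** 2) / (2.0 * sigma ** 2))


errors = [0.1, 1.0, 10.0]
l2_values = [l2_loss(e) for e in errors]
c_values = [c_loss(e) for e in errors]

for e, l2, c in zip(errors, l2_values, c_values):
    print(f"error={e:5.1f}  L2={l2:8.3f}  C-loss={c:.3f}")
```

Because the assumed C-loss is bounded, a single extreme residual cannot dominate the objective the way it does under L2, which matches the robustness argument in the abstract.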
Subject(s)
Algorithms; Computational Biology; RNA-Seq; Single-Cell Analysis; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; RNA-Seq/methods; RNA-Seq/statistics & numerical data; Computational Biology/methods; Humans; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Gene Expression Profiling/methods; Gene Expression Profiling/statistics & numerical data; Software; Single-Cell Gene Expression Analysis
ABSTRACT
Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. To address these challenges, we developed a novel analysis framework, Single-Cell Path Metrics Profiling (scPMP), using power-weighted path metrics, which measure distances between cells in a data-driven way. Unlike Euclidean distance and other commonly used distance metrics, path metrics are density sensitive and respect the underlying data geometry. By combining path metrics with multidimensional scaling, a low-dimensional embedding of the data is obtained which preserves both the global data geometry and cluster structure. We evaluate the method both for clustering quality and geometric fidelity, and it outperforms current scRNAseq clustering algorithms on a wide range of benchmarking data sets.
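A power-weighted path metric can be sketched as an all-pairs shortest-path problem in which each hop costs the Euclidean distance raised to a power p, with the p-th root taken at the end. This definition, the Floyd-Warshall implementation on the complete graph and the toy data are assumptions for illustration; the abstract does not describe how scPMP computes or approximates the metric.

```python
import math


def power_weighted_path_metric(points, p=2.0):
    """All-pairs power-weighted path distances on the complete graph over `points`.

    The distance between two points is the minimum over paths of
    (sum of hop_length ** p) ** (1 / p), computed with Floyd-Warshall.
    """
    n = len(points)
    # Cost of a direct hop between every pair: Euclidean distance to the power p.
    cost = [[math.dist(a, b) ** p for b in points] for a in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = cost[i][k] + cost[k][j]
                if via < cost[i][j]:
                    cost[i][j] = via
    return [[c ** (1.0 / p) for c in row] for row in cost]


# Five evenly spaced points on a line: a densely sampled region (toy example).
chain = [(float(x), 0.0) for x in range(5)]
d_p2 = power_weighted_path_metric(chain, p=2.0)
d_p1 = power_weighted_path_metric(chain, p=1.0)

# With p = 2, four unit hops cost 4 in total, so the endpoints are sqrt(4) = 2 apart,
# shorter than their Euclidean distance of 4; with p = 1 the metric is Euclidean.
print(d_p2[0][4], d_p1[0][4])
```

For p > 1, many short hops through a dense region are cheaper than one long jump, which is what makes the metric density sensitive. The resulting distance matrix could then be passed to a multidimensional scaling routine to obtain an embedding.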
Subject(s)
Algorithms; Computational Biology; Single-Cell Analysis; Cluster Analysis; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Humans; Computational Biology/methods; RNA-Seq/methods; RNA-Seq/statistics & numerical data; Gene Expression Profiling/methods; Gene Expression Profiling/statistics & numerical data; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Gene Expression Analysis
ABSTRACT
Boolean networks are widely employed to model the qualitative dynamics of cell fate processes by describing the change of binary activation states of genes and transcription factors with time. Being able to bridge such qualitative states with quantitative measurements of gene expression in cells, such as scRNA-seq, is a cornerstone for data-driven model construction and validation. On one hand, scRNA-seq binarisation is a key step for inferring and validating Boolean models. On the other hand, the generation of synthetic scRNA-seq data from baseline Boolean models provides an important asset to benchmark inference methods. However, linking characteristics of scRNA-seq datasets, including dropout events, with Boolean states is a challenging task. We present scBoolSeq, a method for the bidirectional linking of scRNA-seq data and Boolean activation states of genes. Given a reference scRNA-seq dataset, scBoolSeq computes statistical criteria to classify the empirical gene pseudocount distributions as either unimodal, bimodal, or zero-inflated, and fits a probabilistic model of dropouts with gene-dependent parameters. From these learnt distributions, scBoolSeq can both binarise scRNA-seq datasets and generate synthetic scRNA-seq datasets from Boolean traces, as issued from Boolean networks, using biased sampling and dropout simulation. We present a case study demonstrating the application of scBoolSeq's binarisation scheme in data-driven model inference. Furthermore, we compare synthetic scRNA-seq data generated by scBoolSeq with BoolODE's data for the same Boolean network model. The comparison shows that our method better reproduces the statistics of real scRNA-seq datasets, such as the mean-variance and mean-dropout relationships, while exhibiting clearly defined trajectories in two-dimensional projections of the data.
Subject(s)
Computational Biology; Single-Cell Analysis; Computational Biology/methods; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Humans; RNA-Seq/methods; RNA-Seq/statistics & numerical data; Gene Expression Profiling/methods; Gene Expression Profiling/statistics & numerical data; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Algorithms; Gene Regulatory Networks/genetics; Models, Statistical; Software; Single-Cell Gene Expression Analysis
ABSTRACT
The recent maturation of single-cell RNA sequencing (scRNA-seq) technologies has coincided with transformative new methods to profile genetic, epigenetic, spatial, proteomic and lineage information in individual cells. This provides unique opportunities, alongside computational challenges, for integrative methods that can jointly learn across multiple types of data. Integrated analysis can discover relationships across cellular modalities, learn a holistic representation of the cell state, and enable the pooling of data sets produced across individuals and technologies. In this Review, we discuss the recent advances in the collection and integration of different data types at single-cell resolution with a focus on the integration of gene expression data with other types of single-cell measurement.
Subject(s)
Computational Biology/methods; Data Mining/statistics & numerical data; RNA/genetics; Single-Cell Analysis/statistics & numerical data; Datasets as Topic; Epigenesis, Genetic; High-Throughput Nucleotide Sequencing; Humans; Proteins/genetics; Proteins/metabolism; RNA/chemistry; RNA/metabolism; Single-Cell Analysis/methods
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance for the analysis of these data, as it is used to identify putative cell types. However, there are many challenges involved. We discuss why clustering is a challenging problem from a computational point of view and what aspects of the data make it challenging. We also consider the difficulties related to the biological interpretation and annotation of the identified clusters.
Subject(s)
Cell Lineage/genetics; Computational Biology/methods; High-Throughput Nucleotide Sequencing/statistics & numerical data; RNA, Messenger/genetics; Single-Cell Analysis/statistics & numerical data; Transcriptome; Cluster Analysis; Epigenesis, Genetic; Eukaryotic Cells/classification; Eukaryotic Cells/cytology; Eukaryotic Cells/metabolism; Gene Expression Profiling; Humans; RNA, Messenger/chemistry; RNA, Messenger/metabolism; Single-Cell Analysis/methods; Unsupervised Machine Learning
ABSTRACT
The rapid progress of protocols for sequencing single-cell transcriptomes over the past decade has been accompanied by equally impressive advances in the computational methods for analysis of such data. As capacity and accuracy of the experimental techniques grew, the emerging algorithm developments revealed increasingly complex facets of the underlying biology, from cell type composition to gene regulation to developmental dynamics. At the same time, rapid growth has forced continuous reevaluation of the underlying statistical models, experimental aims, and sheer volumes of data processing that are handled by these computational tools. Here, I review key computational steps of single-cell RNA sequencing (scRNA-seq) analysis, examine assumptions made by different approaches, and highlight successes, remaining ambiguities, and limitations that are important to keep in mind as scRNA-seq becomes a mainstream technique for studying biology.
Subject(s)
Computational Biology/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Animals; CD8-Positive T-Lymphocytes/cytology; CD8-Positive T-Lymphocytes/physiology; Computer Graphics; Databases, Genetic; Humans; Mice; Principal Component Analysis; Sequence Analysis, RNA/statistics & numerical data; Single-Cell Analysis/statistics & numerical data; Transcription, Genetic
ABSTRACT
The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
Subject(s)
Algorithms; Computer Simulation; Models, Statistical; Single-Cell Analysis; Humans; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Factor Analysis, Statistical; Nonlinear Dynamics
ABSTRACT
Tumors are complex and aggressive diseases that pose significant health challenges. Understanding the cellular mechanisms underlying their progression is crucial for developing effective treatments. In this study, we develop a novel mathematical framework to investigate the role of cellular plasticity and heterogeneity in tumor progression. By leveraging temporal single-cell data, we propose a reaction-convection-diffusion model that effectively captures the spatiotemporal dynamics of tumor cells and macrophages within the tumor microenvironment. Through theoretical analysis, we obtain an estimate of the pulse wave speed and analyze the stability of the homogeneous steady state solutions. Notably, we employ the AddModuleScore function to quantify cellular plasticity. One of the highlights of our approach is the introduction of pulse wave speed as a quantitative measure to precisely gauge the rate of cell phenotype transitions, as well as the novel use of the high-plasticity cell state/low-plasticity cell state ratio as an indicator of tumor malignancy. Furthermore, the bifurcation analysis reveals the complex dynamics of tumor cell populations. Our extensive analysis demonstrates that an increased rate of phenotype transition is associated with heightened malignancy, attributable to the tumor's ability to explore a wider phenotypic space. The study also investigates how the proliferation rate and the death rate of tumor cells, phenotypic convection velocity, and the midpoint of the phenotype transition stage affect the speed of tumor cell phenotype transitions and the progression to adenocarcinoma. These insights and quantitative measures can help guide the development of targeted therapeutic strategies to regulate cellular plasticity and control tumor progression effectively.
Subject(s)
Cell Plasticity; Mathematical Concepts; Models, Biological; Neoplasms; Phenotype; Single-Cell Analysis; Tumor Microenvironment; Humans; Tumor Microenvironment/physiology; Neoplasms/pathology; Neoplasms/physiopathology; Single-Cell Analysis/statistics & numerical data; Disease Progression; Cell Proliferation; Computer Simulation
ABSTRACT
BACKGROUND: Single-cell sequencing technologies have advanced our understanding of kidney biology and disease, but the loss of spatial information in these datasets hinders our interpretation of intercellular communication networks and regional gene expression patterns. New spatial transcriptomic sequencing platforms make it possible to measure the topography of gene expression at genome depth. METHODS: We optimized and validated a female bilateral ischemia-reperfusion injury model. Using the 10× Genomics Visium Spatial Gene Expression solution, we generated spatial maps of gene expression across the injury and repair time course, and applied two open-source computational tools, Giotto and SPOTlight, to increase resolution and measure cell-cell interaction dynamics. RESULTS: An ischemia time of 34 minutes in a female murine model resulted in comparable injury to 22 minutes for males. We report a total of 16,856 unique genes mapped across our injury and repair time course. Giotto, a computational toolbox for spatial data analysis, enabled increased resolution mapping of genes and cell types. Using a seeded nonnegative matrix regression (SPOTlight) to deconvolute the dynamic landscape of cell-cell interactions, we found that injured proximal tubule cells were characterized by increasing macrophage and lymphocyte interactions even 6 weeks after injury, potentially reflecting the AKI to CKD transition. CONCLUSIONS: In this transcriptomic atlas, we defined region-specific and injury-induced loss of differentiation markers and their re-expression during repair, as well as region-specific injury and repair transcriptional responses. Lastly, we created an interactive data visualization application for the scientific community to explore these results (http://humphreyslab.com/SingleCell/).
Subject(s)
Acute Kidney Injury/genetics; Acute Kidney Injury/pathology; Acute Kidney Injury/physiopathology; Animals; Cell Communication/genetics; Disease Models, Animal; Female; Gene Expression Profiling/methods; Gene Expression Profiling/statistics & numerical data; Mice; Mice, Inbred C57BL; RNA-Seq; Reperfusion Injury/genetics; Reperfusion Injury/pathology; Reperfusion Injury/physiopathology; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data; Software
ABSTRACT
Cell sorting, whereby a heterogeneous cell mixture segregates and forms distinct homogeneous tissues, is one of the main collective cell behaviors at work during development. Although differences in interfacial energies are recognized to be a possible driving source for cell sorting, no clear consensus has emerged on the kinetic law of cell sorting driven by differential adhesion. Using a modified Cellular Potts Model algorithm that allows for efficient simulations while preserving the connectivity of cells, we numerically explore cell-sorting dynamics over very large scales in space and time. For a binary mixture of cells surrounded by a medium, increase of domain size follows a power-law with exponent n = 1/4 independently of the mixture ratio, revealing that the kinetics is dominated by the diffusion and coalescence of rounded domains. We compare these results with recent numerical studies on cell sorting, and discuss the importance of algorithmic differences as well as boundary conditions on the observed scaling.
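A growth exponent such as n = 1/4 is typically read off as the slope of domain size against time on log-log axes. The sketch below fits that slope by ordinary least squares on synthetic data generated to follow an exact power law; the data, the prefactor and the function name are hypothetical, and this is not the analysis pipeline used in the study.

```python
import math


def fit_power_law_exponent(times, sizes):
    """Least-squares slope of log(size) versus log(time), i.e. the exponent n in size ~ t**n."""
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic domain sizes following size = 2 * t**(1/4) over six decades of time.
times = [10.0 ** k for k in range(1, 7)]
sizes = [2.0 * t ** 0.25 for t in times]

exponent = fit_power_law_exponent(times, sizes)
print(exponent)
```

With simulation output, the fit would normally be restricted to the late-time scaling regime, since early transients and finite-size effects bend the log-log curve.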
Subject(s)
Cell Adhesion/physiology; Cell Aggregation/physiology; Models, Biological; Algorithms; Animals; Biophysical Phenomena; Cell Movement/physiology; Computational Biology; Computer Simulation; Humans; Kinetics; Single-Cell Analysis/statistics & numerical data; Surface Tension
ABSTRACT
Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering has gained popularity due to its flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the lower dimensional latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on the MNIST benchmark data set and challenging real-world tasks of clustering mouse organs from single-cell RNA-sequencing measurements and defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines as well as competitor methods.
Subject(s)
Single-Cell Analysis/statistics & numerical data; Animals; Cluster Analysis; Computational Biology; Deep Learning; Gene Expression Profiling/statistics & numerical data; Leukocytes, Mononuclear/classification; Mice; Models, Biological; Normal Distribution; Organ Specificity; Phenotype; RNA-Seq/statistics & numerical data
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) technologies measure gene expression at single-cell resolution and provide a tool for exploring cell heterogeneity and cell types. Because of the low number of mRNA copies extracted per cell, scRNA-seq data exhibit a large number of dropouts, which hinders downstream analysis. We propose a statistical method, SDImpute (Single-cell RNA-seq Dropout Imputation), to implement block imputation for dropout events in scRNA-seq data. SDImpute automatically identifies dropout events based on the gene expression levels and the variations of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by utilizing gene expression unaffected by dropouts from similar cells. In experiments on simulated and real datasets, the results suggest that SDImpute is an effective tool to recover the data and preserve the heterogeneity of gene expression across cells. Compared with state-of-the-art imputation methods, SDImpute improves the accuracy of downstream analysis, including clustering, visualization, and differential expression analysis.
Subject(s)
RNA-Seq/statistics & numerical data; Single-Cell Analysis/statistics & numerical data; Software; Animals; Cluster Analysis; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Data Visualization; Databases, Nucleic Acid/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Genetic Techniques/statistics & numerical data; Humans; RNA, Messenger/genetics; RNA, Messenger/isolation & purification
ABSTRACT
Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cells scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.
Subject(s)
Genomics/statistics & numerical data; Machine Learning; Software; Animals; Cerebral Cortex/metabolism; Cluster Analysis; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Dendritic Cells/metabolism; Humans; Information Theory; Mice; RNA, Small Cytoplasmic/genetics; RNA-Seq; Single-Cell Analysis/statistics & numerical data
ABSTRACT
Synchronous oscillations in neural populations are considered to be controlled by inhibitory neurons. In the granular layer of the cerebellum, the two major cell types are excitatory granular cells (GCs) and inhibitory Golgi cells (GoCs). GC spatiotemporal dynamics, as the output of the granular layer, is highly regulated by GoCs. However, there are various types of inhibition implemented by GoCs. With inputs from mossy fibers, GCs and GoCs are reciprocally connected to exhibit different network motifs of synaptic connections. From the view of GCs, feedforward inhibition is expressed as the direct input from GoCs excited by mossy fibers, whereas feedback inhibition is from GoCs via GCs themselves. In addition, there are abundant gap junctions between GoCs, showing another form of inhibition. It remains unclear how these diverse forms of inhibition regulate changes in neural population oscillations. Leveraging a computational model of the granular layer network, we addressed this question by examining the emergence and modulation of network oscillation under different types of inhibition. We show that at the network level, feedback inhibition is crucial to generate neural oscillation. When GoC-GC synapses were equipped with short-term plasticity, oscillations were largely diminished. Robust oscillations can only appear with additional gap junctions. Moreover, there was a substantial level of cross-frequency coupling in oscillation dynamics. Such coupling was adjusted and strengthened by GoCs through feedback inhibition. Taken together, our results suggest that the cooperation of distinct types of GoC inhibition plays an essential role in regulating synchronous oscillations of the GC population. With GCs as the sole output of the granular network, their oscillation dynamics could potentially enhance the computational capability of downstream neurons.
Subject(s)
Cerebellar Cortex/cytology; Cerebellar Cortex/physiology; Models, Neurological; Animals; Computational Biology; Electrical Synapses/physiology; Excitatory Postsynaptic Potentials/physiology; Feedback, Physiological; Humans; Inhibitory Postsynaptic Potentials/physiology; Nerve Fibers/physiology; Nerve Net/cytology; Nerve Net/physiology; Neural Pathways/physiology; Neuronal Plasticity/physiology; Neurons/physiology; Single-Cell Analysis/statistics & numerical data; Synapses/physiology
ABSTRACT
Single cell RNA sequencing (scRNAseq) can be used to infer a temporal ordering of cellular states. Current methods for the inference of cellular trajectories rely on unbiased dimensionality reduction techniques. However, such biologically agnostic ordering can prove difficult for modeling complex developmental or differentiation processes. The cellular heterogeneity of dynamic biological compartments can result in sparse sampling of key intermediate cell states. To overcome these limitations, we develop a supervised machine learning framework, called Pseudocell Tracer, which infers trajectories in pseudospace rather than in pseudotime. The method uses a supervised encoder, trained with adjacent biological information, to project scRNAseq data into a low-dimensional manifold that maps the transcriptional states a cell can occupy. Then a generative adversarial network (GAN) is used to simulate pseudocells at regular intervals along a virtual cell-state axis. We demonstrate the utility of Pseudocell Tracer by modeling B cells undergoing immunoglobulin class switch recombination (CSR) during a prototypic antigen-induced antibody response. Our results revealed an ordering of key transcription factors regulating CSR to the IgG1 isotype, including the concomitant expression of Nfkb1 and Stat6 prior to the upregulation of Bach2 expression. Furthermore, the expression dynamics of genes encoding cytokine receptors suggest a poised IL-4 signaling state that precedes CSR to the IgG1 isotype.
Subject(s)
B-Lymphocytes/immunology; Immunoglobulin Class Switching/genetics; Supervised Machine Learning; Animals; B-Lymphocytes/metabolism; Basic-Leucine Zipper Transcription Factors/genetics; Computational Biology; Computer Simulation; Databases, Nucleic Acid; Gene Expression; Immunoglobulin G/genetics; Interleukin-4/immunology; Mice; Mice, Inbred C57BL; Models, Immunological; NF-kappa B p50 Subunit/genetics; Neural Networks, Computer; RNA-Seq/methods; RNA-Seq/statistics & numerical data; Receptors, Cytokine/genetics; Recombination, Genetic; STAT6 Transcription Factor/genetics; Signal Transduction; Single-Cell Analysis/methods; Single-Cell Analysis/statistics & numerical data
ABSTRACT
The blood system is often represented as a tree-like structure with stem cells that give rise to mature blood cell types through a series of demarcated steps. Although this representation has served as a model of hierarchical tissue organization for decades, single-cell technologies are shedding new light on the abundance of cell type intermediates and the molecular mechanisms that ensure balanced replenishment of differentiated cells. In this Brief Review, we exemplify new insights into blood cell differentiation generated by single-cell RNA sequencing, summarize considerations for the application of this technology, and highlight innovations that are leading the way to understand hematopoiesis at the resolution of single cells. Graphic Abstract: A graphic abstract is available for this article.
Subject(s)
Hematopoiesis/genetics; RNA-Seq/methods; Single-Cell Analysis/methods; Animals; Computational Biology/methods; Computational Biology/trends; Hematopoietic Stem Cells/cytology; Hematopoietic Stem Cells/metabolism; Humans; RNA-Seq/statistics & numerical data; RNA-Seq/trends; Single-Cell Analysis/statistics & numerical data; Single-Cell Analysis/trends
ABSTRACT
Clustering is an essential step in the analysis of single cell RNA-seq (scRNA-seq) data to shed light on tissue complexity, including the number of cell types and the transcriptomic signatures of each cell type. Due to its importance, novel methods have been developed recently for this purpose. However, different approaches generate varying estimates regarding the number of clusters and the single-cell level cluster assignments. This type of unsupervised clustering is challenging, and it is often hard to gauge which method to use because none of the existing methods outperforms the others across all scenarios. We present SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution. We tested SAME-clustering across 15 scRNA-seq datasets generated by different platforms, with the number of clusters varying from 3 to 15 and the number of single cells from 49 to 32,695. Results show that our SAME-clustering ensemble method yields enhanced clustering, in terms of both cluster assignments and number of clusters. The mixture model ensemble clustering is not limited to clustering scRNA-seq data and may be useful to a wide range of clustering applications.