ABSTRACT
Cell population delineation and identification is an essential step in single-cell and spatial-omics studies. Spatial-omics technologies can simultaneously measure information from three complementary domains related to this task: expression levels of a panel of molecular biomarkers at single-cell resolution, relative positions of cells, and images of tissue sections. However, existing computational methods for performing this task on single-cell spatial-omics datasets often relinquish information from one or more domains. The additional reliance on the availability of "atlas" training or reference datasets limits cell type discovery to well-defined but limited cell population labels, thus posing major challenges for using these methods in practice. Successful integration of all three domains presents an opportunity for uncovering cell populations that are functionally stratified by their spatial contexts at cellular and tissue levels, which is the key motivation for employing spatial-omics technologies in the first place. In this work, we introduce Cell Spatio- and Neighborhood-informed Annotation and Patterning (CellSNAP), a self-supervised computational method that learns a representation vector for each cell in tissue samples measured by spatial-omics technologies at single-cell or finer resolution. The learned representation vector fuses information about the corresponding cell across all three aforementioned domains. By applying CellSNAP to datasets spanning both spatial proteomic and spatial transcriptomic modalities, and across different tissue types and disease settings, we show that CellSNAP markedly enhances de novo discovery of biologically relevant cell populations at fine granularity, beyond current approaches, by fully integrating cells' molecular profiles with cellular neighborhood and tissue image information.
ABSTRACT
Although single-cell and spatial sequencing methods enable simultaneous measurement of more than one biological modality, no technology can capture all modalities within the same cell. For current data integration methods, the feasibility of cross-modal integration relies on the existence of highly correlated, a priori 'linked' features. We describe matching X-modality via fuzzy smoothed embedding (MaxFuse), a cross-modal data integration method that, through iterative coembedding, data smoothing and cell matching, uses all information in each modality to obtain high-quality integration even when features are weakly linked. MaxFuse is modality-agnostic and demonstrates high robustness and accuracy in the weak linkage scenario, achieving 20-70% relative improvement over existing methods under key evaluation metrics on benchmarking datasets. A prototypical example of weak linkage is the integration of spatial proteomic data with single-cell sequencing data. In two example analyses of this type, MaxFuse enabled the spatial consolidation of proteomic, transcriptomic and epigenomic information at single-cell resolution on the same tissue section.
ABSTRACT
The intestine is a complex organ that promotes digestion, extracts nutrients, participates in immune surveillance, maintains critical symbiotic relationships with microbiota and affects overall health [1]. The intestine has a length of over nine metres, along which there are differences in structure and function [2]. The localization of individual cell types, cell type development trajectories and detailed cell transcriptional programs probably drive these differences in function. Here, to better understand these differences, we evaluated the organization of single cells using multiplexed imaging and single-nucleus RNA and open chromatin assays across eight different intestinal sites from nine donors. Through systematic analyses, we find cell compositions that differ substantially across regions of the intestine and demonstrate the complexity of epithelial subtypes. We also find that the same cell types are organized into distinct neighbourhoods and communities, highlighting distinct immunological niches that are present in the intestine. We further map gene regulatory differences in these cells that are suggestive of a regulatory differentiation cascade, and associate intestinal disease heritability with specific cell types. These results describe the complexity of the cell composition, regulation and organization for this organ, and serve as an important reference map for understanding human biology and disease.
Subjects
Intestines , Single-Cell Analysis , Humans , Cell Differentiation/genetics , Chromatin/genetics , Epithelial Cells/cytology , Epithelial Cells/metabolism , Gene Expression Regulation , Intestinal Mucosa/cytology , Intestines/cytology , Intestines/immunology , Single-Cell Gene Expression Analysis
ABSTRACT
Data integration to align cells across batches has become a cornerstone of single-cell data analysis, critically affecting downstream results. Yet, how much biological signal is erased during integration? Currently, there are no guidelines for when the biological differences between samples are separable from batch effects, and thus data integration usually involves a lot of guesswork: cells across batches should be aligned to be "appropriately" mixed, while preserving "main cell type clusters". We show evidence that current paradigms for single-cell data integration are unnecessarily aggressive, removing biologically meaningful variation. To remedy this, we present a novel statistical model and computationally scalable algorithm, CellANOVA, to recover biological signal that is lost during single-cell data integration. CellANOVA utilizes a "pool-of-controls" design concept, applicable across diverse settings, to separate unwanted variation from biological variation of interest. When applied with existing integration methods, CellANOVA allows the recovery of subtle biological signals and corrects, to a large extent, the data distortion introduced by integration. Further, CellANOVA explicitly estimates cell- and gene-specific batch effect terms, which can be used to identify the cell types and pathways exhibiting the largest batch variations, providing clarity as to which biological signals can be recovered. These concepts are illustrated on studies of diverse designs, where the biological signals recovered by CellANOVA are validated by orthogonal assays. In particular, we show that CellANOVA is effective in the challenging case of single-cell and single-nuclei data integration, where the recovered biological signals are replicated in an independent study.
ABSTRACT
The ability to align individual cellular information from multiple experimental sources is fundamental for a systems-level understanding of biological processes. However, currently available tools are mainly designed for single-cell transcriptomics matching and integration, and generally rely on a large number of shared features across datasets for cell matching. This approach underperforms when applied to single-cell proteomic datasets due to the limited number of parameters simultaneously assessed and the lack of shared markers across these experiments. Here, we introduce a cell-matching algorithm, matching with partial overlap (MARIO), that accounts for both shared and distinct features and includes vital filtering steps to avoid suboptimal matching. MARIO accurately matches and integrates data from different single-cell proteomic and multimodal methods, including spatial techniques, and has cross-species capabilities. MARIO robustly matched tissue macrophages identified from COVID-19 lung autopsies via codetection by indexing imaging to macrophages recovered from COVID-19 bronchoalveolar lavage fluid by cellular indexing of transcriptomes and epitopes by sequencing, revealing unique immune responses within the lung microenvironment of patients with COVID-19.
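As a rough illustration of what cell matching on shared features can look like, the sketch below pairs cells from two simulated datasets by solving a linear assignment problem on correlation distances. It is not the MARIO algorithm: it ignores distinct features and has no filtering steps, and the function name `match_on_shared`, the simulated data, and all parameter values are assumptions made for this example only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_on_shared(X, Y):
    """Pair rows of X with rows of Y by minimizing total correlation distance.

    X and Y hold the same (shared) features in the same column order.
    Hypothetical helper for illustration; not part of any published package.
    """
    cost = cdist(X, Y, metric="correlation")  # 1 - Pearson correlation
    rows, cols = linear_sum_assignment(cost)  # globally optimal 1-to-1 pairing
    return rows, cols, cost[rows, cols]


# Simulated example: the same 50 cells measured twice with small noise,
# with the second dataset stored in shuffled order.
rng = np.random.default_rng(0)
n_cells, n_shared = 50, 8
truth = rng.normal(size=(n_cells, n_shared))
perm = rng.permutation(n_cells)
X = truth + 0.05 * rng.normal(size=truth.shape)
Y = truth[perm] + 0.05 * rng.normal(size=truth.shape)

rows, cols, dist = match_on_shared(X, Y)
# Row j of Y is a copy of cell perm[j], so a correct pair has perm[col] == row.
accuracy = np.mean(perm[cols] == rows)
print(f"fraction of correctly matched cells: {accuracy:.2f}")
```

With only a handful of shared features and higher noise, a pairing of this kind degrades quickly, which is the setting the abstract describes as motivating the use of distinct features and filtering.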
Subjects
COVID-19 , Proteomics , Humans , Proteomics/methods , Gene Expression Profiling/methods , Transcriptome , Lung , Single-Cell Analysis/methods
ABSTRACT
Single-cell sequencing methods have enabled the profiling of multiple types of molecular readouts at cellular resolution, and recent developments in spatial barcoding, in situ hybridization, and in situ sequencing allow such molecular readouts to retain their spatial context. Since no technology can provide complete characterization across all layers of biological modalities within the same cell, there is a pervasive need for computational cross-modal integration (also called diagonal integration) of single-cell and spatial omics data. For current methods, the feasibility of cross-modal integration relies on the existence of highly correlated, a priori "linked" features. When such linked features are few or uninformative, a scenario that we call "weak linkage", existing methods fail. We developed MaxFuse, a cross-modal data integration method that, through iterative co-embedding, data smoothing, and cell matching, leverages all information in each modality to obtain high-quality integration. MaxFuse is modality-agnostic and, through comprehensive benchmarks on single-cell and spatial ground-truth multiome datasets, demonstrates high robustness and accuracy in the weak linkage scenario. A prototypical example of weak linkage is the integration of spatial proteomic data with single-cell sequencing data. In two example analyses of this type, we demonstrate how MaxFuse enables the spatial consolidation of proteomic, transcriptomic and epigenomic information at single-cell resolution on the same tissue section.
ABSTRACT
Brain networks are increasingly characterized at different scales, including summary statistics, community connectivity, and individual edges. While research relating brain networks to behavioral measurements has yielded many insights into brain-phenotype relationships, common analytical approaches only consider network information at a single scale. Here, we designed, implemented, and deployed Multi-Scale Network Regression (MSNR), a penalized multivariate approach for modeling brain networks that explicitly respects both edge- and community-level information by assuming a low-rank and sparse structure, both of which encourage less complex and more interpretable modeling. Capitalizing on a large neuroimaging cohort (n = 1,051), we demonstrate that MSNR recapitulates interpretable and statistically significant connectivity patterns associated with brain development, sex differences, and motion-related artifacts. Compared to single-scale methods, MSNR achieves a balance between prediction performance and model complexity, with improved interpretability. Together, by jointly exploiting both edge- and community-level information, MSNR has the potential to yield novel insights into brain-behavior relationships.
Subjects
Brain/physiology , Connectome/methods , Magnetic Resonance Imaging/methods , Statistical Models , Nerve Net/physiology , Adolescent , Brain/diagnostic imaging , Cross-Sectional Studies , Female , Humans , Individuality , Male , Nerve Net/diagnostic imaging , Phenotype , Regression Analysis , Sex Characteristics
ABSTRACT
Neurobiological abnormalities associated with psychiatric disorders do not map well to existing diagnostic categories. High co-morbidity suggests dimensional circuit-level abnormalities that cross diagnoses. Here we seek to identify brain-based dimensions of psychopathology using sparse canonical correlation analysis in a sample of 663 youths. This analysis reveals correlated patterns of functional connectivity and psychiatric symptoms. We find that four dimensions of psychopathology - mood, psychosis, fear, and externalizing behavior - are associated (r = 0.68-0.71) with distinct patterns of connectivity. Loss of network segregation between the default mode network and executive networks emerges as a common feature across all dimensions. Connectivity linked to mood and psychosis becomes more prominent with development, and sex differences are present for connectivity related to mood and fear. Critically, findings largely replicate in an independent dataset (n = 336). These results delineate connectivity-guided dimensions of psychopathology that cross clinical diagnostic categories, which could serve as a foundation for developing network-based biomarkers in psychiatry.
Subjects
Brain/physiology , Nerve Net/physiology , Psychopathology , Adolescent , Adult , Child , Cohort Studies , Female , Humans , Male , Multivariate Analysis , Reproducibility of Results , Sex Characteristics , Young Adult
ABSTRACT
Continuous treatments (e.g., doses) arise often in practice, but many available causal effect estimators are limited by either requiring parametric models for the effect curve, or by not allowing doubly robust covariate adjustment. We develop a novel kernel smoothing approach that requires only mild smoothness assumptions on the effect curve, and still allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and give a procedure for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmission penalties.
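The abstract does not spell out the estimator, so the following is only a minimal sketch of how a doubly robust kernel-smoothing estimate of an effect curve might be organized. It assumes a pseudo-outcome built from an outcome regression and a treatment density, followed by local linear regression on the treatment. The linear working models, the normal treatment density, the fixed bandwidth `h`, the function names, and the simulated data are all assumptions of this example; data-driven bandwidth selection is not reproduced.

```python
import numpy as np


def normal_pdf(z, sd):
    """Density of N(0, sd^2) evaluated at z."""
    return np.exp(-0.5 * (z / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def dr_effect_curve(X, A, Y, grid, h):
    """Sketch: pseudo-outcome plus local linear smoothing over treatment A."""
    n = len(A)
    ones = np.ones(n)
    # Working outcome regression mu(x, a): least squares of Y on (1, X, A).
    D_mu = np.column_stack([ones, X, A])
    beta_mu, *_ = np.linalg.lstsq(D_mu, Y, rcond=None)
    mu_obs = D_mu @ beta_mu
    # Working treatment density pi(a | x): normal with mean linear in x.
    D_pi = np.column_stack([ones, X])
    beta_pi, *_ = np.linalg.lstsq(D_pi, A, rcond=None)
    mean_A = D_pi @ beta_pi
    sd_A = np.std(A - mean_A, ddof=2)
    pi_obs = normal_pdf(A - mean_A, sd_A)
    # Average both working models over the empirical covariate distribution.
    pi_marg = normal_pdf(A[:, None] - mean_A[None, :], sd_A).mean(axis=1)
    mu_marg = beta_mu[0] + beta_mu[1] * X.mean() + beta_mu[2] * A
    # Pseudo-outcome: density-weighted residual plus marginalized regression.
    pseudo = (Y - mu_obs) / pi_obs * pi_marg + mu_marg
    # Local linear regression of the pseudo-outcome on A at each grid point.
    est = []
    for a0 in grid:
        w = normal_pdf(A - a0, h)              # Gaussian kernel weights
        D = np.column_stack([ones, A - a0])
        WD = D * w[:, None]
        coef = np.linalg.solve(D.T @ WD, WD.T @ pseudo)
        est.append(coef[0])                    # intercept = fitted value at a0
    return np.array(est)


# Simulated data in which X confounds A and Y; the true curve is theta(a) = a.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
A = 0.5 * X + rng.normal(size=n)
Y = A + X + rng.normal(size=n)

grid = np.array([-1.0, 0.0, 1.0])
curve = dr_effect_curve(X, A, Y, grid, h=0.3)
print(np.round(curve, 2))
```

In this simulation an unadjusted regression of Y on A would be biased because X drives both; the pseudo-outcome is what carries the covariate adjustment into the smoothing step.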
ABSTRACT
This paper considers a sparse spiked covariance matrix model in the high-dimensional setting and studies the minimax estimation of the covariance matrix and the principal subspace as well as the minimax rank detection. The optimal rate of convergence for estimating the spiked covariance matrix under the spectral norm is established, which requires significantly different techniques from those for estimating other structured covariance matrices such as bandable or sparse covariance matrices. We also establish the minimax rate under the spectral norm for estimating the principal subspace, the primary object of interest in principal component analysis. In addition, the optimal rate for the rank detection boundary is obtained. This result also resolves the gap in a recent paper by Berthet and Rigollet [2] where the special case of rank one is considered.
ABSTRACT
A new formulation for the construction of adaptive confidence bands in non-parametric function estimation problems is proposed. Confidence bands are constructed which have size that adapts to the smoothness of the function while guaranteeing that both the relative excess mass of the function lying outside the band and the measure of the set of points where the function lies outside the band are small. It is shown that the bands adapt over a maximum range of Lipschitz classes. The adaptive confidence band can be easily implemented in standard statistical software with wavelet support. Numerical performance of the procedure is investigated using both simulated and real datasets. The numerical results agree well with the theoretical analysis. The procedure can be easily modified and used for other nonparametric function estimation models.
ABSTRACT
We study the rate of convergence for the largest eigenvalue distributions in the Gaussian unitary and orthogonal ensembles to their Tracy-Widom limits. We show that one can achieve an O(N^{-2/3}) rate with particular choices of the centering and scaling constants. The arguments here also shed light on more complicated cases of Laguerre and Jacobi ensembles, in both unitary and orthogonal versions. Numerical work shows that the suggested constants yield reasonable approximations even for surprisingly small values of N.
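A small simulation can make the convergence statement concrete. The sketch below draws matrices from the Gaussian unitary ensemble and rescales the largest eigenvalue with the classical centering 2*sqrt(N) and scaling N^(1/6). These are not the refined constants referred to in the abstract, which are not given here, and the function names and sample sizes are assumptions of this example.

```python
import numpy as np


def gue_largest_eigenvalue(n, rng):
    """Largest eigenvalue of an n x n GUE matrix.

    Normalization: diagonal entries are N(0, 1) and off-diagonal entries are
    complex with E|h_ij|^2 = 1, so the spectrum edge sits near 2 * sqrt(n).
    """
    x = rng.normal(size=(n, n))
    y = rng.normal(size=(n, n))
    a = (x + 1j * y) / np.sqrt(2.0)
    h = (a + a.conj().T) / np.sqrt(2.0)  # Hermitian by construction
    return np.linalg.eigvalsh(h)[-1]     # eigvalsh returns ascending order


def rescaled_samples(n, trials, seed=0):
    """Classically centered and scaled largest eigenvalues."""
    rng = np.random.default_rng(seed)
    lam = np.array([gue_largest_eigenvalue(n, rng) for _ in range(trials)])
    return (lam - 2.0 * np.sqrt(n)) * n ** (1.0 / 6.0)


samples = rescaled_samples(n=100, trials=300)
sample_mean = samples.mean()
sample_sd = samples.std(ddof=1)
print(f"mean = {sample_mean:.2f}, sd = {sample_sd:.2f}")
```

For comparison, the Tracy-Widom distribution for the unitary case has mean roughly -1.77 and standard deviation roughly 0.90, so the printed values should land in that neighborhood, up to Monte Carlo and finite-N error.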
ABSTRACT
Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse.