ABSTRACT
Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to the manifold (weighted-)Laplacian with rates, when [Formula: see text] data points are i.i.d. sampled from a general [Formula: see text]-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of [Formula: see text] and kernel bandwidth [Formula: see text], the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be [Formula: see text] at finite large [Formula: see text] up to log factors, achieved at the scaling of [Formula: see text]. When the manifold data are corrupted by outlier noise, we theoretically prove the point-wise consistency of the graph Laplacian, with a rate that matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that an approximate rather than exact bi-stochastic normalization achieves the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of the bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
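The Sinkhorn-Knopp scaling with early termination mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel form, the function names `gaussian_kernel` and `sinkhorn_knopp`, and the stopping rule (largest row-sum deviation below `tol`) are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(points, eps):
    """Affinity matrix K[i][j] = exp(-||x_i - x_j||^2 / eps) (assumed kernel form)."""
    n = len(points)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / eps)
             for j in range(n)] for i in range(n)]

def sinkhorn_knopp(K, tol=1e-6, max_iter=1000):
    """Alternately rescale columns and rows of K.

    Stops early once every row sum of diag(u) K diag(v) is within `tol` of 1,
    so the result is only approximately bi-stochastic.
    """
    n = len(K)
    u = [1.0] * n
    v = [1.0] * n
    err = float("inf")
    for it in range(max_iter):
        # Column step: after this update every column sum equals 1.
        v = [1.0 / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
        # Measure how far the row sums are from 1 and stop early if close enough.
        row_sums = [u[i] * sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        err = max(abs(r - 1.0) for r in row_sums)
        if err < tol:
            break
        # Row step: after this update every row sum equals 1.
        u = [1.0 / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
    W = [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]
    return W, it + 1, err
```

A looser `tol` terminates the iteration sooner; the abstract's point is that such an approximate scaling can still attain the same consistency rate.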
ABSTRACT
Single-cell RNA sequencing has been widely used to investigate cell state transitions and gene dynamics of biological processes. Current strategies to infer the sequential dynamics of genes in a process typically rely on constructing cell pseudotime through cell trajectory inference. However, the presence of concurrent gene processes in the same group of cells and technical noise can obscure the true progression of the processes studied. To address this challenge, we present GeneTrajectory, an approach that identifies trajectories of genes rather than trajectories of cells. Specifically, optimal transport distances are calculated between gene distributions across the cell-cell graph to extract gene programs and define their gene pseudotemporal order. Here we demonstrate that GeneTrajectory accurately extracts progressive gene dynamics in myeloid lineage maturation. Moreover, we show that GeneTrajectory deconvolves key gene programs underlying mouse skin hair follicle dermal condensate differentiation that could not be resolved by cell trajectory approaches. GeneTrajectory facilitates the discovery of gene programs that control the changes and activities of biological processes.
ABSTRACT
Accurate cell marker identification in single-cell RNA-seq data is crucial for understanding cellular diversity and function. An ideal marker is highly specific in identifying cells that are similar in terms of function and state. Current marker identification methods, commonly based on clustering and differential expression, capture general cell-type markers but often miss markers for subtypes or functional cell subsets, with their performance largely dependent on clustering quality. Moreover, cluster-independent approaches tend to favor genes that lack the specificity required to characterize regions within the transcriptomic space at multiple scales. Here we introduce Localized Marker Detector (LMD), a novel tool to identify "localized genes" - genes with expression profiles specific to certain groups of highly similar cells - thereby characterizing cellular diversity in a multi-resolution and fine-grained manner. LMD's strategy involves building a cell-cell affinity graph, diffusing the gene expression value across the cell graph, and assigning a score to each gene based on its diffusion dynamics. We show that LMD exhibits superior accuracy in recovering known cell-type markers in the Tabula Muris bone marrow dataset relative to other methods for marker identification. Notably, markers favored by LMD exhibit localized expression, whereas markers prioritized by other clustering-free algorithms are often dispersed in the transcriptomic space. We further group the markers suggested by LMD into functional gene modules to improve the separation of cell types and subtypes in a more fine-grained manner. These modules also identify other sources of variation, such as cell cycle status. In conclusion, LMD is a novel algorithm that can identify fine-grained markers for cell subtypes or functional states without relying on clustering or differential expression analysis. LMD exploits the complex interactions among cells and reveals cellular diversity at high resolution.
ABSTRACT
A better understanding of various patterns in the coronavirus disease 2019 (COVID-19) spread in different parts of the world is crucial to its prevention and control. Motivated by the previously developed Global Epidemic and Mobility (GLEaM) model, this paper proposes a new stochastic dynamic model to depict the evolution of COVID-19. The model allows spatial and temporal heterogeneity of transmission parameters and involves transportation between regions. Based on the proposed model, this paper also designs a two-step procedure for parameter inference, which utilizes the correlation between regions through a prior distribution that imposes graph Laplacian regularization on transmission parameters. Experiments on simulated data and real-world data in China and Europe indicate that the proposed model achieves higher accuracy in predicting the newly confirmed cases than baseline models.
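As a rough illustration of what graph Laplacian regularization on region-level transmission parameters means, the sketch below evaluates the quadratic form that such a prior typically penalizes. The function name `laplacian_penalty`, the region labels, and the edge weights are hypothetical; the abstract does not specify the exact form of the prior.

```python
def laplacian_penalty(beta, edges):
    """Quadratic form beta^T L beta for a weighted graph with Laplacian L = D - W.

    beta  : dict mapping region name -> transmission parameter
    edges : list of (region_i, region_j, weight), one entry per undirected edge
    The value is small when connected regions have similar parameters.
    """
    return sum(w * (beta[i] - beta[j]) ** 2 for i, j, w in edges)

# Hypothetical example: two connected regions with different parameters.
penalty = laplacian_penalty({"A": 1.0, "B": 3.0}, [("A", "B", 0.5)])
```

Under a Gaussian-type prior, the log-density would be proportional to the negative of this penalty, which pulls the estimates of strongly connected regions toward each other.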
Subjects
COVID-19; Epidemics; COVID-19/epidemiology; China/epidemiology; Europe/epidemiology; Humans
ABSTRACT
The recent success of generative adversarial networks and variational learning suggests that training a classification network may work well in addressing the classical two-sample problem, which asks to differentiate two densities given finite samples from each one. Network-based methods have the computational advantage that the algorithm scales to large datasets. This paper considers using the classification logit function, which is provided by a trained classification neural network and evaluated on the testing set split of the two datasets, to compute a two-sample statistic. To analyze the approximation and estimation error of the logit function to differentiate near-manifold densities, we introduce a new result of near-manifold integral approximation by neural networks. We then show that the logit function provably differentiates two sub-exponential densities given that the network is sufficiently parametrized, and for on or near manifold densities, the needed network complexity is reduced to only scale with the intrinsic dimensionality. In experiments, the network logit test demonstrates better performance than previous network-based tests using classification accuracy, and also compares favorably to certain kernel maximum mean discrepancy tests on synthetic datasets and hand-written digit datasets.
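A minimal sketch of a logit-based two-sample statistic follows. It assumes the statistic is the difference between the mean logit on the two held-out test samples, and it calibrates the result with a permutation test; both the permutation calibration and the names `logit_statistic` and `permutation_p_value` are assumptions for illustration. The `logit` argument stands in for the trained classification network, which is not reproduced here.

```python
import random

def logit_statistic(logit, x_test, y_test):
    """Difference of the mean logit over the two held-out samples."""
    mean_x = sum(logit(x) for x in x_test) / len(x_test)
    mean_y = sum(logit(y) for y in y_test) / len(y_test)
    return mean_x - mean_y

def permutation_p_value(logit, x_test, y_test, n_perm=500, seed=0):
    """One-sided permutation p-value for the logit statistic.

    Sample labels are shuffled within the pooled test set while the logit
    function (the trained classifier) is kept fixed.
    """
    rng = random.Random(seed)
    observed = logit_statistic(logit, x_test, y_test)
    pooled = list(x_test) + list(y_test)
    n_x = len(x_test)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if logit_statistic(logit, pooled[:n_x], pooled[n_x:]) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)
```

Because the classifier is trained on a separate split, permuting only the test-set labels leaves the logit function untouched, which is what makes this calibration cheap.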
ABSTRACT
Comprehensive and accurate comparisons of transcriptomic distributions of cells from samples taken from two different biological states, such as healthy versus diseased individuals, are an emerging challenge in single-cell RNA sequencing (scRNA-seq) analysis. Current methods for detecting differentially abundant (DA) subpopulations between samples rely heavily on initial clustering of all cells in both samples. Often, this clustering step is inadequate since the DA subpopulations may not align with a clear cluster structure, and important differences between the two biological states can be missed. Here, we introduce DA-seq, a targeted approach for identifying DA subpopulations not restricted to clusters. DA-seq is a multiscale method that quantifies a local DA measure for each cell, which is computed from its k nearest neighboring cells across a range of k values. Based on this measure, DA-seq delineates contiguous significant DA subpopulations in the transcriptomic space. We apply DA-seq to several scRNA-seq datasets and highlight its improved ability to detect differences between distinct phenotypes in severe versus mildly ill COVID-19 patients, melanomas subjected to immune checkpoint therapy comparing responders to nonresponders, embryonic development at two time points, and young versus aging brain tissue. DA-seq enabled us to detect differences between these phenotypes. Importantly, we find that DA-seq not only recovers the DA cell types as discovered in the original studies but also reveals additional DA subpopulations that were not described before. Analysis of these subpopulations yields biological insights that would otherwise be undetected using conventional computational approaches.
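The multiscale local measure described in this abstract can be sketched as a score vector per cell: for each value of k, the fraction of the cell's k nearest neighbors that come from one of the two conditions. This is a simplified illustration, not the DA-seq implementation; the function name `knn_da_scores`, the Euclidean distance, and the brute-force neighbor search are assumptions.

```python
def knn_da_scores(cells, labels, k_values):
    """Return, for each cell, one score per k in k_values.

    cells    : list of coordinate tuples (e.g. a low-dimensional embedding)
    labels   : list of 0/1 condition labels, one per cell
    k_values : neighborhood sizes to scan
    Each score is the fraction of the k nearest neighbors (the cell itself
    excluded) that carry label 1.
    """
    scores = []
    for i, c in enumerate(cells):
        # Rank all other cells by squared Euclidean distance to cell i.
        order = sorted(
            (j for j in range(len(cells)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(c, cells[j])),
        )
        scores.append([sum(labels[j] for j in order[:k]) / k for k in k_values])
    return scores
```

Scores near 0 or 1 that persist across several values of k mark neighborhoods dominated by one condition, independent of any clustering.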
Subjects
Aging/genetics; COVID-19/genetics; Cell Lineage/genetics; Melanoma/genetics; RNA, Small Cytoplasmic/genetics; Skin Neoplasms/genetics; Aging/metabolism; B-Lymphocytes/immunology; B-Lymphocytes/virology; Brain/cytology; Brain/metabolism; COVID-19/immunology; COVID-19/pathology; COVID-19/virology; Cell Lineage/immunology; Cytokines/genetics; Cytokines/immunology; Datasets as Topic; Dendritic Cells/immunology; Dendritic Cells/virology; Gene Expression Profiling; Gene Expression Regulation; High-Throughput Nucleotide Sequencing; Humans; Melanoma/immunology; Melanoma/pathology; Monocytes/immunology; Monocytes/virology; Phenotype; RNA, Small Cytoplasmic/immunology; SARS-CoV-2/pathogenicity; Severity of Illness Index; Single-Cell Analysis/methods; Skin Neoplasms/immunology; Skin Neoplasms/pathology; T-Lymphocytes/immunology; T-Lymphocytes/virology; Transcriptome
ABSTRACT
The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between [Formula: see text] data points and a set of [Formula: see text] reference points, where [Formula: see text] can be drastically smaller than [Formula: see text]. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved, under mild assumptions of the kernel, as long as [Formula: see text], and a finite-sample lower bound of the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets are demonstrated, which motivate the proposed approach to compare distributions.
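The asymmetric, reference-point construction can be sketched as follows: each sample is compared only with a small set of reference points, and the statistic averages the squared difference of the two samples' mean affinities over those references. This is a simplified illustration under stated assumptions: the local covariance at each reference point is taken to be diagonal (so no matrix inversion is needed), and the names `affinity` and `reference_mmd` are hypothetical.

```python
import math

def affinity(ref, inv_var, x):
    """Anisotropic Gaussian affinity between a reference point and a sample.

    inv_var holds 1/variance per axis, i.e. a diagonal local covariance
    attached to the reference point (a simplifying assumption).
    """
    return math.exp(-0.5 * sum(w * (a - b) ** 2 for w, a, b in zip(inv_var, x, ref)))

def reference_mmd(xs, ys, refs, inv_vars):
    """Mean over reference points of the squared gap between the two
    samples' average affinities to that reference point."""
    total = 0.0
    for ref, inv_var in zip(refs, inv_vars):
        mean_x = sum(affinity(ref, inv_var, x) for x in xs) / len(xs)
        mean_y = sum(affinity(ref, inv_var, y) for y in ys) / len(ys)
        total += (mean_x - mean_y) ** 2
    return total / len(refs)
```

The cost scales with the number of samples times the number of reference points, which is why a reference set much smaller than the data is attractive.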
ABSTRACT
The extraction of clusters from a dataset which includes multiple clusters and a significant background component is a non-trivial task of practical importance. In image analysis this manifests for example in anomaly detection and target detection. The traditional spectral clustering algorithm, which relies on the leading K eigenvectors to detect K clusters, fails in such cases. In this paper we propose the spectral embedding norm which sums the squared values of the first I normalized eigenvectors, where I can be significantly larger than K. We prove that this quantity can be used to separate clusters from the background in unbalanced settings, including extreme cases such as outlier detection. The performance of the algorithm is not sensitive to the choice of I, and we demonstrate its application on synthetic and real-world remote sensing and neuroimaging datasets.
ABSTRACT
BACKGROUND: Hypertension is common in China and its prevalence is rising, yet it remains inadequately controlled. Few studies have the capacity to characterise the epidemiology and management of hypertension across many heterogeneous subgroups. We did a study of the prevalence, awareness, treatment, and control of hypertension in China and assessed their variations across many subpopulations. METHODS: We made use of data generated in the China Patient-Centered Evaluative Assessment of Cardiac Events (PEACE) Million Persons Project from Sept 15, 2014, to June 20, 2017, a population-based screening project that enrolled around 1·7 million community-dwelling adults aged 35-75 years from all 31 provinces in mainland China. In this population, we defined hypertension as systolic blood pressure of at least 140 mm Hg, or diastolic blood pressure of at least 90 mm Hg, or self-reported antihypertensive medication use in the previous 2 weeks. Hypertension awareness, treatment, and control were defined, respectively, among hypertensive adults as a self-reported diagnosis of hypertension, current use of antihypertensive medication, and blood pressure of less than 140/90 mm Hg. We assessed awareness, treatment, and control in 264 475 population subgroups, defined a priori by all possible combinations of 11 demographic and clinical factors (age [35-44, 45-54, 55-64, and 65-75 years], sex [men and women], geographical region [western, central, and eastern China], urbanity [urban vs rural], ethnic origin [Han and non-Han], occupation [farmer and non-farmer], annual household income [< ¥10 000, ¥10 000-50 000, and ≥¥50 000], education [primary school and below, middle school, high school, and college and above], previous cardiovascular events [yes or no], current smoker [yes or no], and diabetes [yes or no]), and their associations with individual and primary health-care site characteristics, using mixed models.
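The case definitions in the METHODS paragraph can be written out as a small sketch. The function name `classify` and its argument names are hypothetical, and a single `on_medication` flag is assumed to stand for both "medication use in the previous 2 weeks" and "current use of antihypertensive medication".

```python
def classify(sbp, dbp, on_medication, self_reported_diagnosis):
    """Apply the abstract's definitions to one participant.

    sbp, dbp                : systolic / diastolic blood pressure in mm Hg
    on_medication           : self-reported antihypertensive medication use
    self_reported_diagnosis : participant reports a hypertension diagnosis
    """
    hypertensive = sbp >= 140 or dbp >= 90 or on_medication
    if not hypertensive:
        # Awareness, treatment, and control are defined only among
        # hypertensive adults.
        return {"hypertension": False}
    return {
        "hypertension": True,
        "aware": self_reported_diagnosis,
        "treated": on_medication,
        "controlled": sbp < 140 and dbp < 90,
    }
```

Under these definitions a participant can count as controlled only if they qualify as hypertensive through medication use, since an untreated participant with blood pressure below 140/90 mm Hg is not classified as hypertensive at all.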
FINDINGS: The sample contained 1 738 886 participants with a mean age of 55·6 years (SD 9·7), 59·5% of whom were women. 44·7% (95% CI 44·6-44·8) of the sample had hypertension, of whom 44·7% (44·6-44·8) were aware of their diagnosis, 30·1% (30·0-30·2) were taking prescribed antihypertensive medications, and 7·2% (7·1-7·2) had achieved control. The age-standardised and sex-standardised rates of hypertension prevalence, awareness, treatment, and control were 37·2% (37·1-37·3), 36·0% (35·8-36·2), 22·9% (22·7-23·0), and 5·7% (5·6-5·7), respectively. The most commonly used medication class was calcium-channel blockers (55·2%, 55·0-55·4). Among individuals whose hypertension was treated but not controlled, 81·5% (81·3-81·6) were using only one medication. The proportion of participants who were aware of their hypertension and were receiving treatment varied significantly across subpopulations; lower likelihoods of awareness and treatment were associated with male sex, younger age, lower income, and an absence of previous cardiovascular events, diabetes, obesity, or alcohol use (all p<0·01). By contrast, control rate was universally low across all subgroups (<30·0%). INTERPRETATION: Among Chinese adults aged 35-75 years, nearly half have hypertension, fewer than a third are being treated, and fewer than one in twelve are in control of their blood pressure. The low number of people in control is ubiquitous in all subgroups of the Chinese population and warrants a broad-based, global strategy, such as greater efforts in prevention, as well as better screening and more effective and affordable treatment. FUNDING: Ministry of Finance and National Health and Family Planning Commission, China.
Subjects
Health Knowledge, Attitudes, Practice; Hypertension/epidemiology; Mass Screening; Adult; Age Factors; Aged; Antihypertensive Agents/therapeutic use; China/epidemiology; Female; Humans; Hypertension/diagnosis; Hypertension/drug therapy; Male; Middle Aged; Prevalence; Sex Factors
ABSTRACT
One of the primary challenges in single particle reconstruction with cryo-electron microscopy is to find a three-dimensional model of a molecule using its noisy two-dimensional projection-images. As the imaging orientations of the projection-images are unknown, we suggest a common-lines-based method to simultaneously estimate the imaging orientations of all images that is independent of the distribution of the orientations. Since the relative orientation of each pair of images may only be estimated up to a two-way handedness ambiguity, we suggest an efficient procedure to consistently assign the same handedness to all relative orientations. This is achieved by casting the handedness assignment problem as a graph-partitioning problem. Once a consistent handedness of all relative orientations is determined, the orientations corresponding to all projection-images are determined simultaneously, thus rendering the method robust to noise. Our proposed method also has the advantage of allowing one to incorporate, in a natural manner, confidence information regarding the trustworthiness of each relative orientation. We demonstrate the efficacy of our approach using simulated clean and noisy data.
ABSTRACT
Nucleation of various ordered phases in block copolymers is studied by examining the free-energy landscape within the self-consistent field theory. The minimum energy path (MEP) connecting two ordered phases is computed using a recently developed string method. The shape, size, and free-energy barrier of critical nuclei are obtained from the MEP, providing information about the emergence of a stable ordered phase from a metastable phase. In particular, the structural evolution of an embryonic gyroid nucleus is predicted to follow two possible MEPs, revealing an interesting transition pathway with an intermediate perforated layered structure.