Results 1 - 20 of 75
1.
Proc Natl Acad Sci U S A ; 121(10): e2313719121, 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38416677

ABSTRACT

Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.


Subjects
Algorithms , Gene Expression Profiling , Gene Expression , Single-Cell Analysis
2.
Proc Natl Acad Sci U S A ; 121(3): e2318989121, 2024 Jan 16.
Article in English | MEDLINE | ID: mdl-38215186

ABSTRACT

The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC's infinitesimal generator (rate) matrix. Motivated by the derivative's extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a "blessing of dimensionality" result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.


Subjects
COVID-19 , SARS-CoV-2 , Humans , Algorithms , COVID-19/epidemiology , Markov Chains
3.
Proc Natl Acad Sci U S A ; 119(45): e2211449119, 2022 Nov 08.
Article in English | MEDLINE | ID: mdl-36322754

ABSTRACT

The common intuition among the ecologists of the mid-twentieth century was that large ecosystems should be more stable than those with a smaller number of species. This view was challenged by Robert May, who found a stability bound for randomly assembled ecosystems; they become unstable for a sufficiently large number of species. In the present work, we show that May's bound greatly changes when the past population densities of a species affect its own current density. This is a common feature in real systems, where the effects of species' interactions may appear after a time lag rather than instantaneously. The local stability of these models with self-interaction is described by bounds, which we characterize in the parameter space. We find a critical delay curve that separates the region of stability from that of instability, and correspondingly, we identify a critical frequency curve that provides the characteristic frequencies of a system at the instability threshold. Finally, we calculate analytically the distributions of eigenvalues that generalize Wigner's as well as Girko's laws. Interestingly, we find that, for sufficiently large delays, the eigenvalues of a randomly coupled system are complex even when the interactions are symmetric.


Subjects
Ecosystem , Population Density
4.
MAGMA ; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38349453

ABSTRACT

OBJECTIVE: To develop and evaluate a technique combining eddy current-nulled convex optimized diffusion encoding (ENCODE) with random matrix theory (RMT)-based denoising to accelerate and improve the apparent signal-to-noise ratio (aSNR) and apparent diffusion coefficient (ADC) mapping in high-resolution prostate diffusion-weighted MRI (DWI). MATERIALS AND METHODS: Eleven subjects with clinical suspicion of prostate cancer were scanned at 3T with high-resolution (HR) (in-plane: 1.0 × 1.0 mm²) ENCODE and standard-resolution (1.6 × 2.2 mm²) bipolar DWI sequences (both had 7 repetitions for averaging, acquisition time [TA] of 5 min 50 s). HR-ENCODE was retrospectively analyzed using three repetitions (accelerated effective TA of 2 min 30 s). The RMT-based denoising pipeline utilized complex DWI signals and Marchenko-Pastur distribution-based principal component analysis to remove additive Gaussian noise in images from multiple coils, b-values, diffusion encoding directions, and repetitions. HR-ENCODE with RMT-based denoising (HR-ENCODE-RMT) was compared with HR-ENCODE in terms of aSNR in prostate peripheral zone (PZ) and transition zone (TZ). Precision and accuracy of ADC were evaluated by the coefficient of variation (CoV) between repeated measurements and mean difference (MD) compared to the bipolar ADC reference, respectively. Differences were compared using two-sided Wilcoxon signed-rank tests (P < 0.05 considered significant). RESULTS: HR-ENCODE-RMT yielded 62% and 56% higher median aSNR than HR-ENCODE (b = 800 s/mm²) in PZ and TZ, respectively (P < 0.001). HR-ENCODE-RMT achieved 63% and 70% lower ADC-CoV than HR-ENCODE in PZ and TZ, respectively (P < 0.001). HR-ENCODE-RMT ADC and bipolar ADC had low MD of 22.7 × 10⁻⁶ mm²/s in PZ and low MD of 90.5 × 10⁻⁶ mm²/s in TZ. CONCLUSIONS: HR-ENCODE-RMT can shorten the acquisition time and improve the aSNR of high-resolution prostate DWI and achieve accurate and precise ADC measurements in the prostate.

5.
Proc Natl Acad Sci U S A ; 118(45)2021 Nov 09.
Article in English | MEDLINE | ID: mdl-34725154

ABSTRACT

Fluids in natural systems, like the cytoplasm of a cell, often contain thousands of molecular species that are organized into multiple coexisting phases that enable diverse and specific functions. How interactions between numerous molecular species encode for various emergent phases is not well understood. Here, we leverage approaches from random-matrix theory and statistical physics to describe the emergent phase behavior of fluid mixtures with many species whose interactions are drawn randomly from an underlying distribution. Through numerical simulation and stability analyses, we show that these mixtures exhibit staged phase-separation kinetics and are characterized by multiple coexisting phases at steady state with distinct compositions. Random-matrix theory predicts the number of coexisting phases, validated by simulations with diverse component numbers and interaction parameters. Surprisingly, this model predicts an upper bound on the number of phases, derived from dynamical considerations, that is much lower than the limit from the Gibbs phase rule, which is obtained from equilibrium thermodynamic constraints. We design ensembles that encode either linear or nonmonotonic scaling relationships between the number of components and coexisting phases, which we validate through simulation and theory. Finally, inspired by parallels in biological systems, we show that including nonequilibrium turnover of components through chemical reactions can tunably modulate the number of coexisting phases at steady state without changing overall fluid composition. Together, our study provides a model framework that describes the emergent dynamical and steady-state phase behavior of liquid-like mixtures with many interacting constituents.

6.
Proc Natl Acad Sci U S A ; 118(11)2021 Mar 16.
Article in English | MEDLINE | ID: mdl-33836557

ABSTRACT

Gene expression profiles of a cellular population, generated by single-cell RNA sequencing, contain rich information about biological state, including cell type, cell cycle phase, gene regulatory patterns, and location within the tissue of origin. A major challenge is to disentangle information about these different biological states from each other, including distinguishing from cell lineage, since the correlation of cellular expression patterns is necessarily contaminated by ancestry. Here, we use a recent advance in random matrix theory, discovered in the context of protein phylogeny, to identify differentiation or ancestry-related processes in single-cell data. Qin and Colwell [C. Qin, L. J. Colwell, Proc. Natl. Acad. Sci. U.S.A. 115, 690-695 (2018)] showed that ancestral relationships in protein sequences create a power-law signature in the covariance eigenvalue distribution. We demonstrate the existence of such signatures in scRNA-seq data and that the genes driving them are indeed related to differentiation and developmental pathways. We predict the existence of similar power-law signatures for cells along linear trajectories and demonstrate this for linearly differentiating systems. Furthermore, we generalize to show that the same signatures can arise for cells along tissue-specific spatial trajectories. We illustrate these principles in diverse tissues and organisms, including the mammalian epidermis and lung, Drosophila whole-embryo, adult Hydra, dendritic cells, the intestinal epithelium, and cells undergoing induced pluripotent stem cell (iPSC) reprogramming. We show how these results can be used to interpret the gradual dynamics of lineage structure along iPSC reprogramming. Together, we provide a framework that can be used to identify signatures of specific biological processes in single-cell data without prior knowledge and identify candidate genes associated with these processes.


Subjects
Cell Lineage , Gene Expression , Single-Cell Analysis/methods , Animals , Humans , Sequence Analysis, RNA/methods
7.
Entropy (Basel) ; 26(1)2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38248192

ABSTRACT

To estimate the degree of quantum entanglement of random pure states, it is crucial to understand the statistical behavior of entanglement indicators such as the von Neumann entropy, quantum purity, and entanglement capacity. These entanglement metrics are functions of the spectrum of density matrices, and their statistical behavior over different generic state ensembles has been intensively studied in the literature. As an alternative metric, in this work, we study the sum of the square root spectrum of density matrices, which is relevant to negativity and fidelity in quantum information processing. In particular, we derive the finite-size mean and variance formulas of the sum of the square root spectrum over the Bures-Hall ensemble, extending known results obtained recently over the Hilbert-Schmidt ensemble.

8.
BMC Bioinformatics ; 24(1): 180, 2023 May 02.
Article in English | MEDLINE | ID: mdl-37131141

ABSTRACT

BACKGROUND: Large-scale multi-ethnic DNA sequencing data is increasingly available owing to decreasing cost of modern sequencing technologies. Inference of the population structure with such sequencing data is fundamentally important. However, the ultra-dimensionality and complicated linkage disequilibrium patterns across the whole genome make it challenging to infer population structure using traditional principal component analysis based methods and software. RESULTS: We present the ERStruct Python Package, which enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, our package achieves significant improvements in the speed of matrix operations for large-scale data. Additionally, our package features adaptive data splitting capabilities to facilitate computation on GPUs with limited memory. CONCLUSION: Our Python package ERStruct is an efficient and user-friendly tool for estimating the number of top informative principal components that capture population structure from whole genome sequencing data.


Subjects
Genome , Software , Whole Genome Sequencing , Sequence Analysis/methods , Principal Component Analysis
9.
Magn Reson Med ; 89(3): 1160-1172, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36219475

ABSTRACT

PURPOSE: To develop a denoising strategy leveraging redundancy in high-dimensional data. THEORY AND METHODS: The SNR fundamentally limits the information accessible by MRI. This limitation has been addressed by a host of denoising techniques, recently including the so-called MPPCA: principal component analysis of the signal followed by automated rank estimation, exploiting the Marchenko-Pastur distribution of noise singular values. Operating on matrices comprised of data patches, this popular approach objectively identifies noise components and, ideally, allows noise to be removed without introducing artifacts such as image blurring, or nonlocal averaging. The MPPCA rank estimation, however, relies on a large number of noise singular values relative to the number of signal components to avoid such ill effects. This condition is unlikely to be met when data patches and therefore matrices are small, for example due to spatially varying noise. Here, we introduce tensor MPPCA (tMPPCA) for the purpose of denoising multidimensional data, such as from multicontrast acquisitions. Rather than combining dimensions in matrices, tMPPCA uses each dimension of the multidimensional data's inherent tensor-structure to better characterize noise, and to recursively estimate signal components. RESULTS: Relative to matrix-based MPPCA, tMPPCA requires no additional assumptions, and comparing the two in a numerical phantom and a multi-TE diffusion MRI data set, tMPPCA dramatically improves denoising performance. This is particularly true for small data patches, suggesting that tMPPCA can be especially beneficial in such cases. CONCLUSIONS: The MPPCA denoising technique can be extended to high-dimensional data with improved performance for smaller patch sizes.


Subjects
Algorithms , Magnetic Resonance Imaging , Magnetic Resonance Imaging/methods , Diffusion Magnetic Resonance Imaging/methods , Phantoms, Imaging , Principal Component Analysis , Signal-To-Noise Ratio , Brain/diagnostic imaging
10.
Biometrics ; 79(2): 891-902, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35532153

ABSTRACT

Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method originally developed for array-based genotype data for computing and selecting top principal components (PCs) that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants $p$ is much larger than the sample size $n$ in sequencing data such that the sample-to-marker ratio $n/p$ is nearly zero, violating the assumption of the Tracy-Widom test used in that method. Second, that method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative PCs based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.


Subjects
Genetics, Population , Polymorphism, Single Nucleotide , Genotype , Computer Simulation , Whole Genome Sequencing
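As a rough illustration of the test statistic mentioned in the ERStruct abstract above, the sketch below computes ratios of consecutive eigenvalues from a list sorted in decreasing order. The function names `consecutive_ratios` and `count_top_components`, the fixed `threshold`, and the example eigenvalues are all assumptions made for this illustration. The published method compares the ratios against a null distribution approximated with random matrix theory, which is not reproduced here.

```python
def consecutive_ratios(eigenvalues):
    """Return lambda_{k+1} / lambda_k for eigenvalues sorted in decreasing order.

    Assumes all eigenvalues are strictly positive.
    """
    vals = sorted(eigenvalues, reverse=True)
    return [vals[k + 1] / vals[k] for k in range(len(vals) - 1)]


def count_top_components(eigenvalues, threshold=0.5):
    """Hypothetical rule: the largest k whose ratio lambda_{k+1}/lambda_k
    falls below a fixed threshold (0 if no ratio does).

    A fixed threshold stands in for the null-distribution cutoff,
    which this sketch does not attempt to derive.
    """
    k_hat = 0
    for k, ratio in enumerate(consecutive_ratios(eigenvalues), start=1):
        if ratio < threshold:
            k_hat = k
    return k_hat


# Made-up spectrum: two large eigenvalues followed by a slowly decaying bulk.
eigs = [50.0, 20.0, 2.0, 1.9, 1.8, 1.7]
ratios = consecutive_ratios(eigs)
k_hat = count_top_components(eigs)
```

In this toy spectrum the sharp drop sits between the second and third eigenvalues, so the rule reports two informative components; ratios inside the bulk stay close to 1.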
11.
Entropy (Basel) ; 25(1)2023 Jan 04.
Article in English | MEDLINE | ID: mdl-36673250

ABSTRACT

Quantum graphs are ideally suited to studying the spectral statistics of chaotic systems. Depending on the boundary conditions at the vertices, there are Neumann and Dirichlet graphs. The latter correspond to totally disassembled graphs with a spectrum being the superposition of the spectra of the individual bonds. According to the interlacing theorem, Neumann and Dirichlet eigenvalues on average alternate as a function of the wave number, with the consequence that the Neumann spectral statistics deviate from random matrix predictions. There is, e.g., a strict upper bound for the spacing of neighboring Neumann eigenvalues given by the number of bonds (in units of the mean level spacing). Here, we present analytic expressions for the level spacing distribution and number variance of ensemble-averaged spectra of Dirichlet graphs as a function of the bond number, and compare them with numerical results. For a number of small Neumann graphs, numerical results for the same quantities are shown, and their deviations from random matrix predictions are discussed.

12.
Entropy (Basel) ; 25(8)2023 Aug 17.
Article in English | MEDLINE | ID: mdl-37628255

ABSTRACT

The high dropout rates in programming courses emphasise the need for monitoring and understanding student engagement, enabling early interventions. This activity can be supported by insights into students' learning behaviours and their relationship with academic performance, derived from student learning log data in learning management systems. However, the high dimensionality of such data, along with their numerous features, poses challenges to their analysis and interpretability. In this study, we introduce entropy-based metrics as a novel way to represent students' learning behaviours. Employing these metrics, in conjunction with a proven community detection method, we undertake an analysis of learning behaviours across higher- and lower-performing student communities. Furthermore, we examine the impact of the COVID-19 pandemic on these behaviours. The study is grounded in the analysis of empirical data from 391 Software Engineering students over three academic years. Our findings reveal that students in higher-performing communities typically tend to have lower volatility in entropy values and reach stable learning states earlier than their lower-performing counterparts. Importantly, this study provides evidence of the use of entropy as a simple yet insightful metric for educators to monitor study progress, enhance understanding of student engagement, and enable timely interventions.

13.
Entropy (Basel) ; 25(10)2023 Oct 18.
Article in English | MEDLINE | ID: mdl-37895581

ABSTRACT

This research systematically analyzes the behavior of correlations among stock prices and the eigenvalues of correlation matrices by applying random matrix theory (RMT) to the Chinese and US stock markets. Results suggest that most eigenvalues in both markets fall within the interval predicted by RMT, whereas some larger eigenvalues fall beyond the noise band and carry market information. The largest eigenvalue represents the market and is a good indicator of averaged correlations. Further, the average largest eigenvalue moves similarly to the index in both markets. The analysis shows that the fraction of eigenvalues falling beyond the predicted interval pinpoints major market switching points. It also finds that the average of the eigenvector components corresponding to the largest eigenvalue switches with the market itself. The investigation of the second largest eigenvalue and its eigenvector suggests that the Chinese market is dominated by four industries whereas the US market contains three leading industries. The study then investigates how the market structure changes before and after a market crash, revealing that the two markets behave differently: a major market structure change is observed in the Chinese market but not in the US market. The results shed new light on mining hidden information from stock market data.
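The "predicted interval" in the abstract above is commonly taken to be the Marchenko-Pastur bulk for a correlation matrix of pure noise. The sketch below computes those edges and the fraction of eigenvalues outside them. It is an illustration under stated assumptions, not the paper's pipeline: the function names, the example dimensions, and the listed eigenvalues are invented, and the formula assumes unit-variance, uncorrelated returns with more observations than assets.

```python
import math


def marchenko_pastur_bounds(n_assets, n_obs):
    """Edges of the Marchenko-Pastur bulk for the correlation matrix of
    n_assets independent unit-variance series observed n_obs times.

    Uses lambda_pm = (1 +/- sqrt(n_assets / n_obs))**2 and assumes
    n_obs >= n_assets.
    """
    q = n_assets / n_obs
    lower = (1.0 - math.sqrt(q)) ** 2
    upper = (1.0 + math.sqrt(q)) ** 2
    return lower, upper


def fraction_outside(eigenvalues, n_assets, n_obs):
    """Share of eigenvalues lying outside the noise interval."""
    lower, upper = marchenko_pastur_bounds(n_assets, n_obs)
    outside = [lam for lam in eigenvalues if lam < lower or lam > upper]
    return len(outside) / len(eigenvalues)


# Hypothetical setting: 100 stocks, 400 trading days.
lower, upper = marchenko_pastur_bounds(n_assets=100, n_obs=400)

# Made-up eigenvalues: three inside the bulk, one large "market" eigenvalue.
share = fraction_outside([0.5, 1.0, 2.0, 30.0], n_assets=100, n_obs=400)
```

Eigenvalues above the upper edge are the candidates for carrying market or sector information; tracking `share` over rolling windows would be one way to follow the fraction described in the abstract.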

14.
Entropy (Basel) ; 25(6)2023 May 29.
Article in English | MEDLINE | ID: mdl-37372212

ABSTRACT

The Dyson index, β, plays an essential role in random matrix theory, as it labels the so-called "three-fold way" that refers to the symmetries satisfied by ensembles under unitary transformations. As is known, its 1, 2, and 4 values denote the orthogonal, unitary, and symplectic classes, whose matrix elements are real, complex, and quaternion numbers, respectively. It functions, therefore, as a measure of the number of independent non-diagonal variables. On the other hand, in the case of β ensembles, which represent the tridiagonal form of the theory, it can assume any real positive value, thus losing that function. Our purpose, however, is to show that, when the Hermitian condition of the real matrices generated with a given value of β is removed, and, as a consequence, the number of non-diagonal independent variables doubles, non-Hermitian matrices exist that asymptotically behave as if they had been generated with a value 2β. Therefore, it is as if the β index were, in this way, again operative. It is shown that this effect happens for the three tridiagonal ensembles, namely, the β-Hermite, the β-Laguerre, and the β-Jacobi ensembles.

15.
Entropy (Basel) ; 25(5)2023 May 04.
Article in English | MEDLINE | ID: mdl-37238506

ABSTRACT

Electronic structure theory describes the properties of solids using Bloch states that correspond to highly symmetrical nuclear configurations. However, nuclear thermal motion destroys translation symmetry. Here, we describe two approaches relevant to the time evolution of electronic states in the presence of thermal fluctuations. On the one hand, the direct solution of the time-dependent Schrödinger equation for a tight-binding model reveals the diabatic nature of time evolution. On the other hand, because of random nuclear configurations, the electronic Hamiltonian falls into the class of random matrices, which have universal features in their energy spectra. In the end, we discuss combining the two approaches to obtain new insights into the influence of thermal fluctuations on electronic states.

16.
Phys Biol ; 19(5)2022 Jul 13.
Article in English | MEDLINE | ID: mdl-35172289

ABSTRACT

We develop a theory for thermodynamic instabilities of complex fluids composed of many interacting chemical species organised in families. This model includes partially structured and partially random interactions and can be solved exactly using tools from random matrix theory. The model exhibits three kinds of fluid instabilities: one in which the species form a condensate with a local density that depends on their family (family condensation); one in which species demix in two phases depending on their family (family demixing); and one in which species demix in a random manner irrespective of their family (random demixing). We determine the critical spinodal density of the three types of instabilities and find that the critical spinodal density is finite for both family condensation and family demixing, while for random demixing the critical spinodal density grows as the square root of the number of species. We use the developed framework to describe phase-separation instability of the cytoplasm induced by a change in pH.


Subjects
Thermodynamics , Humans
17.
Ann Stat ; 50(2): 949-986, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36120512

ABSTRACT

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
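To make the minimum-norm ("ridgeless") interpolator in the abstract above concrete, the following sketch solves an overparametrized least squares problem with the Moore-Penrose pseudoinverse and compares it with another interpolating solution. The dimensions, the random seed, and all variable names are assumptions chosen for illustration, and plain Gaussian features stand in for the paper's feature models.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # overparametrized regime: p > n
X = rng.standard_normal((n, p))    # illustrative i.i.d. Gaussian features
y = rng.standard_normal(n)

# Minimum l2-norm solution of X @ beta = y via the pseudoinverse.
X_pinv = np.linalg.pinv(X)
beta_min = X_pinv @ y

# Any other interpolator differs from beta_min by a vector in the null
# space of X. Build one by removing the row-space part of a random vector.
v = rng.standard_normal(p)
null_component = v - X_pinv @ (X @ v)
beta_other = beta_min + null_component

train_residual = np.linalg.norm(X @ beta_min - y)
other_residual = np.linalg.norm(X @ beta_other - y)
```

Both `beta_min` and `beta_other` fit the training data exactly (up to floating-point error), but `beta_min` has the smaller Euclidean norm because it is orthogonal to the null-space component added to form `beta_other`.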

18.
Proc Natl Acad Sci U S A ; 116(9): 3373-3378, 2019 Feb 26.
Article in English | MEDLINE | ID: mdl-30808733

ABSTRACT

Predicting ligand biological activity is a key challenge in drug discovery. Ligand-based statistical approaches are often hampered by noise due to undersampling: The number of molecules known to be active or inactive is vastly less than the number of possible chemical features that might determine binding. We derive a statistical framework inspired by random matrix theory and combine the framework with high-quality negative data to discover important chemical differences between active and inactive molecules by disentangling undersampling noise. Our model outperforms standard benchmarks when tested against a set of challenging retrospective tests. We prospectively apply our model to the human muscarinic acetylcholine receptor M1, finding four experimentally confirmed agonists that are chemically dissimilar to all known ligands. The hit rate of our model is significantly higher than the state of the art. Our model can be interpreted and visualized to offer chemical insights about the molecular motifs that are synergistic or antagonistic to M1 agonism, which we have prospectively experimentally verified.


Subjects
Drug Discovery/statistics & numerical data , Models, Statistical , Muscarinic Antagonists/chemistry , Receptors, Muscarinic/chemistry , Humans , Ligands , Muscarinic Antagonists/therapeutic use , Receptors, Muscarinic/drug effects
19.
Ann Appl Probab ; 32(4): 2967-3003, 2022 Aug.
Article in English | MEDLINE | ID: mdl-36034074

ABSTRACT

We study the sample covariance matrix for real-valued data with general population covariance, as well as MANOVA-type covariance estimators in variance components models under null hypotheses of global sphericity. In the limit as matrix dimensions increase proportionally, the asymptotic spectra of such estimators may have multiple disjoint intervals of support, possibly intersecting the negative half line. We show that the distribution of the extremal eigenvalue at each regular edge of the support has a GOE Tracy-Widom limit. Our proof extends a comparison argument of Ji Oon Lee and Kevin Schnelli, replacing a continuous Green function flow by a discrete Lindeberg swapping scheme.

20.
IEEE Trans Inf Theory ; 67(12): 8154-8189, 2021 Dec.
Article in English | MEDLINE | ID: mdl-35695837

ABSTRACT

In our "big data" age, the size and complexity of data are steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random iid projections, random sampling, subsampled Hadamard transform, CountSketch, etc.) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are slightly more accurate, (c) randomization does not hurt too much, due to concentration of measure, (d) CountSketch can be somewhat improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.
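The "sketch and solve" recipe in the abstract above (random projection first, PCA afterwards) can be sketched in a few lines. This is a minimal illustration under assumed settings, not the authors' experiments: it uses a Gaussian i.i.d. sketch, a single-spike data model with mean-zero data (so no centering step), and arbitrary sizes and signal strength.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 2000, 50, 400            # samples, features, sketch size (m < n)

# Spiked "signal-plus-noise" data: unit noise plus one strong direction u.
u = np.zeros(p)
u[0] = 1.0
X = rng.standard_normal((n, p)) + 5.0 * rng.standard_normal((n, 1)) * u

# Data-oblivious step: compress the n samples to m rows with a Gaussian sketch.
S = rng.standard_normal((m, n)) / np.sqrt(m)
X_sketch = S @ X

# Data-aware step: PCA of the sketched data via its singular value decomposition.
_, _, vt = np.linalg.svd(X_sketch, full_matrices=False)
top_direction = vt[0]

# Absolute cosine between the estimated and the true signal direction.
alignment = abs(float(top_direction @ u))
```

With a signal this far above the noise level, `alignment` should be close to 1; weakening the spike or shrinking `m` is one way to explore the loss of signal strength under projection that the abstract describes.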
