Results 1 - 20 of 21
1.
Proc Natl Acad Sci U S A ; 120(1): e2214972120, 2023 01 03.
Article in English | MEDLINE | ID: mdl-36580592

ABSTRACT

Regression learning is one of the long-standing problems in statistics, machine learning, and deep learning (DL). We show that writing this problem as a probabilistic expectation over (unknown) feature probabilities (thus increasing the number of unknown parameters and seemingly making the problem more complex) actually leads to its simplification and allows incorporating the physical principle of entropy maximization. It helps decompose a very general setting of this learning problem (including discretization, feature selection, and learning multiple piecewise-linear regressions) into an iterative sequence of simple substeps, which are either analytically solvable or cheaply computable through an efficient second-order numerical solver with sublinear cost scaling. This leads to the computationally cheap and robust non-DL second-order Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, which can be efficiently applied to problems with millions of feature dimensions on a commodity laptop, where the state-of-the-art learning tools would require supercomputers. SPARTAn is compared to a range of commonly used regression learning tools on synthetic problems and on the prediction of the El Niño Southern Oscillation, the dominant interannual mode of tropical climate variability. The obtained SPARTAn learners provide more predictive, sparse, and physically explainable data descriptions, clearly discerning the important role of ocean temperature variability at the thermocline in the equatorial Pacific. SPARTAn provides an easily interpretable description of the timescales by which these thermocline temperature features evolve and eventually express at the surface, thereby enabling enhanced predictability of the key drivers of the interannual climate.


Subjects
El Niño Southern Oscillation, Tropical Climate, Entropy, Temperature, Algorithms
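The decomposition into analytically solvable substeps described in the abstract can be illustrated with a bare-bones alternating fit of multiple local linear regressions. This is a toy sketch with my own function and variable names; the actual SPARTAn algorithm additionally performs entropic regularization, discretization, and feature sparsification, none of which are modeled here.

```python
import numpy as np

def piecewise_linear_fit(X, y, W_init, n_iter=30):
    """Alternate two analytically solvable substeps: (1) assign each point
    to the local linear model with the smallest squared residual, then
    (2) refit each model by ordinary least squares on its points."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])       # affine models: slope + intercept
    W = W_init.copy()
    z = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        resid = (Xb @ W.T - y[:, None]) ** 2   # n x K squared residuals
        z = resid.argmin(axis=1)               # assignment substep (analytic)
        for k in range(W.shape[0]):            # regression substep (analytic)
            if np.any(z == k):
                W[k], *_ = np.linalg.lstsq(Xb[z == k], y[z == k], rcond=None)
    return W, z

# Two-piece data y = |x| is recovered exactly by two affine models.
x = np.linspace(-1.0, 1.0, 41)
X, y = x[:, None], np.abs(x)
W_init = np.array([[0.5, 0.0], [-0.5, 0.0]])   # rough initial slopes
W, z = piecewise_linear_fit(X, y, W_init)
```

Each substep decreases the total residual, which is the same monotonic-descent structure the abstract invokes for the more general probabilistic setting.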
2.
Proc Natl Acad Sci U S A ; 119(9)2022 03 01.
Article in English | MEDLINE | ID: mdl-35197293

ABSTRACT

Entropic outlier sparsification (EOS) is proposed as a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS builds on the derived analytic solution of the (weighted) expected loss minimization problem subject to Shannon entropy regularization. The identified closed-form solution is proven to impose additional costs that depend linearly on the statistics size and are independent of the data dimension. The obtained analytic results also explain why mixtures of spherically symmetric Gaussians, used heuristically in many popular data analysis algorithms, represent an optimal and least-biased choice for the nonparametric probability distributions when working with squared Euclidean distances. The performance of EOS is compared to a range of commonly used tools on synthetic problems and on partially mislabeled supervised classification problems from biomedicine. Applying EOS for coinference of data anomalies during learning is shown to allow reaching an accuracy of [Formula: see text] when predicting patient mortality after heart failure, statistically significantly outperforming the predictive performance of common learning tools on the same data.
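The closed-form solution referred to here has, in structure, the form of a Gibbs/softmax reweighting of the per-sample losses. The sketch below (my own naming and choice of regularization weight) shows why the extra cost is linear in the statistics size T and independent of the data dimension: it is a single pass over the T loss values.

```python
import numpy as np

def entropic_weights(losses, eps):
    """Minimize sum_i w_i * l_i + eps * sum_i w_i * log(w_i) over the
    probability simplex; the analytic minimizer is a Gibbs distribution
    w_i proportional to exp(-l_i / eps) - one pass over the T losses."""
    z = -np.asarray(losses, dtype=float) / eps
    z -= z.max()                                # numerical stabilization
    w = np.exp(z)
    return w / w.sum()

# A point with an anomalously large loss is exponentially down-weighted:
losses = np.array([0.10, 0.20, 0.15, 5.0])     # last instance is an outlier
w = entropic_weights(losses, eps=0.5)
```

The regularization weight eps controls how aggressively anomalies are sparsified: for large eps the weights approach uniform, for small eps all mass concentrates on the smallest-loss points.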

3.
Neural Comput ; 36(6): 1198-1227, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38669692

ABSTRACT

Small data learning problems are characterized by a significant discrepancy between the limited number of response variable observations and the large feature space dimension. In this setting, the common learning tools struggle to identify the features important for the classification task from those that bear no relevant information and cannot derive an appropriate learning rule that allows discriminating among different classes. As a potential solution to this problem, here we exploit the idea of reducing and rotating the feature space in a lower-dimensional gauge and propose the gauge-optimal approximate learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation, and classification problems for small data learning problems. We prove that the optimal solution of the GOAL algorithm consists of piecewise-linear functions in the Euclidean space and that it can be approximated through a monotonically convergent algorithm that, under the assumption of a discrete segmentation of the feature space, presents a closed-form solution for each optimization substep and an overall linear iteration cost scaling. The GOAL algorithm has been compared to other state-of-the-art machine learning tools on both synthetic data and challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Niño Southern Oscillation and inference of epigenetically induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems in both learning performance and computational cost.

4.
Neural Comput ; 34(5): 1220-1255, 2022 04 15.
Article in English | MEDLINE | ID: mdl-35344997

ABSTRACT

Classification problems in the small data regime (with a small data statistics size T and a relatively large feature space dimension D) pose challenges for the common machine learning (ML) and deep learning (DL) tools. The standard learning methods from these areas tend to show a lack of robustness when applied to data sets with significantly fewer data points than dimensions and quickly reach the overfitting bound, thus leading to poor performance beyond the training set. To tackle this issue, we propose eSPA+, a significant extension of the recently formulated entropy-optimal scalable probabilistic approximation algorithm (eSPA). Specifically, we propose to change the order of the optimization steps and replace the most computationally expensive subproblem of eSPA with its closed-form solution. We prove that with these two enhancements, eSPA+ moves from the polynomial to the linear class of complexity scaling algorithms. On several small data learning benchmarks, we show that the eSPA+ algorithm achieves a many-fold speed-up with respect to eSPA and even better performance results when compared to a wide array of ML and DL tools. In particular, we benchmark eSPA+ against the standard eSPA and the main classes of common learning algorithms in the small data regime: various forms of support vector machines, random forests, and long short-term memory algorithms. In all the considered applications, the common learning methods and eSPA are markedly outperformed by eSPA+, which achieves significantly higher prediction accuracy with an orders-of-magnitude lower computational cost.


Subjects
Algorithms, Machine Learning, Entropy, Support Vector Machine
5.
Neural Comput ; 32(8): 1563-1579, 2020 08.
Article in English | MEDLINE | ID: mdl-32521216

ABSTRACT

Overfitting and treatment of small data are among the most challenging problems in machine learning (ML), when a relatively small data statistics size T is not enough to provide a robust ML fit for a relatively large data feature dimension D. Deploying a massively parallel ML analysis of generic classification problems for different D and T, we demonstrate the existence of statistically significant linear overfitting barriers for common ML methods. The results reveal that for a robust classification of bioinformatics-motivated generic problems with the long short-term memory deep learning classifier (LSTM), one needs in the best case a statistics T that is at least 13.8 times larger than the feature dimension D. We show that this overfitting barrier can be breached at a 10^-12 fraction of the computational cost by means of the entropy-optimal scalable probabilistic approximations algorithm (eSPA), performing a joint solution of the entropy-optimal Bayesian network inference and feature space segmentation problems. Application of eSPA to experimental single cell RNA sequencing data exhibits a 30-fold classification performance boost when compared to standard bioinformatics tools and a 7-fold boost when compared to the deep learning LSTM classifier.

6.
Proc Natl Acad Sci U S A ; 114(19): 4863-4868, 2017 05 09.
Article in English | MEDLINE | ID: mdl-28432182

ABSTRACT

The applicability of many computational approaches hinges on the identification of reduced models defined on a small set of collective variables (colvars). A methodology for scalable, probability-preserving identification of reduced models and colvars directly from the data is derived that does not rely on the availability of the full relation matrices at any stage of the resulting algorithm, allows for a robust quantification of reduced-model uncertainty, and makes it possible to impose a priori available physical information. We show two applications of the methodology: (i) obtaining a reduced dynamical model for polypeptide dynamics in water and (ii) identifying diagnostic rules from a standard breast cancer dataset. For the first example, we show that the obtained reduced dynamical model can reproduce the full statistics of spatial molecular configurations, opening possibilities for robust dimension and model reduction in molecular dynamics. For the breast cancer data, this methodology identifies a very simple diagnostic rule, free of any tuning parameters, that exhibits the same performance quality as state-of-the-art machine-learning applications with multiple tuning parameters reported for this problem.

7.
Proc Natl Acad Sci U S A ; 111(41): 14651-6, 2014 Oct 14.
Article in English | MEDLINE | ID: mdl-25267630

ABSTRACT

Discrete state models are a common modeling tool in many areas. For example, Markov state models, a particular representative of this model family, have become one of the major instruments for the analysis and understanding of processes in molecular dynamics (MD). Here we extend the scope of discrete state models to the case of systematically missing scales, resulting in a nonstationary and nonhomogeneous formulation of the inference problem. We demonstrate how the recently developed tools of nonstationary data analysis and information theory can be used to identify discrete state models that are simultaneously most optimal (in terms of describing the given data) and most simple (in terms of complexity and causality). We apply the resulting formalism to a problem from molecular dynamics and show how the results can be used to understand spatial and temporal causality information beyond the usual assumptions. We demonstrate that the most optimal explanation for the appropriately discretized/coarse-grained MD torsion angle data in a polypeptide is given by causality that is localized both in time and in space, opening new possibilities for deploying percolation theory and stochastic subgrid-scale modeling approaches in the area of MD.

8.
J Imaging ; 8(6)2022 May 31.
Article in English | MEDLINE | ID: mdl-35735955

ABSTRACT

We propose a pipeline for the synthetic generation of personalized Computed Tomography (CT) images, with a radiation exposure evaluation and a lifetime attributable risk (LAR) assessment. We perform a patient-specific performance evaluation for a broad range of denoising algorithms (including the most popular deep learning denoising approaches, wavelet-based methods, methods based on Mumford-Shah denoising, etc.), focusing both on assessing the capability to reduce the patient-specific CT-induced LAR and on computational cost scalability. We introduce a parallel Probabilistic Mumford-Shah denoising model (PMS) and show that it markedly outperforms the compared common denoising methods in denoising quality and cost scaling. In particular, we show that it allows an approximately 22-fold robust patient-specific LAR reduction for infants and a 10-fold LAR reduction for adults. Using a normal laptop, the proposed algorithm for PMS allows cheap and robust (with a multiscale structural similarity index >90%) denoising of very large 2D videos and 3D images (with over 10^7 voxels) that are subject to ultra-strong noise (Gaussian and non-Gaussian) at signal-to-noise ratios far below 1.0. The code is provided for open access.

9.
Cells ; 11(21)2022 10 27.
Article in English | MEDLINE | ID: mdl-36359800

ABSTRACT

Upon chronic stress, a fraction of individuals shows stress resilience, which can prevent long-term mental dysfunction. The underlying molecular mechanisms are complex and have not yet been fully understood. In this study, we performed a data-driven behavioural stratification together with single-cell transcriptomics of the hippocampus in a mouse model of chronic social defeat stress. Our work revealed that in a sub-group exhibiting molecular responses upon chronic stress, the dorsal hippocampus is particularly involved in neuroimmune responses, angiogenesis, myelination, and neurogenesis, thereby enabling brain restoration and homeostasis after chronic stress. Based on these molecular insights, we applied rapamycin after the stress as a proof-of-concept pharmacological intervention and were able to substantially increase stress resilience. Our findings serve as a data resource and can open new avenues for further understanding of molecular processes underlying stress response and for targeted interventions supporting resilience.


Subjects
Social Defeat, Psychological Stress, Mice, Male, Animals, Hippocampus, Neurogenesis, Animal Disease Models
10.
iScience ; 24(3): 102171, 2021 Mar 19.
Article in English | MEDLINE | ID: mdl-33665584

ABSTRACT

With the development of machine learning in recent years, it has become possible to glean much more information from an experimental data set to study matter. In this perspective, we discuss some state-of-the-art data-driven tools for analyzing latent effects in data and explain their applicability in the natural sciences, focusing on two recently introduced, physics-motivated, computationally cheap tools: latent entropy and latent dimension. We exemplify their capabilities by applying them to several examples in the natural sciences and show that they reveal so far unobserved features, such as a gradient in a magnetic measurement and a latent network of glymphatic channels in mouse brain microscopy data. What sets these techniques apart is the relaxation of the restrictive assumptions typical of many machine learning models in favor of aspects that best fit the dynamical systems at hand.

11.
Front Artif Intell ; 4: 739432, 2021.
Article in English | MEDLINE | ID: mdl-35072059

ABSTRACT

Mislabeling of cases as well as controls in case-control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for a consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies of late-onset conditions, where individuals who may later convert to cases may populate the control group, and in screening studies that often have high false-positive/false-negative rates. To address this problem, we propose a method for the simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks that does not require any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (the rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with published purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that is strongly associated with a healthy-ageing phenotype in Caucasians. It determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.

12.
Sci Rep ; 11(1): 15066, 2021 07 29.
Article in English | MEDLINE | ID: mdl-34326363

ABSTRACT

How information in the nervous system is encoded by patterns of action potentials (i.e. spikes) remains an open question. Multi-neuron patterns of single spikes are a prime candidate for spike time encoding but their temporal variability requires further characterisation. Here we show how known sources of spike count variability affect stimulus-evoked spike time patterns between neurons separated over multiple layers and columns of adult rat somatosensory cortex in vivo. On subsets of trials (clusters) and after controlling for stimulus-response adaptation, spike time differences between pairs of neurons are "time-warped" (compressed/stretched) by trial-to-trial changes in shared excitability, explaining why fixed spike time patterns and noise correlations are seldom reported. We show that predicted cortical state is correlated between groups of 4 neurons, introducing the possibility of spike time pattern modulation by population-wide trial-to-trial changes in excitability (i.e. cortical state). Under the assumption of state-dependent coding, we propose an improved potential encoding capacity.


Subjects
Nervous System Physiological Phenomena, Nervous System, Neurons/physiology, Visual Cortex/physiology, Action Potentials/physiology, Animals, Humans, Neurological Models, Rats, Somatosensory Cortex/physiology
13.
Sci Rep ; 8(1): 1796, 2018 01 29.
Article in English | MEDLINE | ID: mdl-29379123

ABSTRACT

The Markovian invariant measure is a central concept in many disciplines. Conventional numerical techniques for the data-driven computation of invariant measures rely on the estimation and further numerical processing of a transition matrix. Here we show how the quality of the data-driven estimation of a transition matrix crucially depends on the validity of the statistical independence assumption for transition probabilities. Moreover, the cost of the invariant measure computation in general scales cubically with the dimension and is usually infeasible for realistic high-dimensional systems. We introduce a method that relaxes the independence assumption for transition probabilities and scales quadratically in situations with latent variables. Applications of the method are illustrated on the Lorenz-63 system and on molecular dynamics (MD) simulation data for the α-synuclein protein. We demonstrate that the conventional methodologies do not provide good estimates of the invariant measure based on the available α-synuclein MD data. Applying the introduced approach to these MD data, we detect two robust metastable states of α-synuclein and a linear transition between them, involving transient formation of secondary structure, qualitatively consistent with previous purely experimental reports.
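For reference, the conventional baseline that this abstract says is improved upon: estimate a row-stochastic transition matrix and extract its invariant measure as the leading left eigenvector. The dense eigenproblem below is exactly the step whose cost scales cubically with the number of states (function and variable names are mine).

```python
import numpy as np

def invariant_measure(P):
    """Stationary distribution pi of a row-stochastic transition matrix P,
    i.e. pi P = pi, obtained as the left eigenvector for eigenvalue 1.
    The dense eigendecomposition costs O(n^3) in the number of states."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = np.abs(pi)                     # fix the eigenvector's arbitrary sign
    return pi / pi.sum()

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = invariant_measure(P)
```

For this 2x2 example the balance condition 0.1*pi[0] = 0.2*pi[1] gives pi = [2/3, 1/3], which the eigenvector computation reproduces.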

14.
Phys Rev E Stat Nonlin Soft Matter Phys ; 76(6 Pt 2): 066702, 2007 Dec.
Article in English | MEDLINE | ID: mdl-18233938

ABSTRACT

Markov jump processes can be used to model the effective dynamics of observables in applications ranging from molecular dynamics to finance. In this paper we present a method that allows the inverse modeling of Markov jump processes based on incomplete observations in time: we consider the case of a given time series of the discretely observed jump process. We show how to compute efficiently the maximum likelihood estimator of its infinitesimal generator and demonstrate in detail that the method can handle observations that are nonequidistant in time. The method is based on the work of Bladt and Sørensen [J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67, 395 (2005)] but scales much more favorably with the length of the time series and with the dimension and size of the state space of the jump process. We illustrate its performance on a toy problem as well as on data arising from simulations of the biochemical kinetics of a genetic toggle switch.
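A quick way to see the estimation target: for a trajectory observed on a regular grid of width dt, the transition matrix satisfies P = exp(Q dt), so a naive baseline inverts this via the matrix logarithm. The sketch below (my own naming, toy generator) shows only this simple equidistant case; a likelihood-based estimator of the kind the paper computes is what makes nonequidistant observations tractable and guarantees a valid generator.

```python
import numpy as np
from scipy.linalg import expm, logm

def generator_from_snapshots(P_hat, dt):
    """Naive generator estimate from a transition matrix at lag dt:
    Q = log(P)/dt via the principal matrix logarithm. A baseline only;
    for noisy P_hat the result need not be a valid generator."""
    return np.real(logm(P_hat)) / dt

# Toy two-state generator: rows sum to zero, off-diagonals are jump rates.
Q_true = np.array([[-1.0,  1.0],
                   [ 2.0, -2.0]])
dt = 0.1
P = expm(Q_true * dt)                    # exact transition matrix at lag dt
Q_est = generator_from_snapshots(P, dt)  # recovers Q_true for exact P
```

With an empirically estimated P_hat the matrix logarithm can produce negative off-diagonal entries, which is one reason a proper maximum likelihood formulation is preferable.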

15.
Phys Rev E Stat Nonlin Soft Matter Phys ; 76(1 Pt 2): 016706, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17677593

ABSTRACT

The generalized Langevin equation is useful for modeling a wide range of physical processes. Unfortunately, its parameters, especially the memory function, are difficult to determine for nontrivial processes. We establish relations between a time-discrete generalized Langevin model and discrete multivariate autoregressive (AR) or autoregressive moving-average (ARMA) models. This allows a wide range of discrete linear methods known from time series analysis to be applied. In particular, the determination of the memory function via the order of the respective AR or ARMA model is addressed. The method is illustrated on a one-dimensional test system and subsequently applied to the molecular dynamics time series of a biomolecule that exhibits an interesting relationship between the solvent method used, the respective molecular conformation, and the depth of the memory.
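The AR side of this correspondence is easy to sketch: fit x_t as a linear combination of its last p values by least squares, with the order p playing the role of the memory depth of the discrete generalized Langevin model. The coefficients and data below are my own synthetic example, not from the paper.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model x_t = sum_k a_k x_{t-k} + noise."""
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# Synthetic stable AR(2) process with known coefficients:
rng = np.random.default_rng(0)
a_true = np.array([0.5, -0.3])
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = a_true[0] * x[t - 1] + a_true[1] * x[t - 2] + 0.1 * rng.standard_normal()
a_est = fit_ar(x, p=2)                 # close to a_true for long series
```

In practice the memory depth is chosen by increasing p until an order-selection criterion (e.g. residual variance or AIC) stops improving, which is the "order of the AR model" determination the abstract refers to.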

16.
Sci Adv ; 1(7): e1500163, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26601225

ABSTRACT

Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of noninvasive unsupervised brain signal classification. This task faces several challenges, such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of noninvasive brain-computer interfaces.
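One computationally cheap way to impose graph information on a clustering method is to add a penalty to the k-means assignment cost for every graph neighbour placed in a different cluster. This is a generic sketch in that spirit, not the paper's exact functional; the data, adjacency, and parameter values are illustrative.

```python
import numpy as np

def graph_kmeans(X, A, K, lam, init_idx, n_iter=30):
    """k-means with a graph penalty: point i pays lam for every neighbour
    (edge in adjacency matrix A) currently assigned to a different cluster."""
    C = X[list(init_idx)].astype(float).copy()   # initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        for k in range(K):
            in_k = (labels == k).astype(float)
            dist[:, k] += lam * (A.sum(axis=1) - A @ in_k)  # neighbours not in k
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(axis=0)
    return labels

# Two well-separated point groups, each internally chained in the graph:
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0],
              [5.0, 0], [5.1, 0], [5.2, 0], [5.3, 0]])
A = np.zeros((8, 8))
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]:
    A[i, j] = A[j, i] = 1.0
labels = graph_kmeans(X, A, K=2, lam=0.5, init_idx=(0, 4))
```

The penalty term costs one sparse matrix-vector product per cluster, which is what keeps the graph regularization cheap relative to the base clustering method.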

17.
J Comput Chem ; 28(15): 2453-64, 2007 Nov 30.
Article in English | MEDLINE | ID: mdl-17680553

ABSTRACT

We present a novel method for the identification of the most important conformations of a biomolecular system from molecular dynamics or Metropolis Monte Carlo time series by means of Hidden Markov Models (HMMs). We show that identification is possible based on the observation sequences of some essential torsion or backbone angles. In particular, the method still provides good results even if the conformations have a strong overlap in these angles. To apply HMMs to angular data, we use von Mises output distributions. The performance of the resulting method is illustrated by numerical tests and by application to a hybrid Monte Carlo time series of trialanine and to MD simulation results for a DNA oligomer.


Subjects
Markov Chains, Molecular Models, Alanine/chemistry, Algorithms, DNA/chemistry, Molecular Conformation, Monte Carlo Method, Peptides/chemistry, Pliability, Probability
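For readers unfamiliar with the emission model: the von Mises distribution is the circular analogue of a Gaussian, which is what makes it suitable as an HMM output density for torsion angles. A minimal sketch of its log-density follows; the parameter values are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import i0          # modified Bessel function I_0

def von_mises_logpdf(theta, mu, kappa):
    """Log-density of the von Mises distribution on the circle:
    p(theta) = exp(kappa * cos(theta - mu)) / (2*pi*I_0(kappa)).
    mu is the mean angle; kappa acts as an inverse-variance concentration."""
    return kappa * np.cos(theta - mu) - np.log(2 * np.pi * i0(kappa))

# The density integrates to one over a full turn and peaks at mu:
theta = np.linspace(-np.pi, np.pi, 10001)
pdf = np.exp(von_mises_logpdf(theta, mu=0.8, kappa=4.0))
total = ((pdf[:-1] + pdf[1:]) * np.diff(theta)).sum() / 2   # trapezoid rule
```

Unlike a Gaussian wrapped naively onto angles, this density is periodic, so an HMM emission at theta = pi and theta = -pi is treated consistently.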
18.
J Chem Phys ; 126(15): 155102, 2007 Apr 21.
Article in English | MEDLINE | ID: mdl-17461666

ABSTRACT

Molecular dynamics simulation generates large quantities of data that must be interpreted using physically meaningful analysis. A common approach is to describe the system dynamics in terms of transitions between coarse partitions of conformational space. In contrast to previous work that partitions the space according to geometric proximity, the authors examine here clustering based on kinetics, merging configurational microstates so as to identify long-lived, i.e., dynamically metastable, states. As test systems, microsecond molecular dynamics simulations of the polyalanines Ala(8) and Ala(12) are analyzed. Both systems clearly exhibit metastability, with some kinetically distinct metastable states being geometrically very similar. Using the backbone torsion rotamer pattern to define the microstates, a definition is obtained of metastable states whose lifetimes considerably exceed the memory associated with interstate dynamics, thus allowing the kinetics to be described by a Markov model. This model is shown to be valid by comparison of its predictions with the kinetics obtained directly from the molecular dynamics simulations. In contrast, clustering based on the hydrogen-bonding pattern fails to identify long-lived metastable states or a reliable Markov model. Finally, an approach is proposed to generate a hierarchical model of networks, each having a different number of metastable states. The model hierarchy yields a qualitative understanding of the multiple time and length scales in the dynamics of biomolecules.


Subjects
Algorithms, Biopolymers/chemistry, Macromolecular Substances/chemistry, Chemical Models, Molecular Models, Computer Simulation, Molecular Conformation, Phase Transition
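The Markov-model validation mentioned in the abstract is commonly done with a Chapman-Kolmogorov-style test: a transition matrix estimated at lag tau, propagated n times, should reproduce the matrix estimated directly at lag n*tau. A minimal sketch on a synthetic two-state trajectory (my own parameters, standing in for the MD microstate trajectory):

```python
import numpy as np

def transition_matrix(traj, n_states, lag):
    """Row-normalized count matrix of transitions at the given lag time."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (traj[:-lag], traj[lag:]), 1.0)
    return C / C.sum(axis=1, keepdims=True)

# Synthetic two-state Markov trajectory with known transition matrix:
rng = np.random.default_rng(1)
P_true = np.array([[0.95, 0.05],
                   [0.10, 0.90]])
u = rng.random(200_000)
traj = np.empty(200_001, dtype=int)
traj[0] = 0
for t in range(200_000):
    traj[t + 1] = u[t] < P_true[traj[t], 1]    # sample the next state

T1 = transition_matrix(traj, 2, lag=1)
T5 = transition_matrix(traj, 2, lag=5)
# Chapman-Kolmogorov consistency: T1^5 should match T5 for Markovian data.
ck_error = np.abs(np.linalg.matrix_power(T1, 5) - T5).max()
```

For genuinely metastable MD data this test passes only when the microstates are lumped so that interstate memory has decayed, which is the lifetime condition stated in the abstract.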
19.
J Comput Chem ; 26(9): 941-8, 2005 Jul 15.
Article in English | MEDLINE | ID: mdl-15858828

ABSTRACT

We present a unified approach to linear and nonlinear sensitivity analysis for models of reaction kinetics that are stated in terms of systems of ordinary differential equations (ODEs). The approach is based on the reformulation of the ODE problem as a density transport problem described by a Fokker-Planck equation. The resulting multidimensional partial differential equation is solved here by extending the TRAIL algorithm originally introduced by Horenko and Weiser in the context of molecular dynamics (J. Comp. Chem. 2003, 24, 1921), and the method is discussed in comparison with Monte Carlo techniques. The extended TRAIL approach is fully adaptive and makes it easy to study the influence of nonlinear dynamical effects. We illustrate the scheme by applying it to an enzyme-substrate model problem, performing sensitivity analysis with respect to initial concentrations and parameter values.


Subjects
Algorithms, Theoretical Models, Monte Carlo Method, Enzymes, Kinetics, Substrate Specificity
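The simplest form of the linear sensitivity being computed can be sketched by differentiating a toy enzyme-substrate ODE output with respect to a rate constant by finite differences. The paper's TRAIL approach instead propagates densities through the associated Fokker-Planck equation; the toy model, integrator, and parameter values below are mine.

```python
import numpy as np

def substrate_decay(s0, k, t_end=1.0, dt=1e-3):
    """Toy Michaelis-Menten-type kinetics ds/dt = -k*s/(1+s), integrated
    with forward Euler (a crude stand-in for the adaptive TRAIL scheme)."""
    s = s0
    for _ in range(int(t_end / dt)):
        s += dt * (-k * s / (1.0 + s))
    return s

# Linear sensitivity of the final concentration w.r.t. the rate constant k,
# approximated by central finite differences:
k, h = 2.0, 1e-4
sens = (substrate_decay(1.0, k + h) - substrate_decay(1.0, k - h)) / (2 * h)
```

The sensitivity is negative, as expected: a larger rate constant depletes the substrate faster, so less remains at the final time.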
20.
J Comput Chem ; 24(15): 1921-9, 2003 Nov 30.
Article in English | MEDLINE | ID: mdl-14515374

ABSTRACT

This article presents a particle method framework for simulating molecular dynamics. For time integration, the implicit trapezoidal rule is employed, where an explicit predictor enables large time steps. Error estimators for both the temporal and spatial discretizations are advocated and facilitate a fully adaptive propagation. The framework is developed and exemplified in the context of the classical Liouville equation, where Gaussian phase-space packets are used as particles. Simplified variants are discussed briefly. The concept is illustrated by numerical examples for one-dimensional dynamics in a double-well potential.
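The time-stepping scheme described (an explicit predictor followed by an implicit trapezoidal corrector) can be sketched for a single particle in the double-well example. The fixed-point corrector below stands in for a Newton-type solve, and all parameter values are illustrative.

```python
import numpy as np

def trapezoidal_step(x, v, force, dt, n_corr=5):
    """One implicit trapezoidal step for x' = v, v' = F(x): an explicit
    (forward Euler) predictor, then fixed-point corrector sweeps toward
    x1 = x + dt/2*(v + v1), v1 = v + dt/2*(F(x) + F(x1))."""
    x1 = x + dt * v                     # explicit predictor
    v1 = v + dt * force(x)
    for _ in range(n_corr):             # corrector: trapezoidal rule
        x1 = x + 0.5 * dt * (v + v1)
        v1 = v + 0.5 * dt * (force(x) + force(x1))
    return x1, v1

# Double-well potential V(x) = (x^2 - 1)^2 with force F = -V'(x):
force = lambda x: -4.0 * x * (x**2 - 1.0)
x, v = 1.2, 0.0
E0 = (1.2**2 - 1.0) ** 2                # initial total energy (v = 0)
for _ in range(2000):
    x, v = trapezoidal_step(x, v, force, dt=0.01)
energy = 0.5 * v**2 + (x**2 - 1.0) ** 2
```

Because the trapezoidal rule is a symmetric method, the total energy stays close to its initial value over many oscillation periods, and the particle remains trapped in the right-hand well for this sub-barrier energy.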
