Results 1 - 20 of 61
1.
Stat Med ; 43(14): 2713-2733, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38690642

ABSTRACT

This article presents a novel method for learning time-varying dynamic Bayesian networks. The proposed method breaks down the dynamic Bayesian network learning problem into a sequence of regression inference problems and tackles each problem using the Markov neighborhood regression technique. Notably, the method scales with data dimensionality, accommodates time-varying network structure, and naturally handles multi-subject data. The proposed method is consistent and outperforms existing methods in terms of estimation accuracy and computational efficiency, as supported by extensive numerical experiments. To showcase its effectiveness, we apply the proposed method to an fMRI study investigating the effective connectivity among various regions of interest (ROIs) during an emotion-processing task. Our findings reveal the pivotal role of the subcortical-cerebellum in emotion processing.


Subject(s)
Bayes Theorem , Emotions , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Emotions/physiology , Markov Chains , Brain/diagnostic imaging , Brain/physiology , Computer Simulation
2.
J Appl Stat ; 50(11-12): 2624-2647, 2023.
Article in English | MEDLINE | ID: mdl-37529571

ABSTRACT

This paper proposes a dynamic infectious disease model for COVID-19 daily counts data and estimates the model using the Langevinized EnKF algorithm, which is scalable for large-scale spatio-temporal data, converges to the correct filtering distribution, and is thus suitable for performing statistical inference and quantifying uncertainty for the underlying dynamic system. Under the framework of the proposed dynamic infectious disease model, we tested the impact of temperature, precipitation, state emergency orders and stay-at-home orders on the spread of COVID-19 based on United States county-wise daily counts data. Our numerical results show that warm and humid weather can significantly slow the spread of COVID-19, and that state emergency and stay-at-home orders also help to slow it. This finding provides guidance and support for future policies or acts for mitigating community transmission and lowering the mortality rate of COVID-19.

3.
J Comput Graph Stat ; 32(2): 448-469, 2023.
Article in English | MEDLINE | ID: mdl-38240013

ABSTRACT

Inference for high-dimensional, large-scale and long-series dynamic systems is a challenging task in modern data science. Existing algorithms, such as the particle filter or the sequential importance sampler, do not scale well with the dimension of the system and the sample size of the dataset, and often suffer from sample degeneracy for long-series data. The recently proposed Langevinized ensemble Kalman filter (LEnKF) addresses these difficulties in a coherent way. However, it cannot be applied when the dynamic system contains unknown parameters. This article proposes the so-called stochastic approximation-LEnKF for jointly estimating the states and unknown parameters of the dynamic system, where the parameters are estimated on the fly based on the state variables simulated by the LEnKF under the framework of stochastic approximation Markov chain Monte Carlo (MCMC). Under mild conditions, we prove its consistency in parameter estimation and ergodicity in state variable simulations. The proposed algorithm can be used in uncertainty quantification for long-series, large-scale, and high-dimensional dynamic systems. Numerical results indicate its superiority over the existing algorithms. We employ the proposed algorithm in state-space modeling of the sea surface temperature with a long short-term memory (LSTM) network, which indicates its great potential in statistical analysis of complex dynamic systems encountered in modern data science. Supplementary materials for this article are available online.
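
For readers unfamiliar with ensemble Kalman filtering, the following is a minimal sketch of one analysis step of the standard stochastic (perturbed-observation) EnKF for a scalar state. It is not the Langevinized or stochastic approximation variant described in the abstract; the function name `enkf_analysis_1d`, the direct-observation setup (H = 1), and the toy numbers are all assumptions made for illustration.

```python
import random
import statistics

def enkf_analysis_1d(forecast, y, obs_var, rng):
    """Stochastic (perturbed-observation) EnKF analysis step for a scalar
    state that is observed directly (H = 1). Returns the updated ensemble."""
    p = statistics.variance(forecast)      # sample forecast variance
    gain = p / (p + obs_var)               # scalar Kalman gain
    sd = obs_var ** 0.5
    # Each member assimilates its own noisy copy of the observation.
    return [x + gain * (y + rng.gauss(0.0, sd) - x) for x in forecast]

rng = random.Random(0)
forecast = [rng.gauss(0.0, 1.0) for _ in range(500)]   # prior ensemble ~ N(0, 1)
analysis = enkf_analysis_1d(forecast, y=1.0, obs_var=1.0, rng=rng)
print(statistics.mean(analysis), statistics.variance(analysis))
```

In this toy setup the exact posterior is N(0.5, 0.5), so the printed ensemble mean and variance should both land near 0.5, up to Monte Carlo error.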

4.
Stat Med ; 41(20): 4057-4078, 2022 09 10.
Article in English | MEDLINE | ID: mdl-35688606

ABSTRACT

High-dimensional inference is one of the fundamental problems in modern biomedical studies. However, the existing methods do not perform satisfactorily. Based on the Markov property of graphical models and the likelihood ratio test, this article provides a simple justification for the Markov neighborhood regression method such that it can be applied to statistical inference for high-dimensional generalized linear models with mixed features. The Markov neighborhood regression method is highly attractive in that it breaks the high-dimensional inference problem into a series of low-dimensional inference problems. The proposed method is applied to the cancer cell line encyclopedia data for identification of the genes and mutations that are sensitive to the response of anti-cancer drugs. The numerical results favor the Markov neighborhood regression method over the existing ones.


Subject(s)
Models, Statistical , Humans , Likelihood Functions , Linear Models , Markov Chains , Regression Analysis
5.
J Stat Comput Simul ; 92(2): 318-336, 2022.
Article in English | MEDLINE | ID: mdl-35559269

ABSTRACT

We propose a class of adaptive stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms, where the drift function is adaptively adjusted according to the gradient of past samples to accelerate the convergence of the algorithm in simulations of the distributions with pathological curvatures. We establish the convergence of the proposed algorithms under mild conditions. The numerical examples indicate that the proposed algorithms can significantly outperform the popular SGMCMC algorithms, such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian Monte Carlo (SGHMC) and preconditioned SGLD, in both simulation and optimization tasks. In particular, the proposed algorithms can converge quickly for the distributions for which the energy landscape possesses pathological curvatures.
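
As background for the baseline named in the abstract, here is a minimal sketch of plain stochastic gradient Langevin dynamics (SGLD) on a toy problem: sampling the posterior of a Gaussian mean from minibatches. It does not implement the adaptive drift proposed in the paper; the function name `sgld_gaussian_mean`, the unit-variance likelihood, the N(0, 10^2) prior, and the step size are assumptions chosen for illustration.

```python
import math
import random

def sgld_gaussian_mean(data, step=1e-4, batch=50, iters=3000,
                       prior_sd=10.0, seed=0):
    """Plain SGLD for the mean theta of N(theta, 1) data with a
    N(0, prior_sd^2) prior. Returns the list of iterates."""
    rng = random.Random(seed)
    n_total = len(data)
    theta, samples = 0.0, []
    for _ in range(iters):
        minibatch = rng.sample(data, batch)
        # Unbiased minibatch estimate of the gradient of the log-posterior.
        grad_prior = -theta / prior_sd ** 2
        grad_lik = (n_total / batch) * sum(x - theta for x in minibatch)
        # Langevin step: half a step along the gradient plus Gaussian noise.
        theta += 0.5 * step * (grad_prior + grad_lik) + rng.gauss(0.0, math.sqrt(step))
        samples.append(theta)
    return samples

data_rng = random.Random(42)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
draws = sgld_gaussian_mean(data)[500:]          # discard burn-in
print(sum(draws) / len(draws))                  # should be close to the data mean
```

With a nearly flat prior the posterior mean is essentially the sample mean of the data, which gives a simple sanity check on the sampler.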

6.
J Am Stat Assoc ; 117(540): 1981-1995, 2022.
Article in English | MEDLINE | ID: mdl-36945326

ABSTRACT

Deep learning has been the engine powering many successes of data science. However, the deep neural network (DNN), as the basic model of deep learning, is often excessively over-parameterized, causing many difficulties in training, prediction and interpretation. We propose a frequentist-like method for learning sparse DNNs and justify its consistency under the Bayesian framework: the proposed method could learn a sparse DNN with at most O(n/log(n)) connections and nice theoretical guarantees such as posterior consistency, variable selection consistency and asymptotically optimal generalization bounds. In particular, we establish posterior consistency for the sparse DNN with a mixture Gaussian prior, show that the structure of the sparse DNN can be consistently determined using a Laplace approximation-based marginal posterior inclusion probability approach, and use Bayesian evidence to elicit sparse DNNs learned by an optimization method such as stochastic gradient descent in multiple runs with different initializations. The proposed method is computationally more efficient than standard Bayesian methods for large-scale sparse DNNs. The numerical results indicate that the proposed method can perform very well for large-scale network compression and high-dimensional nonlinear variable selection, both advancing interpretable machine learning.

7.
Stat Probab Lett ; 180, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34744226

ABSTRACT

Deep learning has achieved great success in many machine learning tasks. However, deep neural networks (DNNs) are often severely over-parameterized, making them computationally expensive, memory intensive, less interpretable and miscalibrated. We study sparse DNNs under the Bayesian framework: we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior, and illustrate their performance using examples on high-dimensional nonlinear variable selection, large network compression and model calibration. Our numerical results indicate that sparsity is essential for improving the prediction accuracy and calibration of the DNN.

8.
Biostatistics ; 22(2): 233-249, 2021 04 10.
Article in English | MEDLINE | ID: mdl-33838043

ABSTRACT

Motivated by the study of the molecular mechanism underlying type 1 diabetes with gene expression data collected from both patients and healthy controls at multiple time points, we propose a hybrid Bayesian method for jointly estimating multiple dependent Gaussian graphical models with data observed under distinct conditions, which avoids inversion of high-dimensional covariance matrices and thus can be executed very fast. We prove the consistency of the proposed method under mild conditions. The numerical results indicate the superiority of the proposed method over existing ones in both estimation accuracy and computational efficiency. Extension of the proposed method to joint estimation of multiple mixed graphical models is straightforward.


Subject(s)
Diabetes Mellitus, Type 1 , Gene Regulatory Networks , Bayes Theorem , Diabetes Mellitus, Type 1/genetics , Humans , Models, Statistical , Normal Distribution
9.
Stat ; 9(1), 2020 Dec.
Article in English | MEDLINE | ID: mdl-33223572

ABSTRACT

Graphical models have been used in many scientific fields for exploration of conditional independence relationships for a large set of random variables. Although a variety of methods have been proposed in the literature for estimating graphical models with different types of data, none of them is applicable to jointly estimating multiple mixed graphical models. To tackle this problem, we propose a joint mixed learning method. The proposed method is very flexible: it works for various mixed types of data, such as mixtures of Gaussian, multinomial, and Poisson variables, and also allows users to incorporate domain knowledge into network construction by restricting some links to be included in or excluded from the networks. As an application, the proposed method is applied to pan-cancer network analysis for six types of cancer with data from The Cancer Genome Atlas. To our knowledge, this is the first work on joint estimation of multiple mixed graphical models.

10.
J Am Stat Assoc ; 117(539): 1200-1214, 2020.
Article in English | MEDLINE | ID: mdl-36246416

ABSTRACT

This paper proposes an innovative method for constructing confidence intervals and assessing p-values in statistical inference for high-dimensional linear models. The proposed method breaks the high-dimensional inference problem into a series of low-dimensional inference problems: for each regression coefficient β_i, the confidence interval and p-value are computed by regressing on a subset of variables selected according to the conditional independence relations between the corresponding variable X_i and other variables. Since the subset of variables forms a Markov neighborhood of X_i in the Markov network formed by all the variables X_1, X_2, …, X_p, the proposed method is termed Markov neighborhood regression. The proposed method is tested on high-dimensional linear, logistic and Cox regression. The numerical results indicate that the proposed method significantly outperforms the existing ones. Based on the Markov neighborhood regression, a method of learning causal structures for high-dimensional linear models is proposed and applied to identification of drug sensitive genes and cancer driver genes. The idea of using conditional independence relations for dimension reduction is general and potentially can be extended to other high-dimensional or big data problems as well.

11.
Biometrika ; 107(4): 997-1004, 2020 Dec.
Article in English | MEDLINE | ID: mdl-34305153

ABSTRACT

Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms. The proposed algorithm substantially eases the use of Bayesian methods in big data computing.

12.
Adv Neural Inf Process Syst ; 34: 15725-15736, 2020 Dec.
Article in English | MEDLINE | ID: mdl-34556969

ABSTRACT

We propose an adaptively weighted stochastic gradient Langevin dynamics (SGLD) algorithm, called contour stochastic gradient Langevin dynamics (CSGLD), for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that the simulation for a multi-modal distribution can be greatly facilitated. Theoretically, we prove a stability condition and establish the asymptotic convergence of the self-adapting parameter to a unique fixed-point, regardless of the non-convexity of the original energy function; we also present an error analysis for the weighted averaging estimators. Empirically, the CSGLD algorithm is tested on multiple benchmark datasets including CIFAR10 and CIFAR100. The numerical results indicate its superiority over the existing state-of-the-art algorithms in training deep neural networks.

13.
Proc Mach Learn Res ; 119: 2474-2483, 2020 Jul.
Article in English | MEDLINE | ID: mdl-34557675

ABSTRACT

Replica exchange Monte Carlo (reMC), also known as parallel tempering, is an important technique for accelerating the convergence of conventional Markov chain Monte Carlo (MCMC) algorithms. However, the method requires evaluating the energy function on the full dataset and is therefore not scalable to big data. A naïve implementation of reMC in mini-batch settings introduces large biases, so reMC cannot be directly extended to stochastic gradient MCMC (SGMCMC), the standard sampling method for simulating from deep neural networks (DNNs). In this paper, we propose an adaptive replica exchange SGMCMC (reSGMCMC) to automatically correct the bias and study the corresponding properties. The analysis implies an acceleration-accuracy trade-off in the numerical discretization of a Markov jump process in a stochastic environment. Empirically, we test the algorithm through extensive experiments on various setups and obtain state-of-the-art results on CIFAR10, CIFAR100, and SVHN in both supervised learning and semi-supervised learning tasks.
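
To make the swap mechanism concrete, here is a minimal sketch of conventional full-data replica exchange (parallel tempering) with two random-walk Metropolis chains on a bimodal one-dimensional target. It is not the bias-corrected stochastic gradient version proposed in the paper; the target, the temperatures, the proposal scales, and the names `energy`, `accept`, and `parallel_tempering` are assumptions made for illustration.

```python
import math
import random

def energy(x, s=0.5):
    """Negative log-density (up to a constant) of an equal mixture of
    N(-3, s^2) and N(3, s^2), computed with a stable log-sum-exp."""
    a = -(x - 3.0) ** 2 / (2 * s * s)
    b = -(x + 3.0) ** 2 / (2 * s * s)
    m = max(a, b)
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))

def accept(log_ratio, rng):
    """Metropolis acceptance test for a given log acceptance ratio."""
    return rng.random() < math.exp(min(0.0, log_ratio))

def parallel_tempering(iters=20000, swap_every=10, seed=0):
    rng = random.Random(seed)
    betas = [1.0, 0.1]          # inverse temperatures: cold chain, hot chain
    steps = [0.5, 1.5]          # random-walk proposal scales
    xs = [3.0, 3.0]             # both chains start in the right-hand mode
    cold = []
    for t in range(iters):
        for k in range(2):      # one Metropolis update per chain
            prop = xs[k] + rng.gauss(0.0, steps[k])
            if accept(-betas[k] * (energy(prop) - energy(xs[k])), rng):
                xs[k] = prop
        if t % swap_every == 0:  # propose exchanging the two states
            log_ratio = (betas[0] - betas[1]) * (energy(xs[0]) - energy(xs[1]))
            if accept(log_ratio, rng):
                xs[0], xs[1] = xs[1], xs[0]
        cold.append(xs[0])
    return cold

cold = parallel_tempering()
print(sum(x > 0 for x in cold) / len(cold))   # fraction of time in the right mode
```

The hot chain crosses the barrier between the modes easily, and accepted swaps hand those crossings to the cold chain, so the cold chain should spend roughly half its time in each mode even though a single cold random walk would rarely leave its starting mode.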

14.
IEEE Trans Med Imaging ; 39(2): 357-365, 2020 02.
Article in English | MEDLINE | ID: mdl-31283500

ABSTRACT

Adolescence is a transitional period between childhood and adulthood with physical changes, as well as increasing emotional development. Studies have shown that emotional sensitivity is related to a second period of rapid brain growth. However, there is little focus on the trend of brain development during this period. In this paper, we aim to track functional brain connectivity development from late childhood to young adulthood. Mathematically, this problem can be modeled via the estimation of multiple Gaussian graphical models (GGMs). However, most existing methods either require the graph sequence to be fairly long or are only applicable to small graphs. In this paper, we adapted a computationally efficient Bayesian approach incorporating joint estimation of multiple GGMs to overcome the short sequence difficulty. The data used are the functional magnetic resonance imaging (fMRI) images obtained from the publicly available Philadelphia Neurodevelopmental Cohort (PNC). They include 855 individuals aged 8-22 years who were divided into five different adolescent stages. We summarized the networks with global measurements and applied a hypothesis test across age groups to detect the developmental patterns. Three patterns were detected and defined as consistent development, late puberty, and temporal change. We also discovered several anatomical areas, such as the middle frontal gyrus, putamen gyrus, right lingual gyrus, and right cerebellum crus 2, that are highly involved in the brain functional development. The functional networks, including the salience, subcortical, and auditory networks, develop significantly during the adolescent period.


Subject(s)
Adolescent Development/physiology , Brain/diagnostic imaging , Brain/physiology , Functional Neuroimaging/methods , Magnetic Resonance Imaging/methods , Adolescent , Adult , Bayes Theorem , Child , Humans , Nerve Net/diagnostic imaging , Normal Distribution , Young Adult
15.
Neural Comput ; 31(6): 1183-1214, 2019 06.
Article in English | MEDLINE | ID: mdl-30979349

ABSTRACT

Bayesian networks have been widely used in many scientific fields for describing the conditional independence relationships for a large set of random variables. This letter proposes a novel algorithm, the so-called p-learning algorithm, for learning moral graphs for high-dimensional Bayesian networks. The moral graph is a Markov network representation of the Bayesian network and also the key to construction of the Bayesian network for constraint-based algorithms. The consistency of the p-learning algorithm is justified under the small-n, large-p scenario. The numerical results indicate that the p-learning algorithm significantly outperforms the existing ones, such as the PC, grow-shrink, incremental association, semi-interleaved hiton, hill-climbing, and max-min hill-climbing. Under the sparsity assumption, the p-learning algorithm has a computational complexity of O(p^2) even in the worst case, while the existing algorithms have a computational complexity of O(p^3) in the worst case.


Subject(s)
Algorithms , Bayes Theorem , Neural Networks, Computer , Humans
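
The moral graph mentioned in the abstract above has a simple construction when the directed graph is already known: connect ("marry") every pair of parents that share a child, then drop edge directions. The sketch below shows only that definition on a toy DAG; it is not the p-learning algorithm, which learns the moral graph from data. The function name `moral_graph` and the dict-of-parents representation are assumptions made for illustration.

```python
from itertools import combinations

def moral_graph(parents):
    """Return the undirected edge set of the moral graph of a DAG given as
    a dict mapping each node to the list of its parents."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                       # keep every arc, dropping direction
            edges.add(frozenset((p, child)))
        for u, v in combinations(pa, 2):   # "marry" parents sharing a child
            edges.add(frozenset((u, v)))
    return edges

# Toy DAG: A -> C <- B, C -> D
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
for edge in sorted(tuple(sorted(e)) for e in moral_graph(dag)):
    print(edge)
```

For the toy DAG the moral graph has the edges A-B, A-C, B-C and C-D, where A-B is the edge added by marrying the two parents of C.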
16.
Stat Comput ; 29(1): 23-32, 2019 Jan.
Article in English | MEDLINE | ID: mdl-31011242

ABSTRACT

This paper proposes a simple, practical and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo (Pop-SAMC) algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since this algorithm involves two levels of parallelism, data-parallel and simulation-parallel, it is termed "Double Parallel Monte Carlo". The validity of the proposed algorithm is justified mathematically and numerically.
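
The divide-and-aggregate idea can be illustrated on a toy conjugate model in which each subset posterior is available in closed form. The sketch below uses a precision-weighted combination of Gaussian subset posteriors, a rule that is exact only in this Gaussian, flat-prior setting. The abstract does not specify the paper's aggregation rule, which may differ, and the sketch does not use Pop-SAMC; the names `subset_posterior` and `combine` are assumptions made for illustration.

```python
import random

def subset_posterior(chunk, sigma2=1.0):
    """Posterior of a Gaussian mean under a flat prior and known variance:
    N(mean(chunk), sigma2 / len(chunk)). Returns (mean, variance)."""
    return sum(chunk) / len(chunk), sigma2 / len(chunk)

def combine(posteriors):
    """Precision-weighted combination of Gaussian subset posteriors."""
    precisions = [1.0 / v for _, v in posteriors]
    total = sum(precisions)
    mean = sum(w * m for w, (m, _) in zip(precisions, posteriors)) / total
    return mean, 1.0 / total

rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(1200)]
chunks = [data[i::4] for i in range(4)]              # four disjoint subsets
mean, var = combine([subset_posterior(c) for c in chunks])
print(mean, var)
```

In this setting the combined posterior coincides with the full-data posterior N(mean(data), 1/1200), which makes the toy example easy to check; for non-Gaussian posteriors any such combination is only an approximation.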

17.
PLoS One ; 14(2): e0212108, 2019.
Article in English | MEDLINE | ID: mdl-30811440

ABSTRACT

This paper proposes a mixture regression model-based method for drug sensitivity prediction. The proposed method explicitly addresses two fundamental issues in drug sensitivity prediction, namely, population heterogeneity and feature selection pertaining to each of the subpopulations. The mixture regression model is estimated using the imputation-conditional consistency algorithm, and the resulting estimator is consistent. This paper also proposes an average-BIC criterion for determining the number of components for the mixture regression model. The proposed method is applied to the CCLE dataset, and the numerical results indicate that the proposed method can make a drastic improvement over the existing ones, such as random forest, support vector regression, and regularized linear regression, in both drug sensitivity prediction and feature selection. The p-values for the comparisons in drug sensitivity prediction can reach the order O(10^-8) or lower for the drugs with heterogeneous populations.


Subject(s)
Computational Biology , Drug Resistance , Precision Medicine , Regression Analysis , Support Vector Machine
18.
Stat Interface ; 12(3): 377-385, 2019.
Article in English | MEDLINE | ID: mdl-33859774

ABSTRACT

Restricted Boltzmann machines (RBMs) have become a popular tool for feature coding or extraction in unsupervised learning in recent years. However, an efficient algorithm for training the RBM is still lacking because its likelihood function contains an intractable normalizing constant. The existing algorithms, such as contrastive divergence and its variants, approximate the gradient of the likelihood function using Markov chain Monte Carlo. However, the approximation is time consuming and, moreover, the approximation error often impedes the convergence of the training algorithm. This paper proposes a fast algorithm for training RBMs by treating the hidden states as missing data and then estimating the parameters of the RBM via an iterative conditional maximum likelihood estimation approach, which avoids the issue of intractable normalizing constants. The numerical results indicate that the proposed algorithm can provide a drastic improvement over the contrastive divergence algorithm in RBM training. This paper also presents an extension of the proposed algorithm to cope with missing data in RBM training and illustrates its application using an example of drug-target interaction prediction.

19.
Ann Appl Stat ; 13(3): 1708-1732, 2019 Sep.
Article in English | MEDLINE | ID: mdl-34349870

ABSTRACT

With the advance of imaging technology, digital pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high resolution. Recent developments in deep-learning methods have enabled us to identify and classify individual cells from digital pathology images at large scale. Reliable statistical approaches to model the spatial pattern of cells can provide new insight into tumor progression and shed light on the biological mechanisms of cancer. We consider the problem of modeling spatial correlations among three commonly seen cells observed in tumor pathology images. A novel geostatistical marking model with interpretable underlying parameters is proposed in a Bayesian framework. We use auxiliary variable MCMC algorithms to sample from the posterior distribution with an intractable normalizing constant. We demonstrate how this model-based analysis can lead to sharper inferences than ordinary exploratory analyses, by means of application to three benchmark datasets and a case study on the pathology images of 188 lung cancer patients. The case study shows that the spatial correlation between tumor and stromal cells predicts patient prognosis. This statistical methodology not only presents a new model for characterizing spatial correlations in a multitype spatial point pattern conditioning on the locations of the points, but also provides a new perspective for understanding the role of cell-cell interactions in cancer progression.

20.
Biostatistics ; 20(4): 565-581, 2019 10 01.
Article in English | MEDLINE | ID: mdl-29788035

ABSTRACT

Digital pathology imaging of tumor tissues, which captures histological details in high resolution, is fast becoming a routine clinical procedure. Recent developments in deep-learning methods have enabled the identification, characterization, and classification of individual cells from pathology image analysis at a large scale. This creates new opportunities to study the spatial patterns of and interactions among different types of cells. Reliable statistical approaches to modeling such spatial patterns and interactions can provide insight into tumor progression and shed light on the biological mechanisms of cancer. In this article, we consider the problem of modeling a pathology image with irregular locations of three different types of cells: lymphocyte, stromal, and tumor cells. We propose a novel Bayesian hierarchical model, which incorporates a hidden Potts model to project the irregularly distributed cells to a square lattice and a Markov random field prior model to identify regions in a heterogeneous pathology image. The model allows us to quantify the interactions between different types of cells, some of which are clinically meaningful. We use Markov chain Monte Carlo sampling techniques, combined with a double Metropolis-Hastings algorithm, in order to simulate samples approximately from a distribution with an intractable normalizing constant. The proposed model was applied to the pathology images of 205 lung cancer patients from the National Lung Screening Trial, and the results show that the interaction strength between tumor and stromal cells predicts patient prognosis (P = 0.005). This statistical methodology provides a new perspective for understanding the role of cell-cell interactions in cancer progression.


Subject(s)
Algorithms , Image Interpretation, Computer-Assisted , Lung Neoplasms/diagnostic imaging , Lung Neoplasms/pathology , Models, Statistical , Bayes Theorem , Humans , Markov Chains , Monte Carlo Method