Results 1 - 20 of 442
1.
Stat Med ; 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39260448

ABSTRACT

Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to nonrobust ones for identifying important genes associated with heterogeneous disease traits and building superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantages in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood given by the asymmetric Laplace distribution (ALD). The proposed robust method inherits the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Assoc, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has the computational advantage of locating posterior modes via soft-thresholding-rule-guided Expectation-Maximization (EM) steps within a coordinate descent framework, a property rarely available for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competitors. The advantage of the proposed method is further demonstrated in case studies of lung adenocarcinoma (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
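For orientation, the standard identities behind the asymmetric Laplace working likelihood and the soft-thresholding update mentioned above are sketched below; this is textbook material linking the ALD to check-loss (quantile) estimation, not the paper's specific spike-and-slab hierarchy.

f(y \mid \mu, \sigma, \tau) = \frac{\tau(1-\tau)}{\sigma}\exp\!\left\{-\rho_\tau\!\left(\frac{y-\mu}{\sigma}\right)\right\},
\qquad \rho_\tau(u) = u\,\{\tau - I(u<0)\},

so maximizing the ALD likelihood in the regression coefficients is equivalent to minimizing the check loss \sum_i \rho_\tau(y_i - x_i^\top \beta); the soft-thresholding operator driving the coordinate-wise EM updates is S(z, \lambda) = \operatorname{sign}(z)\,(|z| - \lambda)_+.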

2.
Biometrika ; 111(3): 971-988, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39239267

ABSTRACT

Interval-censored multistate data arise in many studies of chronic diseases, where the health status of a subject can be characterized by a finite number of disease states and the transition between any two states is only known to occur over a broad time interval. We relate potentially time-dependent covariates to multistate processes through semiparametric proportional intensity models with random effects. We study nonparametric maximum likelihood estimation under general interval censoring and develop a stable expectation-maximization algorithm. We show that the resulting parameter estimators are consistent and that the finite-dimensional components are asymptotically normal with a covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we demonstrate through extensive simulation studies that the proposed numerical and inferential procedures perform well in realistic settings. Finally, we provide an application to a major epidemiologic cohort study.
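As a point of reference, a generic proportional intensity specification with a subject-level random effect, consistent with (though not necessarily identical to) the model class described above, is

\lambda_{kl}(t \mid Z, b) = \lambda_{0,kl}(t)\,\exp\{\beta_{kl}^\top Z(t) + b\},
\qquad b \sim N(0, \sigma^2),

for the transition from state k to state l, where \lambda_{0,kl}(\cdot) is an unspecified baseline intensity; under interval censoring only the state occupied at each examination time is observed, and the nonparametric maximum likelihood estimator typically treats the cumulative baseline intensities as step functions with jumps at the observed examination times.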

3.
Sensors (Basel) ; 24(17)2024 Aug 28.
Article in English | MEDLINE | ID: mdl-39275466

ABSTRACT

China's rail transit system is developing rapidly, but achieving seamless high-precision localization of trains throughout the entire route in closed environments such as tunnels and culverts still faces significant challenges. Traditional localization technologies cannot meet current demands, and the present paper proposes an autonomous localization method for trains based on pulse observation in a tunnel environment. First, the Letts criterion is used to eliminate abnormal gyro data, the CEEMDAN method is employed for signal decomposition, and the decomposed signals are classified using the continuous mean square error and norm method. Noise reduction is performed using forward linear filtering and dynamic threshold filtering, respectively, maximizing the retention of its effective signal components. A SINS/OD integrated localization model is established, and an observation equation is constructed based on velocity matching, resulting in an 18-dimensional complex state space model. Finally, the EM algorithm is used to address Non-Line-Of-Sight and multipath effect errors. The optimized model is then applied in the Kalman filter to better adapt to the system's observation conditions. By dynamically adjusting the noise covariance, the localization system can continue to maintain continuous high-precision position information output in a tunnel environment.
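To make the last step concrete, here is a minimal, illustrative sketch (assuming a scalar velocity-type measurement and a simple sliding-window innovation rule; it is not the paper's 18-state SINS/OD model or its EM-based noise estimation) of a Kalman filter that adapts its measurement-noise variance online.

import numpy as np

def adaptive_kalman(zs, F, H, Q, R0, x0, P0, window=20, r_floor=1e-6):
    """Scalar-measurement Kalman filter; H has shape (1, n), zs is a 1-D array."""
    x, P, R = np.asarray(x0, float), np.asarray(P0, float), float(R0)
    innov_sq, estimates = [], []
    for z in zs:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # innovation (scalar) and its variance
        y = float(z) - (H @ x).item()
        S = (H @ P @ H.T).item() + R
        K = (P @ H.T) / S                      # (n, 1) Kalman gain
        # update
        x = x + (K * y).ravel()
        P = (np.eye(len(x)) - K @ H) @ P
        # adapt R by moment matching on a sliding window of squared innovations
        innov_sq.append(y * y)
        if len(innov_sq) >= window:
            R = max(np.mean(innov_sq[-window:]) - (H @ P @ H.T).item(), r_floor)
        estimates.append(x.copy())
    return np.array(estimates)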

4.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39315604

ABSTRACT

We address the challenge of estimating regression coefficients and selecting relevant predictors in the context of mixed linear regression in high dimensions, where the number of predictors greatly exceeds the sample size. Recent advancements in this field have centered on incorporating sparsity-inducing penalties into the expectation-maximization (EM) algorithm, which seeks to maximize the conditional likelihood of the response given the predictors. However, existing procedures often treat predictors as fixed or overlook their inherent variability. In this paper, we leverage the independence between the predictor and the latent indicator variable of mixtures to facilitate efficient computation and also achieve synergistic variable selection across all mixture components. We establish the non-asymptotic convergence rate of the proposed fast group-penalized EM estimator to the true regression parameters. The effectiveness of our method is demonstrated through extensive simulations and an application to the Cancer Cell Line Encyclopedia dataset for the prediction of anticancer drug sensitivity.
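As a rough illustration of the penalized-EM idea (not the authors' estimator: this sketch assumes two components, Gaussian errors, a plain lasso penalty instead of a group penalty, and pre-centered data), one EM iteration alternates posterior responsibilities with weighted penalized fits:

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso  # Lasso.fit accepts sample_weight in recent scikit-learn

def penalized_em_mixreg(X, y, lam=0.1, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = 0.5
    beta = rng.normal(scale=0.1, size=(2, p))
    sigma = np.array([y.std(), y.std()])
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation
        d1 = pi * norm.pdf(y, X @ beta[0], sigma[0])
        d2 = (1 - pi) * norm.pdf(y, X @ beta[1], sigma[1])
        w = d1 / (d1 + d2 + 1e-300)
        # M-step: weighted lasso fit, residual scale, and mixing proportion
        for k, wk in enumerate((w, 1 - w)):
            fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y, sample_weight=wk)
            beta[k] = fit.coef_
            resid = y - X @ beta[k]
            sigma[k] = max(np.sqrt(np.sum(wk * resid**2) / wk.sum()), 1e-3)
        pi = w.mean()
    return pi, beta, sigma

The group-penalized, high-dimensional estimator studied in the paper differs mainly in replacing the per-component lasso with a penalty shared across components, which is what yields the synergistic variable selection described above.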


Subject(s)
Algorithms , Computer Simulation , Humans , Linear Models , Likelihood Functions , Antineoplastic Agents/therapeutic use , Antineoplastic Agents/pharmacology , Cell Line, Tumor , Neoplasms/drug therapy , Sample Size , Models, Statistical , Biometry/methods
5.
J Appl Stat ; 51(11): 2090-2115, 2024.
Article in English | MEDLINE | ID: mdl-39247655

ABSTRACT

Osteoporosis is a metabolic bone disorder that is characterized by reduced bone mineral density (BMD) and deterioration of bone microarchitecture. Osteoporosis is highly prevalent among women over 50, leading to skeletal fragility and risk of fracture. Early diagnosis and treatment of those at high risk for fracture is very important in order to avoid morbidity, mortality and economic burden from preventable fractures. The province of Manitoba established a BMD testing program in 1997. The Manitoba BMD registry is now the largest population-based BMD registry in the world, and has detailed information on fracture outcomes and other covariates for over 160,000 BMD assessments. In this paper, we develop a number of methodologies based on ranked-set type sampling designs to estimate the prevalence of osteoporosis among women of age 50 and older in the province of Manitoba. We use a parametric approach based on finite mixture models, as well as the usual approaches using simple random and stratified sampling designs. Results are obtained under perfect and imperfect ranking scenarios while the sampling and ranking costs are incorporated into the study. We observe that rank-based methodologies can be used as cost-efficient methods to monitor the prevalence of osteoporosis.
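The ranked-set sampling mechanism underlying these designs can be sketched in a few lines (illustrative only: ranking is done on a cheap concomitant variable, with optional noise to mimic imperfect ranking; the set size, number of cycles, and noise model are hypothetical choices, not those of the Manitoba study):

import numpy as np

def ranked_set_sample(y, concomitant, k=3, cycles=10, ranking_noise=0.0, seed=1):
    """Draw a ranked-set sample of size k * cycles from outcome vector y."""
    rng = np.random.default_rng(seed)
    sample = []
    for _ in range(cycles):
        for r in range(k):                      # target rank within each set
            idx = rng.choice(len(y), size=k, replace=False)
            noisy = concomitant[idx] + rng.normal(0, ranking_noise, k)
            chosen = idx[np.argsort(noisy)[r]]  # imperfect ranking when noise > 0
            sample.append(y[chosen])
    return np.array(sample)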

6.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39282732

ABSTRACT

We develop a methodology for valid inference after variable selection in logistic regression when the responses are partially observed, that is, when one observes a set of error-prone testing outcomes instead of the true values of the responses. Aiming at selecting important covariates while accounting for missing information in the response data, we apply the expectation-maximization algorithm to compute maximum likelihood estimators subject to LASSO penalization. Subsequent to variable selection, we make inferences on the selected covariate effects by extending post-selection inference methodology based on the polyhedral lemma. Empirical evidence from our extensive simulation study suggests that our post-selection inference results are more reliable than those from naive inference methods that use the same data to perform variable selection and inference without adjusting for variable selection.
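A stripped-down sketch of the underlying EM (assuming known test sensitivity and specificity, no lasso penalty, and no post-selection adjustment, so it illustrates only the missing-response mechanics rather than the paper's full procedure):

import numpy as np
from sklearn.linear_model import LogisticRegression

def em_logistic_misclassified(X, t, sens=0.90, spec=0.95, n_iter=30):
    """t: observed error-prone test results (0/1, numpy array); true Y is latent."""
    n = len(t)
    w = np.full(n, t.mean())                        # current P(Y = 1 | data)
    for _ in range(n_iter):
        # M-step: weighted logistic fit on duplicated rows labelled 1 and 0
        Xd = np.vstack([X, X])
        yd = np.concatenate([np.ones(n), np.zeros(n)])
        wd = np.concatenate([w, 1 - w])
        model = LogisticRegression(C=1e6).fit(Xd, yd, sample_weight=wd)  # ~unpenalized
        p = model.predict_proba(X)[:, 1]            # P(Y = 1 | x) under current fit
        # E-step: posterior P(Y = 1 | T, x) via Bayes' rule with sens/spec
        lik1 = np.where(t == 1, sens, 1 - sens)     # P(T = t | Y = 1)
        lik0 = np.where(t == 1, 1 - spec, spec)     # P(T = t | Y = 0)
        w = lik1 * p / (lik1 * p + lik0 * (1 - p))
    return model, w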


Subject(s)
Algorithms , Computer Simulation , Likelihood Functions , Humans , Logistic Models , Data Interpretation, Statistical , Biometry/methods , Models, Statistical
7.
Stat Med ; 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39189687

ABSTRACT

Mild cognitive impairment (MCI) is a prodromal stage of Alzheimer's disease (AD) that causes a significant burden in caregiving and medical costs. Clinically, the diagnosis of MCI is determined by the impairment statuses of five cognitive domains. If one of these cognitive domains is impaired, the patient is diagnosed with MCI, and if two out of the five domains are impaired, the patient is diagnosed with AD. In medical records, most of the time, the diagnosis of MCI/AD is given, but not the statuses of the five domains. We may treat the domain statuses as missing variables. This diagnostic procedure relates MCI/AD status modeling to multiple-instance learning, where each domain resembles an instance. However, traditional multiple-instance learning assumes common predictors among instances, but in our case, each domain is associated with different predictors. In this article, we generalized the multiple-instance logistic regression to accommodate the heterogeneity in predictors among different instances. The proposed model is dubbed heterogeneous-instance logistic regression and is estimated via the expectation-maximization algorithm because of the presence of the missing variables. We also derived two variants of the proposed model for the MCI and AD diagnoses. The proposed model is validated in terms of its estimation accuracy, latent status prediction, and robustness via extensive simulation studies. Finally, we analyzed the National Alzheimer's Coordinating Center-Uniform Data Set using the proposed model and demonstrated its potential.
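Schematically, with domain-specific predictors x_d and coefficients \beta_d, and assuming conditional independence of the domain statuses given covariates (notation illustrative; the paper's exact parameterization and diagnostic rules may differ), the heterogeneous-instance structure can be written as

p_d = P(\text{domain } d \text{ impaired} \mid x_d) = \operatorname{expit}(\beta_d^\top x_d), \qquad d = 1, \dots, 5,

P(\text{at least one domain impaired}) = 1 - \prod_{d=1}^{5} (1 - p_d),
\qquad
P(\text{at least two impaired}) = 1 - \prod_{d=1}^{5} (1 - p_d) - \sum_{d=1}^{5} p_d \prod_{d' \ne d} (1 - p_{d'}),

and the EM algorithm treats the unobserved domain statuses as missing data whose posterior probabilities are updated given the recorded MCI/AD diagnosis.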

8.
Comput Stat ; 39(5): 2743-2769, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39176239

ABSTRACT

We consider interval-censored data with a cured subgroup that arise from longitudinal follow-up studies with a heterogeneous population, where a certain proportion of subjects is not susceptible to the event of interest. We propose a two-component mixture cure model, where the first component, describing the probability of cure, is modeled by a support vector machine-based approach, and the second component, describing the survival distribution of the uncured group, is modeled by a proportional hazards structure. Our proposed model provides flexibility in capturing complex effects of covariates on the probability of cure, unlike traditional models that rely on modeling the cure probability using a generalized linear model with a known link function. For the estimation of model parameters, we develop an expectation-maximization-based estimation algorithm. We conduct simulation studies and show that our proposed model performs better in capturing complex effects of covariates on the cure probability when compared to the traditional logit-link-based two-component mixture cure model. This results in more accurate (smaller bias) and more precise (smaller mean square error) estimates of the cure probabilities, which in turn improves the predictive accuracy of the latent cured status. We further show that our model's ability to capture complex covariate effects also improves the estimation results corresponding to the survival distribution of the uncured. Finally, we apply the proposed model and estimation procedure to interval-censored data on smoking cessation.
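The generic two-component mixture cure decomposition behind such models (with \pi(z) the probability of being uncured and S_u the survival function of the susceptible group; the SVM-based component in the paper replaces the usual parametric form of \pi) is

S_{\text{pop}}(t \mid x, z) = \{1 - \pi(z)\} + \pi(z)\, S_u(t \mid x),
\qquad
S_u(t \mid x) = S_0(t)^{\exp(\beta^\top x)},

where the second display is the proportional hazards form for the uncured, and the latent cure indicator is the missing variable handled by the EM algorithm.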

9.
Commun Stat Theory Methods ; 53(17): 6038-6054, 2024.
Article in English | MEDLINE | ID: mdl-39100716

ABSTRACT

Phase IV clinical trials are designed to monitor long-term side effects of medical treatment. For instance, childhood cancer survivors treated with chest radiation and/or anthracycline are often at risk of developing cardiotoxicity during their adulthood. Often, the primary focus of a study is on estimating the cumulative incidence of a particular outcome of interest, such as cardiotoxicity. However, it is challenging to evaluate patients continuously, and usually this information is collected through cross-sectional surveys by following patients longitudinally. This leads to interval-censored data, since the exact time of onset of the toxicity is unknown. Rai et al. computed the transition intensity rate using a parametric model and estimated the parameters with a maximum likelihood approach in an illness-death model. However, such an approach may not be suitable if the underlying parametric assumptions do not hold. This manuscript proposes a semi-parametric model, with a logit relationship for the transition intensities in the two treatment groups, to estimate the transition intensity rates within the context of an illness-death model. The estimation of the parameters is done using an EM algorithm with profile likelihood. Results from the simulation studies suggest that the proposed approach is easy to implement and yields results comparable to those from the parametric model.

10.
bioRxiv ; 2024 Aug 06.
Article in English | MEDLINE | ID: mdl-39149243

ABSTRACT

Cellular deconvolution aims to estimate cell type fractions from bulk transcriptomic and other omics data. Most existing deconvolution methods fail to account for the heterogeneity in cell type-specific (CTS) expression across bulk samples, ignore discrepancies between CTS expression in bulk and cell type reference data, and provide no guidance on cell type reference selection or integration. To address these issues, we introduce BLEND, a hierarchical Bayesian method that leverages multiple reference datasets. BLEND learns the most suitable references for each bulk sample by exploring the convex hulls of references and employs a "bag-of-words" representation for bulk count data for deconvolution. To speed up the computation, we provide an efficient EM algorithm for parameter estimation. Notably, BLEND requires no data transformation, normalization, cell type marker gene selection, or reference quality evaluation. Benchmarking studies on both simulated and real human brain data highlight BLEND's superior performance in various scenarios. The analysis of Alzheimer's disease data illustrates BLEND's application in real data and reference resource integration.
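For context, the basic deconvolution relationship that such methods build on (notation illustrative; BLEND's hierarchical, multi-reference "bag-of-words" formulation adds structure on top of this) is, for gene g in bulk sample i,

\mathrm{E}[y_{gi}] \propto \sum_{k=1}^{K} \pi_{ik}\, \mu_{gk},
\qquad \pi_{ik} \ge 0, \quad \sum_{k=1}^{K} \pi_{ik} = 1,

where \pi_{ik} is the fraction of cell type k in sample i and \mu_{gk} is the cell type-specific reference expression; the EM algorithm estimates the fractions (and, in BLEND, the sample-specific blend of references) from the observed counts.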

11.
Psychometrika ; 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38967857

ABSTRACT

Cognitive diagnostic models (CDMs) are a popular family of discrete latent variable models that model students' mastery or deficiency of multiple fine-grained skills. CDMs have been most widely used to model categorical item response data such as binary or polytomous responses. With advances in technology and the emergence of varying test formats in modern educational assessments, new response types, including continuous responses such as response times, and count-valued responses from tests with repetitive tasks or eye-tracking sensors, have also become available. Variants of CDMs have been proposed recently for modeling such responses. However, whether these extended CDMs are identifiable and estimable is entirely unknown. We propose a very general cognitive diagnostic modeling framework for arbitrary types of multivariate responses with minimal assumptions, and establish identifiability in this general setting. Surprisingly, we prove that our general-response CDMs are identifiable under Q-matrix-based conditions similar to those for traditional categorical-response CDMs. Our conclusions set up a new paradigm of identifiable general-response CDMs. We propose an EM algorithm to efficiently estimate a broad class of exponential family-based general-response CDMs. We conduct simulation studies under various response types. The simulation results not only corroborate our identifiability theory, but also demonstrate the superior empirical performance of our estimation algorithms. We illustrate our methodology by applying it to a TIMSS 2019 response time dataset.

12.
Syst Biol ; 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38970346

ABSTRACT

Dating phylogenetic trees to obtain branch lengths in time units is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric distribution of branch rates differs vastly from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model makes fewer assumptions about the true distribution of branch rates than parametric models such as the Gamma or LogNormal distribution. Our results on simulated and real datasets of Angiosperms and HIV, under a wide selection of rate distributions, show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.
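Schematically (notation illustrative, not taken from the paper), discretizing the rate distribution into k categories r_1, \dots, r_k turns the dating likelihood into a finite mixture over each branch e with time-length \tau_e,

L(\tau, r) = \prod_{e} \sum_{c=1}^{k} \omega_c\, f\!\big(b_e \mid \mu_e = r_c \tau_e\big),

where b_e is the branch length estimated in substitution units, f is an error model for that estimate, and \omega_c are category weights (e.g., uniform 1/k); an EM algorithm would alternate posterior category assignments for the branches (E-step) with updates of the rates and node ages (M-step).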

13.
J Indian Soc Probab Stat ; 25: 17-45, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39070705

ABSTRACT

Studies/trials assessing the status and progression of periodontal disease (PD) usually focus on quantifying the relationship between the clustered (tooth-within-subject) bivariate endpoints, such as probed pocket depth (PPD) and clinical attachment level (CAL), and the covariates. Although assumptions of multivariate normality can be invoked for the random terms (random effects and errors) under a linear mixed model (LMM) framework, violations of those assumptions may lead to imprecise inference. Furthermore, the response-covariate relationship may not be linear, as assumed under an LMM fit, and the regression estimates obtained therein do not provide an overall summary of the risk of PD, as obtained from the covariates. Motivated by a PD study on Gullah-speaking African-American Type-2 diabetics, we cast the asymmetric clustered bivariate (PPD and CAL) responses into a non-linear mixed model framework, where both random terms follow the multivariate asymmetric Laplace distribution (ALD). In order to provide a one-number risk summary, the possible non-linearity in the relationship is modeled via a single-index model, powered by polynomial spline approximations for the index functions and the normal mixture expression for the ALD. To proceed with a maximum-likelihood inferential setup, we devise an elegant EM-type algorithm. Moreover, the large-sample theoretical properties are established under some mild conditions. Simulation studies using synthetic data generated under a variety of scenarios were used to study the finite-sample properties of our estimators, and demonstrate that our proposed model and estimation algorithm can efficiently handle asymmetric, heavy-tailed data with outliers. Finally, we illustrate our proposed methodology via application to the motivating PD study.

14.
Stat Med ; 43(20): 3899-3920, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-38932470

ABSTRACT

Motivated by a DNA methylation application, this article addresses the problem of fitting and inferring a multivariate binomial regression model for outcomes that are contaminated by errors and exhibit extra-parametric variations, also known as dispersion. While dispersion in univariate binomial regression has been extensively studied, addressing dispersion in the context of multivariate outcomes remains a complex and relatively unexplored task. The complexity arises from a noteworthy data characteristic observed in our motivating dataset: non-constant yet correlated dispersion across outcomes. To address this challenge and account for possible measurement error, we propose a novel hierarchical quasi-binomial varying coefficient mixed model, which enables flexible dispersion patterns through a combination of additive and multiplicative dispersion components. To maximize the Laplace-approximated quasi-likelihood of our model, we further develop a specialized two-stage expectation-maximization (EM) algorithm, where a plug-in estimate for the multiplicative scale parameter enhances the speed and stability of the EM iterations. Simulations demonstrated that our approach yields accurate inference for smooth covariate effects and exhibits excellent power in detecting non-zero effects. Additionally, we applied our proposed method to investigate the association between DNA methylation, measured across the genome through targeted custom capture sequencing of whole blood, and levels of anti-citrullinated protein antibodies (ACPA), a preclinical marker for rheumatoid arthritis (RA) risk. Our analysis revealed 23 significant genes that potentially contribute to ACPA-related differential methylation, highlighting the relevance of cell signaling and collagen metabolism in RA. We implemented our method in the R Bioconductor package called "SOMNiBUS."


Subject(s)
Algorithms , Computer Simulation , DNA Methylation , Models, Statistical , Humans , Multivariate Analysis , Arthritis, Rheumatoid/genetics , Likelihood Functions , Sulfites/chemistry , Sequence Analysis, DNA/methods
15.
J Appl Stat ; 51(7): 1318-1343, 2024.
Article in English | MEDLINE | ID: mdl-38835830

ABSTRACT

Autoregressive models in time series are useful in various areas. In this article, we propose a skew-t autoregressive model. We estimate its parameters using the expectation-maximization (EM) method and develop an influence methodology, based on local perturbations, for its validation. We obtain the normal curvatures for four perturbation strategies to identify influential observations, and then assess their performance through Monte Carlo simulations. An example of financial data analysis is presented to study daily log-returns for Brent crude futures and investigate possible impacts of the COVID-19 pandemic.
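In generic form (order p, notation illustrative), such a model replaces the usual Gaussian innovations of an autoregression with skew-t ones:

y_t = \phi_0 + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t,
\qquad \varepsilon_t \stackrel{\text{iid}}{\sim} \mathrm{ST}(0, \sigma^2, \lambda, \nu),

where \lambda controls skewness and \nu controls tail heaviness; EM estimation in this setting typically exploits a hierarchical (scale-mixture) representation of the skew-t distribution.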

16.
Stat Med ; 43(19): 3578-3594, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-38881189

ABSTRACT

In health and clinical research, medical indices (eg, BMI) are commonly used for monitoring and/or predicting health outcomes of interest. While single-index modeling can be used to construct such indices, methods to use single-index models for analyzing longitudinal data with multiple correlated binary responses are underdeveloped, although there are abundant applications with such data (eg, prediction of multiple medical conditions based on longitudinally observed disease risk factors). This article aims to fill the gap by proposing a generalized single-index model that can incorporate multiple single indices and mixed effects for describing observed longitudinal data of multiple binary responses. Compared to the existing methods focusing on constructing marginal models for each response, the proposed method can make use of the correlation information in the observed data about different responses when estimating different single indices for predicting response variables. Estimation of the proposed model is achieved by using a local linear kernel smoothing procedure, together with methods designed specifically for estimating single-index models and traditional methods for estimating generalized linear mixed models. Numerical studies show that the proposed method is effective in various cases considered. It is also demonstrated using a dataset from the English Longitudinal Study of Aging project.
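One generic way to write a mixed-effects, multiple-single-index structure for binary responses (purely schematic; the paper's exact specification, link, and identifiability constraints may differ) is, for response m of subject i at visit j,

\operatorname{logit} P\big(Y_{ijm} = 1 \mid b_i\big) = g_m\!\big(X_{ij}^\top \beta_m\big) + Z_{ij}^\top b_i,
\qquad \|\beta_m\| = 1, \quad b_i \sim N(0, \Sigma),

where each g_m is an unknown smooth index function estimated by local linear kernel smoothing, and the shared random effects b_i induce correlation among the multiple binary responses.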


Subject(s)
Models, Statistical , Longitudinal Studies , Humans , Linear Models , Computer Simulation , Data Interpretation, Statistical
17.
Stat Med ; 43(19): 3723-3741, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-38890118

ABSTRACT

We consider the Bayesian estimation of the parameters of a finite mixture model from independent order statistics arising from imperfect ranked set sampling designs. As a cost-effective method, ranked set sampling enables us to incorporate easily attainable characteristics as ranking information into data collection and Bayesian estimation. To handle the special structure of the ranked set samples, we develop a Bayesian estimation approach that exploits the Expectation-Maximization (EM) algorithm to estimate the ranking parameters and Metropolis-within-Gibbs sampling to estimate the parameters of the underlying mixture model. Our findings show that the proposed RSS-based Bayesian estimation method outperforms its commonly used Bayesian counterpart based on simple random sampling. The developed method is finally applied to estimate the bone disorder status of women aged 50 and older.


Subject(s)
Algorithms , Bayes Theorem , Models, Statistical , Humans , Female , Middle Aged , Aged , Computer Simulation , Monte Carlo Method , Likelihood Functions , Markov Chains
18.
J Appl Stat ; 51(9): 1792-1817, 2024.
Article in English | MEDLINE | ID: mdl-38933142

ABSTRACT

Proportional data arise frequently in a wide variety of fields of study. Such data often exhibit extra variation such as over- or under-dispersion, sparseness, and zero inflation. For example, the hepatitis data present both sparseness and zero inflation: of 83 annual age groups, 19 contribute non-zero denominators of 5 or less and 36 have zero seropositives. The whitefly data consist of 640 observations with 339 zeros (53%), demonstrating extra zero inflation. The catheter management data involve excessive zeros, averaging over 60% zeros across outcomes for 193 urinary tract infections, 194 catheter blockages, and 193 catheter displacements. However, existing models cannot always address such features appropriately. In this paper, a new two-parameter probability distribution, called the Lindley-binomial (LB) distribution, is proposed to analyze proportional data with such features. The probabilistic properties of the distribution, such as its moments and moment generating function, are derived. The Fisher scoring and EM algorithms are presented for computing parameter estimates in the proposed LB regression model. Issues of goodness of fit for the LB model are discussed. A limited simulation study is also performed to evaluate the performance of the derived EM algorithms for parameter estimation in the model with and without covariates. The proposed model is illustrated using the three aforementioned proportional datasets.

19.
Entropy (Basel) ; 26(5)2024 May 15.
Article in English | MEDLINE | ID: mdl-38785671

ABSTRACT

Finite mixture of linear regression (FMLR) models are among the most commonly used statistical tools for dealing with heterogeneous data. In this paper, we introduce a new procedure to simultaneously determine the number of components and perform variable selection for the different regressions in FMLR models via an exponential power error distribution, which includes the normal and Laplace distributions as special cases. Under some regularity conditions, the consistency of order selection and the consistency of variable selection are established, and the asymptotic normality of the estimators of the non-zero parameters is investigated. In addition, an efficient modified expectation-maximization (EM) algorithm and a majorization-maximization (MM) algorithm are proposed to solve the associated optimization problem. Furthermore, we use numerical simulations to demonstrate the finite-sample performance of the proposed methodology. Finally, we apply the proposed approach to analyze a baseball salary data set. Results indicate that our proposed method obtains a smaller BIC value than the existing method.
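For reference, the exponential power (generalized Gaussian) error density referred to above, which reduces to a normal density at shape p = 2 and a Laplace density at p = 1, can be written as

f(\varepsilon; \sigma, p) = \frac{p}{2\sigma\, \Gamma(1/p)} \exp\!\left\{ -\left( \frac{|\varepsilon|}{\sigma} \right)^{p} \right\},

so the component densities in the FMLR model move between lighter- and heavier-tailed error behaviour as p varies.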

20.
Multivariate Behav Res ; 59(4): 801-817, 2024.
Article in English | MEDLINE | ID: mdl-38784986

ABSTRACT

Networks consist of interconnected units, known as nodes, and allow interactions within a system to be described formally. Specifically, bipartite networks depict relationships between two distinct sets of nodes, designated as sending and receiving nodes. An integral aspect of bipartite network analysis is the identification of clusters of nodes with similar behaviors. The computational complexity of models for large bipartite networks poses a challenge. To mitigate this challenge, we employ a Mixture of Latent Trait Analyzers (MLTA) for node clustering. Our approach extends the MLTA to include covariates and introduces a double EM algorithm for estimation. Applying our method to COVID-19 data, with sending nodes representing patients and receiving nodes representing preventive measures, enables dimensionality reduction and the identification of meaningful groups. We present simulation results demonstrating the accuracy of the proposed method.


Subject(s)
Algorithms , COVID-19 , Models, Statistical , Humans , Computer Simulation , Cluster Analysis , SARS-CoV-2