Pesquisa | Portal Regional da BVS

1.

Estimating linear mixed effect models with non-normal random effects through saddlepoint approximation and its application in retail pricing analytics.

Chen, Hao; Han, Lanshan; Lim, Alvin.

J Appl Stat ; 51(11): 2116-2138, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-39157268

RESUMO

Linear Mixed Effects (LME) models are powerful statistical tools that have been employed in many different real-world applications such as retail data analytics, marketing measurement, and medical research. Statistical inference is often conducted via maximum likelihood estimation with Normality assumptions on the random effects. Nevertheless, for many applications in the retail industry, it is often necessary to consider non-Normal distributions on the random effects when considering the unknown parameters' business interpretations. Motivated by this need, a linear mixed effects model with possibly non-Normal distribution is studied in this research. We propose a general estimating framework based on a saddlepoint approximation (SA) of the probability density function of the dependent variable, which leads to constrained nonlinear optimization problems. The classical LME model with Normality assumption can then be viewed as a special case under the proposed general SA framework. Compared with the existing approach, the proposed method enhances the real-world interpretability of the estimates with satisfactory model fits.

2.

Predicting dichotomised outcomes from high-dimensional data in biomedicine.

Rauschenberger, Armin; Glaab, Enrico.

J Appl Stat ; 51(9): 1756-1771, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38933137

RESUMO

In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical to binary outcomes (loss of information) to conduct logistic regression (probabilistic interpretation). We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).

3.

A generalized Gompertz promotion time cure model and its fitness to cancer data.

Tahira, Ayesha; Danish, Muhammad Yameen.

Heliyon ; 10(11): e32038, 2024 Jun 15.

Artigo em Inglês | MEDLINE | ID: mdl-38912437

RESUMO

The cure models based on standard distributions like exponential, Weibull, lognormal, Gompertz, gamma, are often used to analyze survival data from cancer clinical trials with long-term survivors. Sometimes, the data is simple, and the standard cure models fit them very well, however, most often the data are complex and the standard cure models don't fit them reasonably well. In this article, we offer a novel generalized Gompertz promotion time cure model and illustrate its fitness to gastric cancer data by three different methods. The generalized Gompertz distribution is as simple as the generalized Weibull distribution and is not computationally as intensive as the generalized F distribution. One detailed real data application is provided for illustration and comparison purposes.

4.

Robust explicit estimators using the power-weighted repeated medians.

Park, Chanseok; Gao, Xuehong; Wang, Min.

J Appl Stat ; 51(8): 1590-1608, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38863800

RESUMO

This paper consists of two parts. The first part of the paper is to propose an explicit robust estimation method for the regression coefficients in simple linear regression based on the power-weighted repeated medians technique that has a tuning constant for dealing with the trade-offs between efficiency and robustness. We then investigate the lower and upper bounds of the finite-sample breakdown point of the proposed method. The second part of the paper is to show that based on the linearization of the cumulative distribution function, the proposed method can be applied to obtain robust parameter estimators for the Weibull and Birnbaum-Saunders distributions that are commonly used in both reliability and survival analysis. Numerical studies demonstrate that the proposed method performs well in a manner that is approximately comparable with the ordinary least squares method, whereas it is far superior in the presence of data contamination that occurs frequently in practice.

5.

Modeling the spatial patterns of antenatal care utilization in Nigeria with inference based on Pólya-Gamma mixtures.

Egbon, Osafu Augustine; Gayawan, Ezra.

J Appl Stat ; 51(5): 866-890, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38524798

RESUMO

Despite the vast advantages of making antenatal care visits, the service utilization among pregnant women in Nigeria is suboptimal. A five-year monitoring estimate indicated that about 24% of the women who had live births made no visit. The non-utilization induced excessive zeroes in the outcome of interest. Thus, this study adopted a zero-inflated negative binomial model within a Bayesian framework to identify the spatial pattern and the key factors hindering antenatal care utilization in Nigeria. We overcome the intractability associated with posterior inference by adopting a Pólya-Gamma data-augmentation technique to facilitate inference. The Gibbs sampling algorithm was used to draw samples from the joint posterior distribution. Results revealed that type of place of residence, maternal level of education, access to mass media, household work index, and woman's working status have significant effects on the use of antenatal care services. Findings identified substantial state-level spatial disparity in antenatal care utilization across the country. Cost-effective techniques to achieve an acceptable frequency of utilization include the creation of a community-specific awareness to emphasize the importance and benefits of the appropriate utilization. Special consideration should be given to older pregnant women, women in poor antenatal utilization states, and women residing in poor road network regions.

6.

Unit upper truncated Weibull distribution with extension to 0 and 1 inflated model - Theory and applications.

Okorie, Idika E; Afuecheta, Emmanuel; Bakouch, Hassan S.

Heliyon ; 9(11): e22260, 2023 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-38058617

RESUMO

A two-parameter unit distribution and its regression model plus its extension to 0 and 1 inflation is introduced and studied. The distribution is called the unit upper truncated Weibull (UUTW) distribution, while the inflated variant is called the 0-1 inflated unit upper truncated Weibull (ZOIUUTW) distribution. The UUTW distribution has an increasing and a J-shaped hazard rate function. The parameters of the proposed models are estimated by the method of maximum likelihood estimation. For the UUTW distribution, two practical examples involving household expenditure and maximum flood level data are used to show its flexibility and the proposed distribution demonstrates better fit tendencies than some of the competing unit distributions. Application of the proposed regression model demonstrates adequate capability in describing the real data set with better modeling proficiency than the existing competing models. Then, for the ZOIUUTW distribution, the CD34+ data involving cancer patients are analyzed to show the flexibility of the model in characterizing inflation at both endpoints of the unit interval.

7.

Uncovering Alterations in Cancer Epigenetics via Trans-Dimensional Markov Chain Monte Carlo and Hidden Markov Models.

Shokoohi, Farhad; Khaniki, Saeedeh Hajebi.

bioRxiv ; 2023 Jun 15.

Artigo em Inglês | MEDLINE | ID: mdl-37398181

RESUMO

Epigenetic alterations are key drivers in the development and progression of cancer. Identifying differentially methylated cytosines (DMCs) in cancer samples is a crucial step toward understanding these changes. In this paper, we propose a trans-dimensional Markov chain Monte Carlo (TMCMC) approach that uses hidden Markov models (HMMs) with binomial emission, and bisulfite sequencing (BS-Seq) data, called DMCTHM, to identify DMCs in cancer epigenetic studies. We introduce the Expander-Collider penalty to tackle under and over-estimation in TMCMC-HMMs. We address all known challenges inherent in BS-Seq data by introducing novel approaches for capturing functional patterns and autocorrelation structure of the data, as well as for handling missing values, multiple covariates, multiple comparisons, and family-wise errors. We demonstrate the effectiveness of DMCTHM through comprehensive simulation studies. The results show that our proposed method outperforms other competing methods in identifying DMCs. Notably, with DMCTHM, we uncovered new DMCs and genes in Colorectal cancer that were significantly enriched in the Tp53 pathway.

8.

The re-parameterized inverse Gaussian regression to model length of stay of COVID-19 patients in the public health care system of Piracicaba, Brazil.

Hashimoto, E M; Ortega, E M M; Cordeiro, G M; Cancho, V G; Silva, I.

J Appl Stat ; 50(8): 1665-1685, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37260477

RESUMO

Among the models applied to analyze survival data, a standout is the inverse Gaussian distribution, which belongs to the class of models to analyze positive asymmetric data. However, the variance of this distribution depends on two parameters, which prevents establishing a functional relation with a linear predictor when the assumption of constant variance does not hold. In this context, the aim of this paper is to re-parameterize the inverse Gaussian distribution to enable establishing an association between a linear predictor and the variance. We propose deviance residuals to verify the model assumptions. Some simulations indicate that the distribution of these residuals approaches the standard normal distribution and the mean squared errors of the estimators are small for large samples. Further, we fit the new model to hospitalization times of COVID-19 patients in Piracicaba (Brazil) which indicates that men spend more time hospitalized than women, and this pattern is more pronounced for individuals older than 60 years. The re-parameterized inverse Gaussian model proved to be a good alternative to analyze censored data with non-constant variance.

9.

LIC criterion for optimal subset selection in distributed interval estimation.

Guo, Guangbao; Sun, Yue; Qian, Guoqi; Wang, Qian.

J Appl Stat ; 50(9): 1900-1920, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37378273

RESUMO

Distributed interval estimation in linear regression may be computationally infeasible in the presence of big data that are normally stored in different computer servers or in cloud. The existing challenge represents the results from the distributed estimation may still contain redundant information about the population characteristics of the data. To tackle this computing challenge, we develop an optimization procedure to select the best subset from the collection of data subsets, based on which we perform interval estimation in the context of linear regression. The procedure is derived based on minimizing the length of the final interval estimator and maximizing the information remained in the selected data subset, thus is named as the LIC criterion. Theoretical performance of the LIC criterion is studied in this paper together with a simulation study and real data analysis.

10.

Predicting and explaining absenteeism risk in hospital patients before and during COVID-19.

Borges, Ana; Carvalho, Mariana; Maia, Miguel; Guimarães, Miguel; Carneiro, Davide.

Socioecon Plann Sci ; 87: 101549, 2023 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-37255583

RESUMO

In order to address one of the most challenging problems in hospital management - patients' absenteeism without prior notice - this study analyses the risk factors associated with this event. To this end, through real data from a hospital located in the North of Portugal, a prediction model previously validated in the literature is used to infer absenteeism risk factors, and an explainable model is proposed, based on a modified CART algorithm. The latter intends to generate a human-interpretable explanation for patient absenteeism, and its implementation is described in detail. Furthermore, given the significant impact, the COVID-19 pandemic had on hospital management, a comparison between patients' profiles upon absenteeism before and during the COVID-19 pandemic situation is performed. Results obtained differ between hospital specialities and time periods meaning that patient profiles on absenteeism change during pandemic periods and within specialities.

11.

Next-Generation Sequencing Data-Based Association Testing of a Group of Genetic Markers for Complex Responses Using a Generalized Linear Model Framework.

Xu, Zheng; Yan, Song; Wu, Cong; Duan, Qing; Chen, Sixia; Li, Yun.

Mathematics (Basel) ; 11(11)2023 Jun 01.

Artigo em Inglês | MEDLINE | ID: mdl-38721066

RESUMO

Association testing has been widely used to study the relationship between genetic variants and phenotypes. Most association testing methods are genotype-based, i.e. first estimate genotype and then regress phenotype on estimated genotype and other variables. Directly testing methods based on next generation sequencing (NGS) data without genotype calling have been proposed and shown advantage over genotype-based methods in the scenarios when genotype calling is not accurate. NGS data-based single-variant testing have been proposed including our previously proposed single-variant testing method, i.e. UNC combo method [1]. NGS data-based group testing methods for continuous phenotype have also been proposed by us using a linear model framework which can handle continuous responses [2]. In this paper, we extend our linear model-based framework to a generalized linear model-based framework so that the methods can handle other types of responses especially binary responses which is commonly-faced in association studies. We have conducted extensive simulation studies to evaluate the performance of different estimators and compare our estimators with their corresponding genotype-based methods. We found that all methods have Type I errors controlled, and our NGS data-based testing methods have better performance than their corresponding genotype-based methods in the literature for other types of responses including binary responses (logistic regression) and count responses (Poisson regression especially when sequencing depth is low. In conclusion, we have extended our previous linear model (LM) framework to a generalized linear model (GLM) framework and derived NGS data-based testing methods for a group of genetic variants. Compared with our previously proposed LM-based methods [2], the new GLM-based methods can handle more complex responses (for example, binary responses and count responses) in addition to continuous responses. Our methods have filled the literature gap and shown advantage over their corresponding genotype-based methods in the literature.

12.

The more data, the better? Demystifying deletion-based methods in linear regression with missing data.

Xu, Tianchen; Chen, Kun; Li, Gen.

Stat Interface ; 15(4): 515-526, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36540373

RESUMO

We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased under missing completely at random (MCAR) and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further conduct simulation studies to corroborate the findings and demystify what has been missed or misinterpreted in the literature. Some detailed proofs and simulation results are available in the online supplemental materials.

13.

Regression models using the LINEX loss to predict lower bounds for the number of points for approximating planar contour shapes.

Jayasinghe, J M Thilini; Ellingson, Leif; Prematilake, Chalani.

J Appl Stat ; 49(16): 4294-4313, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36353295

RESUMO

Researchers in statistical shape analysis often analyze outlines of objects. Even though these contours are infinite-dimensional in theory, they must be discretized in practice. When discretizing, it is important to reduce the number of sampling points considerably to reduce computational costs, but to not use too few points so as to result in too much approximation error. Unfortunately, determining the minimum number of points needed to achieve sufficiently approximate the contours is computationally expensive. In this paper, we fit regression models to predict these lower bounds using characteristics of the contours that are computationally cheap as predictor variables. However, least squares regression is inadequate for this task because it treats overestimation and underestimation equally, but underestimation of lower bounds is far more serious. Instead, to fit the models, we use the LINEX loss function, which allows us to penalize underestimation at an exponential rate while penalizing overestimation only linearly. We present a novel approach to select the shape parameter of the loss function and tools for analyzing how well the model fits the data. Through validation methods, we show that the LINEX models work well for reducing the underestimation for the lower bounds.

14.

Doubly multivariate linear models with block exchangeable distributed errors and site-dependent covariates.

Opheim, Timothy; Roy, Anuradha.

J Appl Stat ; 49(14): 3659-3676, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36246862

RESUMO

The problem of testing the intercept and slope parameters of doubly multivariate linear models with site-dependent covariates using Rao's score test (RST) is studied. The RST statistic is developed for a block exchangeable covariance structure on the error vector under the assumption of multivariate normality. We compare our developed RST statistic with the likelihood ratio test (LRT) statistic. Monte Carlo simulations indicate that the RST statistic is much more accurate than its counterpart LRT statistic and it takes significantly less computation time than the LRT statistic. The proposed method is illustrated with an example of multiple response variables measured on multiple trees in a single plot in an agricultural study.

15.

A robust and efficient variable selection method for linear regression.

Yang, Zhuoran; Fu, Liya; Wang, You-Gan; Dong, Zhixiong; Jiang, Yunlu.

J Appl Stat ; 49(14): 3677-3692, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36246863

RESUMO

Variable selection is fundamental to high dimensional statistical modeling, and many approaches have been proposed. However, existing variable selection methods do not perform well in presence of outliers in response variable or/and covariates. In order to ensure a high probability of correct selection and efficient parameter estimation, we investigate a robust variable selection method based on a modified Huber's function with an exponential squared loss tail. We also prove that the proposed method has oracle properties. Furthermore, we carry out simulation studies to evaluate the performance of the proposed method for both pn. Our simulation results indicate that the proposed method is efficient and robust against outliers and heavy-tailed distributions. Finally, a real dataset from an air pollution mortality study is used to illustrate the proposed method.

16.

CANONICAL THRESHOLDING FOR NON-SPARSE HIGH-DIMENSIONAL LINEAR REGRESSION.

Silin, Igor; Fan, Jianqing.

Ann Stat ; 50(1): 460-486, 2022 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-36148472

RESUMO

We consider a high-dimensional linear regression problem. Unlike many papers on the topic, we do not require sparsity of the regression coefficients; instead, our main structural assumption is a decay of eigenvalues of the covariance matrix of the data. We propose a new family of estimators, called the canonical thresholding estimators, which pick largest regression coefficients in the canonical form. The estimators admit an explicit form and can be linked to LASSO and Principal Component Regression (PCR). A theoretical analysis for both fixed design and random design settings is provided. Obtained bounds on the mean squared error and the prediction error of a specific estimator from the family allow to clearly state sufficient conditions on the decay of eigenvalues to ensure convergence. In addition, we promote the use of the relative errors, strongly linked with the out-of-sample R 2. The study of these relative errors leads to a new concept of joint effective dimension, which incorporates the covariance of the data and the regression coefficients simultaneously, and describes the complexity of a linear regression problem. Some minimax lower bounds are established to showcase the optimality of our procedure. Numerical simulations confirm good performance of the proposed estimators compared to the previously developed methods.

17.

SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.

Hastie, Trevor; Montanari, Andrea; Rosset, Saharon; Tibshirani, Ryan J.

Ann Stat ; 50(2): 949-986, 2022 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-36120512

RESUMO

Interpolators-estimators that achieve zero training error-have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum â 2 norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x i ∈ â p are obtained by applying a linear transform to a vector of i.i.d. entries, x i = Σ1/2 z i (with z i ∈ â p ); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, xi = φ(Wz i ) (with z i ∈ â d , W ∈ â p × d a matrix of i.i.d. entries, and φ an activation function acting componentwise on Wz i ). We recover-in a precise quantitative way-several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.

18.

Migration and students' performance: detecting geographical differences following a curves clustering approach.

Boscaino, Giovanni; Sottile, Gianluca; Adelfio, Giada.

J Appl Stat ; 49(4): 1018-1032, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-35707809

RESUMO

Students' migration mobility is the new form of migration: students migrate to improve their skills and become more valued for the job market. The data regard the migration of Italian Bachelors who enrolled at Master Degree level, moving typically from poor to rich areas. This paper investigates the migration and other possible determinants on the Master Degree students' performance. The Clustering of Effects approach for Quantile Regression Coefficients Modelling has been used to cluster the effects of some variables on the students' performance for three Italian macro-areas. Results show evidence of similarity between Southern and Centre students, with respect to the Northern ones.

19.

Spike-and-slab type variable selection in the Cox proportional hazards model for high-dimensional features.

Wu, Ryan; Ahn, Mihye; Yang, Hojin.

J Appl Stat ; 49(9): 2189-2207, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-35755095

RESUMO

In this paper, we develop a variable selection framework with the spike-and-slab prior distribution via the hazard function of the Cox model. Specifically, we consider the transformation of the score and information functions for the partial likelihood function evaluated at the given data from the parameter space into the space generated by the logarithm of the hazard ratio. Thereby, we reduce the nonlinear complexity of the estimation equation for the Cox model and allow the utilization of a wider variety of stable variable selection methods. Then, we use a stochastic variable search Gibbs sampling approach via the spike-and-slab prior distribution to obtain the sparsity structure of the covariates associated with the survival outcome. Additionally, we conduct numerical simulations to evaluate the finite-sample performance of our proposed method. Finally, we apply this novel framework on lung adenocarcinoma data to find important genes associated with decreased survival in subjects with the disease.

20.

Variable selection for partially linear models via Bayesian subset modeling with diffusing prior.

Wang, Jia; Cai, Xizhen; Li, Runze.

J Multivar Anal ; 1832021 May.

Artigo em Inglês | MEDLINE | ID: mdl-33867594

RESUMO

Most existing methods of variable selection in partially linear models (PLM) with ultrahigh dimensional covariates are based on partial residuals, which involve a two-step estimation procedure. While the estimation error produced in the first step may have an impact on the second step, multicollinearity among predictors adds additional challenges in the model selection procedure. In this paper, we propose a new Bayesian variable selection approach for PLM. This new proposal addresses those two issues simultaneously as (1) it is a one-step method which selects variables in PLM, even when the dimension of covariates increases at an exponential rate with the sample size, and (2) the method retains model selection consistency, and outperforms existing ones in the setting of highly correlated predictors. Distinguished from existing ones, our proposed procedure employs the difference-based method to reduce the impact from the estimation of the nonparametric component, and incorporates Bayesian subset modeling with diffusing prior (BSM-DP) to shrink the corresponding estimator in the linear component. The estimation is implemented by Gibbs sampling, and we prove that the posterior probability of the true model being selected converges to one asymptotically. Simulation studies support the theory and the efficiency of our methods as compared to other existing ones, followed by an application in a study of supermarket data.

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA