ABSTRACT
We consider the question of variable selection in linear regressions, in the sense of identifying the correct direct predictors (those variables that have nonzero coefficients given all candidate predictors). Best subset selection (BSS) is often considered the "gold standard," with its use being restricted only by its NP-hard nature. Alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become methods of choice in high-dimensional settings. A recent proposal represents BSS as a mixed-integer optimization problem, so that large problems have become computationally feasible. We present an extensive neutral comparison assessing the ability of BSS to select the correct direct predictors compared to forward stepwise selection (FSS), Lasso, and Enet. The simulation considers a range of settings that are challenging with regard to dimensionality (number of observations and variables), signal-to-noise ratios, and correlations between predictors. As a fair measure of performance, we primarily used the best possible F1-score for each method, and results were confirmed by alternative performance measures and practical criteria for choosing the tuning parameters and subset sizes. Surprisingly, it was only in settings where the signal-to-noise ratio was high and the variables were uncorrelated that BSS reliably outperformed the other methods, even in low-dimensional settings. Furthermore, FSS performed almost identically to BSS. Our results shed new light on the usual presumption of BSS being, in principle, the best choice for selecting the correct direct predictors. Especially for correlated variables, alternatives like Enet are faster and appear to perform better in practical settings.
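A minimal sketch of the headline performance criterion, assuming scikit-learn and synthetic data with a known true support: the best possible F1-score for support recovery is taken as the maximum F1 over all points on a Lasso regularization path (the same idea applies to Enet or to a sequence of subset sizes). This illustrates the measure only; it is not the paper's simulation code.

```python
# Sketch (not the paper's simulation code): best-possible F1 for support
# recovery along a Lasso path, with a known true support on synthetic data.
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, p, k = 100, 30, 5                      # observations, candidates, true predictors
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 1.0                            # true direct predictors
y = X @ beta + rng.standard_normal(n)     # moderate signal-to-noise ratio

alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
truth = beta != 0
# Best possible F1: evaluate the selected support at every point on the path
# and keep the maximum, mirroring the "best possible F1-score" criterion.
best_f1 = max(f1_score(truth, coefs[:, j] != 0, zero_division=0)
              for j in range(coefs.shape[1]))
print(f"best possible F1 along the Lasso path: {best_f1:.2f}")
```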
Subject(s)
Linear Models, Computer Simulation
ABSTRACT
The fit of a regression model to new data is often worse due to overfitting. Analysts use variable selection techniques to develop parsimonious regression models, which may introduce bias into regression estimates. Shrinkage methods have been proposed to mitigate overfitting and reduce bias in estimates. Post-estimation shrinkage is an alternative to penalized methods. This study evaluates the effectiveness of post-estimation shrinkage in improving the prediction performance of full and selected models. Through a simulation study, results were compared with ordinary least squares (OLS) and ridge in full models, and with best subset selection (BSS) and lasso in selected models. We focused on prediction errors and the number of selected variables. Additionally, we proposed a modified version of the parameter-wise shrinkage (PWS) approach named non-negative PWS (NPWS) to address weaknesses of PWS. Results showed that no method was superior in all scenarios. In full models, NPWS outperformed global shrinkage, whereas PWS was inferior to OLS. Under low correlation with moderate-to-high signal-to-noise ratio (SNR), NPWS outperformed ridge, but ridge performed best with small sample sizes, high correlation, and low SNR. In selected models, all post-estimation shrinkage methods performed similarly, with global shrinkage slightly inferior. Lasso outperformed BSS and post-estimation shrinkage with small sample sizes, low SNR, and high correlation, but was inferior when the opposite was true. Our study suggests that, with sufficient information, NPWS is more effective than global shrinkage in improving the prediction accuracy of models. However, with high correlation, small sample sizes, and low SNR, penalized methods generally outperform post-estimation shrinkage methods.
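As a rough illustration of post-estimation shrinkage, the sketch below estimates a global shrinkage factor for an OLS full model by regressing the outcome on cross-validated linear predictors and then applying a single factor to all coefficients; it assumes scikit-learn and synthetic data, and it does not reproduce the PWS or NPWS variants studied here.

```python
# Sketch of a cross-validated global (linear-predictor) shrinkage factor for an
# OLS full model; the PWS/NPWS variants in the study are not reproduced here.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, p = 80, 10
X = rng.standard_normal((n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(n)

cv_lp = np.empty(n)                       # out-of-fold linear predictors
for train, test in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    fit = LinearRegression().fit(X[train], y[train])
    cv_lp[test] = fit.predict(X[test])

# Global shrinkage factor: slope of y regressed on the cross-validated
# linear predictor; values below 1 indicate overfitting of the full model.
c = LinearRegression().fit(cv_lp.reshape(-1, 1), y).coef_[0]
full = LinearRegression().fit(X, y)
shrunken_coefs = c * full.coef_           # one factor applied to all coefficients
print(f"global shrinkage factor: {c:.3f}")
```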
Subject(s)
Biometry, Biometry/methods, Linear Models, Humans
ABSTRACT
Best-subset selection aims to find a small subset of predictors such that the resulting linear model is expected to have the most desirable prediction accuracy. It is not only important in regression analysis but also has far-reaching applications in every facet of research, including computer science and medicine. We introduce a polynomial-time algorithm which, under mild conditions, solves this problem. The algorithm exploits the idea of sequencing and splicing to reach a stable solution in finite steps when the sparsity level of the model is fixed but unknown. We define an information criterion that helps the algorithm select the true sparsity level with high probability. We show that when the algorithm produces a stable optimal solution, that solution is the oracle estimator of the true parameters with probability one. We also demonstrate the power of the algorithm in several numerical studies.
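A toy numpy sketch of the splicing idea at a fixed sparsity level: exchange the weakest active variable with the most promising inactive one whenever the residual sum of squares decreases, and stop when no exchange helps. It is meant only to illustrate the exchange step, not the paper's algorithm or its information criterion.

```python
# Toy sketch of a single-variable "splicing" exchange for best-subset selection
# at a fixed sparsity k: swap the weakest active variable with the most
# promising inactive one whenever the residual sum of squares decreases.
import numpy as np

def rss_and_coef(X, y, active):
    coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    resid = y - X[:, active] @ coef
    return resid @ resid, coef, resid

def splice(X, y, k, max_iter=50):
    p = X.shape[1]
    active = list(np.argsort(-np.abs(X.T @ y))[:k])     # screening initialisation
    rss, coef, resid = rss_and_coef(X, y, active)
    for _ in range(max_iter):
        weakest = active[int(np.argmin(np.abs(coef)))]   # backward sacrifice proxy
        inactive = [j for j in range(p) if j not in active]
        best_in = inactive[int(np.argmax(np.abs(X[:, inactive].T @ resid)))]  # forward proxy
        candidate = [j for j in active if j != weakest] + [best_in]
        new_rss, new_coef, new_resid = rss_and_coef(X, y, candidate)
        if new_rss < rss:                                # accept improving exchange
            active, rss, coef, resid = candidate, new_rss, new_coef, new_resid
        else:
            break                                        # stable: no exchange helps
    return sorted(active)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 20))
y = X[:, [0, 3, 7]] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(200)
print(splice(X, y, k=3))                                 # typically recovers [0, 3, 7]
```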
Subject(s)
Machine Learning, Statistical Models
ABSTRACT
The rate and extent of biodegradation of petroleum hydrocarbons in different aquatic environments is an important question to address. Biodegradation is thought to be the major avenue for removing petroleum hydrocarbons from the environment. The present study involves the development of predictive quantitative structure-property relationship (QSPR) models for the primary biodegradation half-life of petroleum hydrocarbons that may be used to forecast the biodegradation half-life of untested petroleum hydrocarbons within the established models' applicability domain. These models use easily computable two-dimensional (2D) descriptors to investigate the important structural characteristics needed for the biodegradation of petroleum hydrocarbons in freshwater (dataset 1), temperate seawater (dataset 2), and arctic seawater (dataset 3). All the developed models follow OECD guidelines. We used double cross-validation, best subset selection, and partial least squares tools for model development. In addition, the small dataset modeler tool was successfully used for the dataset with very few compounds (dataset 3, with 17 compounds), where dataset division was not possible. The resultant models are robust, predictive, and mechanistically interpretable based on both internal and external validation metrics (R2 range of 0.605-0.959, Q2(LOO) range of 0.509-0.904, and Q2F1 range of 0.526-0.959). The intelligent consensus predictor tool was used to improve the prediction quality for test set compounds, which provided superior outcomes to those from the individual partial least squares models based on several metrics (Q2F1 = 0.808 and Q2F2 = 0.805 for dataset 1 in freshwater). Molecular size and the hydrophilic factor for freshwater, the frequency of two carbon atoms at topological distance 4 for temperate seawater, and the electronegative atom count relative to size for arctic seawater were found to be the most significant descriptors regulating the biodegradation half-life of petroleum hydrocarbons.
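A small sketch of the modelling backbone, assuming scikit-learn and synthetic descriptor data: a partial least squares fit scored with a leave-one-out Q2 of the kind reported above (PRESS relative to the total sum of squares). The double cross-validation, dataset division, and consensus prediction steps are not reproduced.

```python
# Sketch of a partial least squares QSPR fit with leave-one-out Q2, assuming a
# descriptor matrix X and measured half-lives y (synthetic, not the paper's data).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(3)
n, n_desc = 40, 8                       # small dataset, 2D descriptors
X = rng.standard_normal((n, n_desc))
y = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * rng.standard_normal(n)

pls = PLSRegression(n_components=2)
y_loo = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()

press = np.sum((y - y_loo) ** 2)        # predictive residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)
q2_loo = 1 - press / ss_tot             # Q2(LOO) as reported in the abstract
print(f"Q2(LOO) = {q2_loo:.3f}")
```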
Subject(s)
Petroleum Pollution, Petroleum, Petroleum/metabolism, Hydrocarbons/chemistry, Seawater/chemistry, Environmental Biodegradation, Quantitative Structure-Activity Relationship
ABSTRACT
Excessive soil salt content (SSC) seriously affects crop growth and economic benefits in agricultural production areas. Prior research has mainly focused on estimating the salinity in the top bare soil rather than in the deeper soil that is vital to crop growth. To this end, an experiment was carried out in the Hetao Irrigation District, Inner Mongolia, China. In the experiment, the SSC at different depths under vegetation was measured, and Sentinel-1 radar images were obtained synchronously. The radar backscattering coefficients (VV and VH) were combined to construct multiple indices, whose sensitivity was then analyzed using best subset selection (BSS). Meanwhile, four commonly used algorithms, partial least squares regression (PLSR), quantile regression (QR), support vector machine (SVM), and extreme learning machine (ELM), were used to construct estimation models of salinity at depths of 0-10, 10-20, 0-20, 20-40, 0-40, 40-60 and 0-60 cm before and after BSS, respectively. The results showed that: (a) radar remote sensing can be used to estimate the salinity in the root zone of vegetation (0-30 cm); (b) after BSS, the correlation coefficients and estimation accuracy of the four monitoring models were all improved significantly; (c) the estimation accuracy of the four regression models ranked as SVM > QR > ELM > PLSR; and (d) among the seven sampling depths, 10-20 cm was the optimal inversion depth for all four models, followed by 20-40 and 0-40 cm. Among the four models, SVM was more accurate than the other three at 10-20 cm (RP2 = 0.67, RMSEP = 0.12%). These findings can provide valuable guidance for soil salinity monitoring and agricultural production in arid and semi-arid areas under vegetation.
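The sketch below illustrates the general workflow with stand-in data, assuming scikit-learn: simple indices are built from simulated VV and VH backscatter and a support vector regression is scored by cross-validation. The index set and simulated values are hypothetical, not the study's feature set or measurements.

```python
# Sketch of combining Sentinel-1 VV/VH backscatter into simple indices and
# fitting a support vector regression for soil salt content; data are simulated.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(4)
n = 120
vv = -12 + 3 * rng.standard_normal(n)            # VV backscatter (dB), simulated
vh = -18 + 3 * rng.standard_normal(n)            # VH backscatter (dB), simulated
ssc = 0.05 * vv - 0.03 * vh + 0.1 * rng.standard_normal(n) + 1.0  # salt content (%)

# Candidate indices built from the two polarisations (ratio, difference, sum, product)
X = np.column_stack([vv, vh, vv - vh, vv + vh, vv / vh, vv * vh])

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
r2 = cross_val_score(model, X, ssc, cv=5, scoring="r2")
print(f"cross-validated R2: {r2.mean():.2f}")
```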
Subject(s)
Remote Sensing Technology, Soil, Remote Sensing Technology/methods, Radar, Sodium Chloride, Dietary Sodium Chloride
ABSTRACT
(1) Background: Predicting chronic low back pain (LBP) is of clinical and economic interest, as LBP leads to disabilities and health service utilization. This study aims to build a competitive and interpretable prediction model; (2) Methods: We used clinical and claims data of 3837 participants of a population-based cohort study to predict future LBP consultations (ICD-10: M40.XX-M54.XX). Best subset selection (BSS) was applied in repeated random samples of training data (75% of the data); scoring rules were used to identify the best subset of predictors. The prediction accuracy of BSS was compared to random forest and support vector machines (SVM) in the validation data (25% of the data); (3) Results: The best subset comprised 16 out of 32 predictors. Previous occurrence of LBP increased the odds for future LBP consultations (odds ratio (OR) 6.91 [5.05; 9.45]), while concomitant diseases reduced the odds (1 vs. 0, OR: 0.74 [0.57; 0.98]; >1 vs. 0: 0.37 [0.21; 0.67]). The area under the curve (AUC) of BSS was acceptable (0.78 [0.74; 0.82]) and comparable with SVM (0.78 [0.74; 0.82]) and random forest (0.79 [0.75; 0.83]); (4) Conclusions: Regarding prediction accuracy, BSS can be considered competitive with established machine-learning approaches. Nonetheless, considerable misclassification is inherent, and further refinements are required to improve predictions.
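A hedged sketch of the model comparison, assuming scikit-learn and synthetic data in place of the cohort's claims data: a plain logistic fit stands in for the best-subset model (the exhaustive search over 32 predictors is omitted), and all models are compared by AUC on a 25% hold-out split, mirroring the study design.

```python
# Sketch comparing a logistic model with random forest and SVM by ROC AUC on a
# held-out 25% split; synthetic data stand in for the cohort's claims data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=3837, n_features=32, n_informative=16,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=5)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=5),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=5)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")
```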
Subject(s)
Low Back Pain, Physicians, Cohort Studies, Humans, Low Back Pain/epidemiology, Machine Learning, Referral and Consultation
ABSTRACT
BACKGROUND: Frailty is a syndrome that diminishes the potential for functional recovery in liver cirrhosis (LC). However, its utility is limited by sole reliance on physical performance, especially in hospitalized patients. We investigated the predictive value of a modified self-reported Frailty Index in cirrhotics and identified which health deficits play more important roles. METHODS: Consecutive LC patients were assessed with our frailty scale. Outcomes of interest were 90-day, 1-year, and 2-year mortality. Independent predictors were identified by multivariate Cox regression. Receiver operating characteristic (ROC) curve analysis was performed to evaluate discriminative ability. We used a combination of stepwise selection, best subset selection, and the Akaike information criterion (AIC) to identify pivotal frailty components. RESULTS: The study cohort consisted of 158 patients, of whom 37 died during follow-up. Compared with the non-frail group, the frail group had higher 1- and 2-year mortality. The areas under the ROC curve were 0.66 for the Child-Turcotte-Pugh classification (CTP), 0.68 for the Frailty Index, and 0.72 for CTP + Frailty Index (P=0.034). The optimal predictors comprised instrumental activities of daily living (IADL) limitation, falls, and loss of weight (AIC = 170, C-statistic = 0.67). CONCLUSIONS: Incorporating the Frailty Index is a plausible way to improve prognostication in cirrhotics. IADL limitation, falls, and loss of weight play the most crucial roles in determining mortality.
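A sketch of an AIC-guided subset search over candidate frailty components in a Cox model, assuming the lifelines package and a toy survival data frame; the column names and simulated times are illustrative, not the study's variables.

```python
# Sketch of an AIC-guided best-subset search over candidate frailty components
# in a Cox model; the data frame below is a toy stand-in for the study cohort.
from itertools import combinations

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n = 158
df = pd.DataFrame({
    "iadl_limitation": rng.integers(0, 2, n),   # hypothetical deficit indicators
    "falls": rng.integers(0, 2, n),
    "weight_loss": rng.integers(0, 2, n),
    "fatigue": rng.integers(0, 2, n),
})
risk = 0.8 * df["iadl_limitation"] + 0.6 * df["falls"] + 0.5 * df["weight_loss"]
df["time"] = rng.exponential(np.exp(-risk) * 24)        # follow-up in months
df["event"] = rng.integers(0, 2, n)                     # death indicator

candidates = ["iadl_limitation", "falls", "weight_loss", "fatigue"]
best = None
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        cph = CoxPHFitter().fit(df[list(subset) + ["time", "event"]],
                                duration_col="time", event_col="event")
        aic = -2 * cph.log_likelihood_ + 2 * len(cph.params_)   # partial-likelihood AIC
        if best is None or aic < best[0]:
            best = (aic, subset)
print(f"lowest-AIC subset: {best[1]} (AIC = {best[0]:.1f})")
```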
ABSTRACT
In this paper, we propose the hard thresholding regression (HTR) for estimating high-dimensional sparse linear regression models. HTR uses a two-stage convex algorithm to approximate the ℓ0-penalized regression: The first stage calculates a coarse initial estimator, and the second stage identifies the oracle estimator by borrowing information from the first one. Theoretically, the HTR estimator achieves the strong oracle property over a wide range of regularization parameters. Numerical examples and a real data example lend further support to our proposed methodology.
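A minimal two-stage sketch in the spirit of the description, assuming scikit-learn: a coarse Lasso fit as the initial estimator, followed by hard thresholding and an OLS refit on the surviving support. It illustrates the two-stage structure only; it is not the HTR algorithm or its theory.

```python
# Minimal two-stage sketch: coarse Lasso fit, then hard thresholding and an OLS
# refit on the surviving support. Illustrative only, not the HTR algorithm.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(7)
n, p = 100, 200                               # high-dimensional setting (p > n)
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)

# Stage 1: coarse initial estimator
initial = LassoCV(cv=5).fit(X, y).coef_

# Stage 2: hard-threshold the initial estimator, then refit OLS on the support
tau = 0.5                                     # illustrative threshold level
support = np.flatnonzero(np.abs(initial) > tau)
refit = LinearRegression().fit(X[:, support], y)
print("selected support:", support, "refitted coefficients:", refit.coef_.round(2))
```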
ABSTRACT
We propose a scalable algorithmic framework for exact Bayesian variable selection and model averaging in linear models under the assumption that the Gram matrix is block-diagonal, and as a heuristic for exploring the model space for general designs. In block-diagonal designs our approach returns the most probable model of any given size without resorting to numerical integration. The algorithm also provides a novel and efficient solution to the frequentist best subset selection problem for block-diagonal designs. Posterior probabilities for any number of models are obtained by evaluating a single one-dimensional integral, and other quantities of interest, such as variable inclusion probabilities and model-averaged regression estimates, are obtained by adaptive, deterministic one-dimensional numerical integration. The overall computational cost scales linearly with the number of blocks, which can be processed in parallel, and exponentially with the block size, making the method most suitable when predictors are organized in many moderately sized blocks. For general designs, we approximate the Gram matrix by a block-diagonal matrix using spectral clustering and propose an iterative algorithm that capitalizes on the block-diagonal algorithms to explore the model space efficiently. All methods proposed in this paper are implemented in the R library mombf.
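A toy numpy illustration of why block-diagonal designs help: when two blocks of predictors are mutually orthogonal, the residual sum of squares of any subset decomposes across blocks, so per-block enumeration recovers the global best subset once the size allocation across blocks is accounted for (here the optimal allocation is one variable per block by construction). The paper's Bayesian machinery and the R package mombf are not reproduced.

```python
# Toy illustration: with orthogonal blocks, the RSS of any subset decomposes
# across blocks, so per-block enumeration suffices for best-subset search.
from itertools import combinations

import numpy as np

def rss(X, y, subset):
    coef, *_ = np.linalg.lstsq(X[:, list(subset)], y, rcond=None)
    r = y - X[:, list(subset)] @ coef
    return float(r @ r)

rng = np.random.default_rng(8)
n = 60
# Two blocks of predictors that are exactly orthogonal to each other
A = np.linalg.qr(rng.standard_normal((n, 6)))[0]
block1_cols, block2_cols = range(3), range(3, 6)
X = A
y = 2 * X[:, 0] - X[:, 4] + 0.3 * rng.standard_normal(n)

# Global exhaustive best subset of size 2 ...
global_best = min(combinations(range(6), 2), key=lambda s: rss(X, y, s))
# ... versus assembling it from per-block searches with the (1, 1) allocation,
# which is the optimal allocation for this y by construction.
best_b1 = min(combinations(block1_cols, 1), key=lambda s: rss(X, y, s))
best_b2 = min(combinations(block2_cols, 1), key=lambda s: rss(X, y, s))
print(global_best, best_b1 + best_b2)        # the two subsets should coincide
```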