RESUMO
Genetic prediction holds immense promise for translating genetic discoveries into medical advances. As the high-dimensional covariance matrix (or the linkage disequilibrium (LD) pattern) of genetic variants often presents a block-diagonal structure, numerous methods account for the dependence among variants in predetermined local LD blocks. Moreover, due to privacy considerations and data protection concerns, genetic variant dependence in each LD block is typically estimated from external reference panels rather than the original training data set. This paper presents a unified analysis of blockwise and reference panel-based estimators in a high-dimensional prediction framework without sparsity restrictions. We find that, surprisingly, even when the covariance matrix has a block-diagonal structure with well-defined boundaries, blockwise estimation methods adjusting for local dependence can be substantially less accurate than methods controlling for the whole covariance matrix. Further, estimation methods built on the original training data set and external reference panels are likely to have varying performance in high dimensions, which may reflect the cost of only having access to summary-level data from the training data set. This analysis is based on novel results in random matrix theory for block-diagonal covariance matrices. We numerically evaluate our results using extensive simulations and real data analysis in the UK Biobank.
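As a concrete illustration of the two estimator families compared above, the sketch below builds a block-diagonal covariance, computes summary statistics from a training sample, and contrasts a blockwise with a whole-matrix summary-statistics ridge estimator based on reference-panel LD; the ridge form, penalty value, and all dimensions are our own illustrative choices rather than the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): two LD blocks of 50 variants each.
n, p, block = 500, 100, 50
blocks = [np.arange(0, block), np.arange(block, p)]

# Block-diagonal population covariance with within-block correlation 0.5.
Sigma = np.zeros((p, p))
for idx in blocks:
    Sigma[np.ix_(idx, idx)] = 0.5
np.fill_diagonal(Sigma, 1.0)

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T            # training genotypes
X_ref = rng.standard_normal((300, p)) @ L.T      # external reference panel
beta = rng.standard_normal(p) / np.sqrt(p)       # dense (non-sparse) effects
y = X @ beta + rng.standard_normal(n)

# Summary statistics: marginal covariances between each variant and the phenotype.
r = X.T @ y / n
lam = 0.1                                        # illustrative ridge penalty

def ridge_from_summary(ld, r_sub, lam):
    """Generic summary-statistics ridge: (R + lam I)^{-1} r."""
    return np.linalg.solve(ld + lam * np.eye(len(r_sub)), r_sub)

R_ref = X_ref.T @ X_ref / X_ref.shape[0]         # reference-panel LD estimate

# (a) Blockwise estimator: adjust for local LD only.
beta_block = np.zeros(p)
for idx in blocks:
    beta_block[idx] = ridge_from_summary(R_ref[np.ix_(idx, idx)], r[idx], lam)

# (b) Whole-matrix estimator: adjust for the full LD matrix at once.
beta_full = ridge_from_summary(R_ref, r, lam)

for name, b in [("blockwise", beta_block), ("full", beta_full)]:
    pred_err = (b - beta) @ Sigma @ (b - beta)   # proxy for out-of-sample prediction error
    print(name, round(pred_err, 4))
```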
RESUMO
When analyzing data combined from multiple sources (e.g., hospitals, studies), the heterogeneity across different sources must be accounted for. In this paper, we consider high-dimensional linear regression models for integrative data analysis. We propose a new adaptive clustering penalty (ACP) method to simultaneously select variables and cluster source-specific regression coefficients with sub-homogeneity. We show that the estimator based on the ACP method enjoys a strong oracle property under certain regularity conditions. We also develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) for parameter estimation. We conduct simulation studies to compare the performance of the proposed method to three existing methods (a fused LASSO with adjacent fusion, a pairwise fused LASSO, and a multi-directional shrinkage penalty method). Finally, we apply the proposed method to the multi-center Childhood Adenotonsillectomy Trial to identify sub-homogeneity in the treatment effects across different study sites.
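The ACP estimator itself is not reproduced here, but the ADMM structure it relies on can be illustrated with a generic fused-lasso-type penalty; the difference matrix, penalty level, and toy data below are our own choices, and the ACP's adaptive clustering weights are omitted.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_fused(X, y, D, lam, rho=1.0, n_iter=200):
    """Generic ADMM for 0.5*||y - X b||^2 + lam*||D b||_1 (not the ACP penalty itself)."""
    p, m = X.shape[1], D.shape[0]
    b = np.zeros(p)
    z = np.zeros(m)       # copy of D b
    u = np.zeros(m)       # scaled dual variable
    XtX, Xty = X.T @ X, X.T @ y
    A = XtX + rho * D.T @ D
    for _ in range(n_iter):
        b = np.linalg.solve(A, Xty + rho * D.T @ (z - u))    # quadratic b-update
        z = soft_threshold(D @ b + u, lam / rho)             # proximal z-update
        u = u + D @ b - z                                    # dual ascent
    return b

# Toy usage: fuse adjacent coefficients via a difference-matrix penalty.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
beta_true = np.array([2.0, 2.0, 2.0, 0.0, 0.0, -1.0, -1.0, -1.0])
y = X @ beta_true + 0.5 * rng.standard_normal(100)
D = np.eye(8, k=1)[:-1] - np.eye(8)[:-1]   # rows encode b_{j+1} - b_j
print(np.round(admm_fused(X, y, D, lam=5.0), 2))
```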
RESUMO
To address one of the most challenging problems in hospital management, patients' absenteeism without prior notice, this study analyses the risk factors associated with this event. To this end, using real data from a hospital located in the North of Portugal, a prediction model previously validated in the literature is used to infer absenteeism risk factors, and an explainable model based on a modified CART algorithm is proposed. The latter is intended to generate a human-interpretable explanation for patient absenteeism, and its implementation is described in detail. Furthermore, given the significant impact the COVID-19 pandemic had on hospital management, a comparison of patients' absenteeism profiles before and during the pandemic is performed. The results differ between hospital specialities and time periods, meaning that patient absenteeism profiles change during pandemic periods and across specialities.
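For readers unfamiliar with how a CART yields human-interpretable explanations, the sketch below fits a standard (unmodified) scikit-learn decision tree to synthetic appointment data with hypothetical features and prints its rule paths; the paper's modified CART algorithm and the Portuguese hospital data are not reproduced.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(18, 90, n),      # age (hypothetical feature)
    rng.integers(0, 60, n),       # days between scheduling and appointment (hypothetical)
    rng.integers(0, 2, n),        # previous no-show indicator (hypothetical)
])
# Synthetic no-show outcome loosely tied to waiting time and history.
p_no_show = 1 / (1 + np.exp(-(0.05 * X[:, 1] + 1.0 * X[:, 2] - 2.0)))
y = rng.binomial(1, p_no_show)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(X, y)
# Human-readable rule paths extracted from the fitted tree.
print(export_text(tree, feature_names=["age", "waiting_days", "prior_no_show"]))
```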
RESUMO
We consider a high-dimensional linear regression problem. Unlike many papers on the topic, we do not require sparsity of the regression coefficients; instead, our main structural assumption is a decay of the eigenvalues of the covariance matrix of the data. We propose a new family of estimators, called canonical thresholding estimators, which pick the largest regression coefficients in the canonical form. The estimators admit an explicit form and can be linked to the LASSO and Principal Component Regression (PCR). A theoretical analysis for both fixed design and random design settings is provided. The obtained bounds on the mean squared error and the prediction error of a specific estimator from the family allow us to clearly state sufficient conditions on the eigenvalue decay that ensure convergence. In addition, we promote the use of relative errors, strongly linked with the out-of-sample R². The study of these relative errors leads to a new concept of joint effective dimension, which incorporates the covariance of the data and the regression coefficients simultaneously, and describes the complexity of a linear regression problem. Some minimax lower bounds are established to showcase the optimality of our procedure. Numerical simulations confirm the good performance of the proposed estimators compared to previously developed methods.
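One way to picture an estimator of this kind, under our own reading of the "canonical form" as the eigenbasis of the sample covariance, is to estimate the coefficients componentwise in that basis and keep only the largest components; the thresholding rule below is illustrative and not the paper's exact procedure.

```python
import numpy as np

def canonical_threshold_fit(X, y, k_keep):
    """Estimate coefficients in the eigenbasis of the sample covariance and keep
    only the largest components (an illustrative variant; the paper's exact
    thresholding rule may differ)."""
    n = X.shape[0]
    S = X.T @ X / n
    evals, U = np.linalg.eigh(S)                            # canonical directions
    Z = X @ U                                               # rotated (decorrelated) design
    theta = Z.T @ y / (n * np.maximum(evals, 1e-12))        # per-component estimates
    score = np.abs(theta) * np.sqrt(np.maximum(evals, 0))   # scale-aware magnitude
    keep = np.argsort(score)[::-1][:k_keep]
    theta_thr = np.zeros_like(theta)
    theta_thr[keep] = theta[keep]
    return U @ theta_thr                                    # back to the original basis

# Toy data with decaying eigenvalues and dense (non-sparse) coefficients.
rng = np.random.default_rng(2)
n, p = 200, 300
X = rng.standard_normal((n, p)) @ np.diag(1.0 / np.arange(1, p + 1) ** 0.5)
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)
beta_hat = canonical_threshold_fit(X, y, k_keep=20)
print(round(np.mean((X @ beta_hat - X @ beta) ** 2), 3))    # in-sample prediction error proxy
```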
RESUMO
Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ₂-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
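The minimum ℓ₂-norm interpolator and the double descent behavior it exhibits can be reproduced in a few lines; the isotropic design, signal dimension, and noise level below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_signal, sigma = 200, 30, 0.5

def min_norm_risk(p, n_test=2000):
    """Test risk of the minimum-l2-norm ('ridgeless') least squares fit
    in an isotropic linear model, as a function of the number of features p."""
    beta = np.zeros(p)
    beta[:min(d_signal, p)] = 1.0 / np.sqrt(min(d_signal, p))
    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.pinv(X) @ y            # min-norm interpolator when p > n
    X_test = rng.standard_normal((n_test, p))
    y_test = X_test @ beta + sigma * rng.standard_normal(n_test)
    return np.mean((X_test @ beta_hat - y_test) ** 2)

# The risk typically spikes near p ≈ n and can decrease again beyond it ("double descent").
for p in [50, 100, 180, 200, 220, 400, 800]:
    print(p, round(min_norm_risk(p), 3))
```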
RESUMO
Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber in 1973 [Ann. Statist. 1 (1973) 799-821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to sample size, dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest, the Berry-Esseen inequality and Cramér-type moderate deviation. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. It is shown that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets are also provided to back up our theory.
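A minimal sketch of an adaptive Huber regression fit, assuming the commonly used calibration τ ∝ σ̂ √(n/(d + log n)) for the robustification parameter; the paper's precise tuning may differ.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, tau):
    """Huber loss applied to residuals r with robustification parameter tau."""
    small = np.abs(r) <= tau
    return np.where(small, 0.5 * r ** 2, tau * np.abs(r) - 0.5 * tau ** 2)

def adaptive_huber_fit(X, y, c=1.0):
    """Huber regression with tau growing with the sample size; the rule
    tau = c * sigma_hat * sqrt(n / (d + log n)) is an assumed calibration."""
    n, d = X.shape
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]               # least squares warm start
    resid0 = y - X @ b0
    sigma_hat = 1.4826 * np.median(np.abs(resid0 - np.median(resid0)))  # robust scale (MAD)
    tau = c * sigma_hat * np.sqrt(n / (d + np.log(n)))
    obj = lambda b: np.sum(huber_loss(y - X @ b, tau))
    return minimize(obj, x0=b0, method="BFGS").x, tau

# Toy data with heavy-tailed (Student t, df = 2.5) noise.
rng = np.random.default_rng(3)
n, d = 500, 5
X = rng.standard_normal((n, d))
beta = np.ones(d)
y = X @ beta + rng.standard_t(df=2.5, size=n)
beta_hat, tau = adaptive_huber_fit(X, y)
print(np.round(beta_hat, 2), round(tau, 2))
```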
RESUMO
Linear Mixed Effects (LME) models are powerful statistical tools that have been employed in many different real-world applications such as retail data analytics, marketing measurement, and medical research. Statistical inference is often conducted via maximum likelihood estimation with Normality assumptions on the random effects. Nevertheless, for many applications in the retail industry, it is often necessary to consider non-Normal distributions on the random effects when considering the business interpretations of the unknown parameters. Motivated by this need, a linear mixed effects model with possibly non-Normal random-effects distributions is studied in this research. We propose a general estimating framework based on a saddlepoint approximation (SA) of the probability density function of the dependent variable, which leads to constrained nonlinear optimization problems. The classical LME model with the Normality assumption can then be viewed as a special case under the proposed general SA framework. Compared with the existing approach, the proposed method enhances the real-world interpretability of the estimates while maintaining satisfactory model fits.
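For reference, the generic saddlepoint approximation to a density that the proposed framework builds on can be stated as follows; this is our own summary in standard notation, not the paper's constrained-optimization formulation.

```latex
% Generic saddlepoint approximation to the density of Y, where K_Y is the
% cumulant generating function of Y and \hat{s} solves the saddlepoint equation.
\[
  \hat f_Y(y) \;=\; \frac{1}{\sqrt{2\pi\, K_Y''(\hat s)}}\,
  \exp\!\big\{ K_Y(\hat s) - \hat s\, y \big\},
  \qquad \text{where } K_Y'(\hat s) = y .
\]
```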
RESUMO
This paper consists of two parts. The first part proposes an explicit robust estimation method for the regression coefficients in simple linear regression based on the power-weighted repeated medians technique, which has a tuning constant for dealing with the trade-off between efficiency and robustness. We then investigate the lower and upper bounds of the finite-sample breakdown point of the proposed method. The second part shows that, based on a linearization of the cumulative distribution function, the proposed method can be applied to obtain robust parameter estimators for the Weibull and Birnbaum-Saunders distributions, which are commonly used in both reliability and survival analysis. Numerical studies demonstrate that the proposed method performs approximately comparably with the ordinary least squares method, whereas it is far superior in the presence of data contamination, which occurs frequently in practice.
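The classical (unweighted) repeated medians slope estimator underlying the proposal can be sketched as follows; the paper's power-weighted inner step and its tuning constant are not implemented here.

```python
import numpy as np

def repeated_median_slope(x, y):
    """Siegel's classical repeated median slope estimator:
    median over i of the median over j != i of pairwise slopes.
    The paper's power-weighted variant replaces the inner median by a
    power-weighted version governed by a tuning constant."""
    n = len(x)
    inner = np.empty(n)
    for i in range(n):
        dx = x - x[i]
        dy = y - y[i]
        mask = dx != 0                       # drop the point itself and exact x ties
        inner[i] = np.median(dy[mask] / dx[mask])
    return np.median(inner)

# Toy usage with 20% gross outliers.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 100)
y[:20] += 50                                 # contamination
slope = repeated_median_slope(x, y)
intercept = np.median(y - slope * x)         # companion robust intercept estimate
print(round(slope, 2), round(intercept, 2))
```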
RESUMO
Despite the vast advantages of making antenatal care visits, service utilization among pregnant women in Nigeria is suboptimal. A five-year monitoring estimate indicated that about 24% of the women who had live births made no visit. This non-utilization induced excessive zeroes in the outcome of interest. Thus, this study adopted a zero-inflated negative binomial model within a Bayesian framework to identify the spatial pattern and the key factors hindering antenatal care utilization in Nigeria. We overcome the intractability associated with posterior inference by adopting a Pólya-Gamma data-augmentation technique. The Gibbs sampling algorithm was used to draw samples from the joint posterior distribution. Results revealed that type of place of residence, maternal level of education, access to mass media, household work index, and the woman's working status have significant effects on the use of antenatal care services. The findings identified substantial state-level spatial disparities in antenatal care utilization across the country. Cost-effective strategies to achieve an acceptable frequency of utilization include creating community-specific awareness of the importance and benefits of appropriate utilization. Special consideration should be given to older pregnant women, women in states with poor antenatal utilization, and women residing in regions with poor road networks.
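The zero-inflated negative binomial likelihood referred to above has the standard form below, written in a generic mean/dispersion parameterization; the paper's covariate effects, spatial terms, and Pólya-Gamma augmentation enter through π_i and μ_i and are omitted from this sketch.

```latex
% Zero-inflated negative binomial model for the count of antenatal visits Y_i,
% with zero-inflation probability pi_i, mean mu_i, and dispersion r.
\[
  P(Y_i = 0) \;=\; \pi_i + (1-\pi_i)\left(\frac{r}{r+\mu_i}\right)^{\! r},
  \qquad
  P(Y_i = k) \;=\; (1-\pi_i)\,
  \frac{\Gamma(k+r)}{k!\,\Gamma(r)}
  \left(\frac{r}{r+\mu_i}\right)^{\! r}
  \left(\frac{\mu_i}{r+\mu_i}\right)^{\! k},
  \quad k \ge 1 .
\]
```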
RESUMO
In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical to binary outcomes (loss of information) to conduct logistic regression (probabilistic interpretation). We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).
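A minimal sketch of the combination idea described above, assuming Gaussian residuals when converting linear-regression predictions into exceedance probabilities and using equal weights for the two probability estimates; the cornet package's actual penalised estimators and data-adaptive combination rule may differ.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(5)
n, p, cutoff = 300, 50, 0.0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0
y_num = X @ beta + rng.standard_normal(n)        # numerical outcome
y_bin = (y_num > cutoff).astype(int)             # dichotomised outcome

X_new = rng.standard_normal((10, p))             # new samples to predict

# (1) Logistic regression on the binary outcome.
p_logistic = LogisticRegression(max_iter=1000).fit(X, y_bin).predict_proba(X_new)[:, 1]

# (2) Linear (ridge) regression on the numerical outcome, with predicted values
#     transformed into probabilities of exceeding the cutoff (Gaussian residual assumption).
lin = Ridge(alpha=1.0).fit(X, y_num)
sigma_hat = np.std(y_num - lin.predict(X))
p_linear = norm.cdf((lin.predict(X_new) - cutoff) / sigma_hat)

# (3) Combine the two probability estimates (equal weights purely for illustration).
p_combined = 0.5 * p_logistic + 0.5 * p_linear
print(np.round(p_combined, 2))
```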
RESUMO
Cure models based on standard distributions such as the exponential, Weibull, lognormal, Gompertz, and gamma are often used to analyze survival data from cancer clinical trials with long-term survivors. Sometimes the data are simple and the standard cure models fit them very well; more often, however, the data are complex and the standard cure models do not fit them reasonably well. In this article, we offer a novel generalized Gompertz promotion time cure model and illustrate its fit to gastric cancer data by three different methods. The generalized Gompertz distribution is as simple as the generalized Weibull distribution and is not as computationally intensive as the generalized F distribution. One detailed real data application is provided for illustration and comparison purposes.
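For context, the promotion time cure model has the standard structure below; the generalized Gompertz latency cdf shown second is one common exponentiated-Gompertz form and may differ from the paper's exact parameterization.

```latex
% Promotion time cure model: theta is the mean number of latent competing causes
% and F is the latency distribution; the population survival function and cure
% fraction are
\[
  S_{\mathrm{pop}}(t) \;=\; \exp\{-\theta\, F(t)\},
  \qquad
  \lim_{t\to\infty} S_{\mathrm{pop}}(t) \;=\; e^{-\theta}.
\]
% One common exponentiated (generalized) Gompertz cdf, shown for illustration:
\[
  F(t) \;=\; \Bigl[\,1 - \exp\!\Bigl\{-\tfrac{\lambda}{\gamma}\bigl(e^{\gamma t}-1\bigr)\Bigr\}\Bigr]^{\alpha},
  \qquad t > 0 .
\]
```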
RESUMO
Among the models applied to analyze survival data, a standout is the inverse Gaussian distribution, which belongs to the class of models for analyzing positive, asymmetric data. However, the variance of this distribution depends on two parameters, which prevents establishing a functional relation with a linear predictor when the assumption of constant variance does not hold. In this context, the aim of this paper is to re-parameterize the inverse Gaussian distribution to enable establishing an association between a linear predictor and the variance. We propose deviance residuals to verify the model assumptions. Simulations indicate that the distribution of these residuals approaches the standard normal distribution and that the mean squared errors of the estimators are small for large samples. Further, we fit the new model to hospitalization times of COVID-19 patients in Piracicaba (Brazil), which indicates that men spend more time hospitalized than women, and that this pattern is more pronounced for individuals older than 60 years. The re-parameterized inverse Gaussian model proved to be a good alternative for analyzing censored data with non-constant variance.
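For reference, the standard inverse Gaussian density and its moments, which motivate the re-parameterization; treating σ² = μ³/λ as the quantity linked to the linear predictor is our own illustrative reading, and the paper's exact parameterization may differ.

```latex
% Standard inverse Gaussian density and moments: the variance depends on both
% parameters, which is the difficulty the re-parameterization addresses.
\[
  f(y;\mu,\lambda) \;=\; \sqrt{\frac{\lambda}{2\pi y^{3}}}\,
  \exp\!\left\{-\frac{\lambda\,(y-\mu)^{2}}{2\mu^{2} y}\right\},
  \qquad y>0,
  \qquad
  E(Y)=\mu, \quad \operatorname{Var}(Y)=\frac{\mu^{3}}{\lambda}.
\]
```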
RESUMO
Distributed interval estimation in linear regression may be computationally infeasible in the presence of big data that are normally stored in different computer servers or in the cloud. A further challenge is that the results from distributed estimation may still contain redundant information about the population characteristics of the data. To tackle this computing challenge, we develop an optimization procedure to select the best subset from the collection of data subsets, based on which we perform interval estimation in the context of linear regression. The procedure is derived by minimizing the length of the final interval estimator and maximizing the information retained in the selected data subset, and is therefore named the LIC criterion. The theoretical performance of the LIC criterion is studied in this paper, together with a simulation study and a real data analysis.
RESUMO
Epigenetic alterations are key drivers in the development and progression of cancer. Identifying differentially methylated cytosines (DMCs) in cancer samples is a crucial step toward understanding these changes. In this paper, we propose a trans-dimensional Markov chain Monte Carlo (TMCMC) approach, called DMCTHM, that uses hidden Markov models (HMMs) with binomial emission and bisulfite sequencing (BS-Seq) data to identify DMCs in cancer epigenetic studies. We introduce the Expander-Collider penalty to tackle under- and over-estimation in TMCMC-HMMs. We address all known challenges inherent in BS-Seq data by introducing novel approaches for capturing functional patterns and the autocorrelation structure of the data, as well as for handling missing values, multiple covariates, multiple comparisons, and family-wise errors. We demonstrate the effectiveness of DMCTHM through comprehensive simulation studies. The results show that our proposed method outperforms other competing methods in identifying DMCs. Notably, with DMCTHM, we uncovered new DMCs and genes in colorectal cancer that were significantly enriched in the TP53 pathway.
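The binomial emission referred to above is the standard choice for BS-Seq read counts; a schematic statement follows, with covariates, autocorrelation, and the trans-dimensional moves of DMCTHM omitted.

```latex
% Binomial emission in the HMM: at cytosine i with read coverage n_i, the number
% of methylated reads m_i given the hidden methylation state Z_i = k follows
\[
  m_i \mid Z_i = k \;\sim\; \mathrm{Binomial}\bigl(n_i,\; p_k\bigr),
\]
% where p_k is the state-specific methylation probability.
```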
RESUMO
A two-parameter unit distribution, its regression model, and an extension allowing inflation at 0 and 1 are introduced and studied. The distribution is called the unit upper truncated Weibull (UUTW) distribution, while the inflated variant is called the 0-1 inflated unit upper truncated Weibull (ZOIUUTW) distribution. The UUTW distribution has an increasing or a J-shaped hazard rate function. The parameters of the proposed models are estimated by the method of maximum likelihood. For the UUTW distribution, two practical examples involving household expenditure and maximum flood level data are used to show its flexibility, and the proposed distribution exhibits a better fit than some of the competing unit distributions. Application of the proposed regression model shows that it describes the real data set adequately, with better modeling proficiency than existing competing models. Then, for the ZOIUUTW distribution, CD34+ data involving cancer patients are analyzed to show the flexibility of the model in characterizing inflation at both endpoints of the unit interval.
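One natural construction consistent with the name, truncating a Weibull distribution at the upper end point 1, gives the cdf below; the paper's exact parameterization may differ.

```latex
% Upper truncation of a Weibull law with shape beta and scale-type parameter alpha
% at 1 yields a two-parameter distribution on (0, 1) with cdf
\[
  F(x;\alpha,\beta) \;=\; \frac{1-\exp\{-\alpha x^{\beta}\}}{1-\exp\{-\alpha\}},
  \qquad 0 < x < 1 ,
\]
% and the ZOIUUTW variant mixes such a law with point masses at 0 and 1.
```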
RESUMO
Association testing has been widely used to study the relationship between genetic variants and phenotypes. Most association testing methods are genotype-based; that is, they first estimate genotypes and then regress the phenotype on the estimated genotypes and other variables. Direct testing methods based on next generation sequencing (NGS) data, without genotype calling, have been proposed and have shown advantages over genotype-based methods in scenarios where genotype calling is not accurate. NGS data-based single-variant tests have been proposed, including our previously proposed single-variant testing method, the UNC combo method [1]. We have also proposed NGS data-based group testing methods for continuous phenotypes using a linear model framework that can handle continuous responses [2]. In this paper, we extend our linear model-based framework to a generalized linear model-based framework so that the methods can handle other types of responses, especially binary responses, which are commonly encountered in association studies. We have conducted extensive simulation studies to evaluate the performance of different estimators and to compare our estimators with their corresponding genotype-based methods. We found that all methods control Type I error, and that our NGS data-based testing methods outperform their corresponding genotype-based methods in the literature for other types of responses, including binary responses (logistic regression) and count responses (Poisson regression), especially when sequencing depth is low. In conclusion, we have extended our previous linear model (LM) framework to a generalized linear model (GLM) framework and derived NGS data-based testing methods for a group of genetic variants. Compared with our previously proposed LM-based methods [2], the new GLM-based methods can handle more complex responses (for example, binary and count responses) in addition to continuous responses. Our methods fill a gap in the literature and show advantages over their corresponding genotype-based methods.
RESUMO
Variable selection is fundamental to high dimensional statistical modeling, and many approaches have been proposed. However, existing variable selection methods do not perform well in the presence of outliers in the response variable and/or the covariates. In order to ensure a high probability of correct selection and efficient parameter estimation, we investigate a robust variable selection method based on a modified Huber function with an exponential squared loss tail. We also prove that the proposed method has oracle properties. Furthermore, we carry out simulation studies to evaluate the performance of the proposed method for both p
RESUMO
Students' migration mobility is a new form of migration: students migrate to improve their skills and become more valued in the job market. The data concern the migration of Italian Bachelor's graduates who enrolled at the Master's degree level, typically moving from poorer to richer areas. This paper investigates the effect of migration and other possible determinants on Master's degree students' performance. The Clustering of Effects approach for Quantile Regression Coefficients Modelling is used to cluster the effects of some variables on students' performance across three Italian macro-areas. The results show evidence of similarity between Southern and Central students, as compared with Northern students.
RESUMO
We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased under missing completely at random (MCAR) and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further conduct simulation studies to corroborate the findings and demystify what has been missed or misinterpreted in the literature. Some detailed proofs and simulation results are available in the online supplemental materials.
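A minimal sketch contrasting the two deletion rules under MCAR, using pandas' pairwise-deletion covariance for the available-case estimator; the data-generating model and missingness rates are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame(rng.standard_normal((n, 3)), columns=["x1", "x2", "y"])
df["y"] = 1.0 * df["x1"] - 0.5 * df["x2"] + rng.standard_normal(n)

# Impose missingness completely at random (MCAR) on the covariates.
for col in ["x1", "x2"]:
    df.loc[rng.random(n) < 0.3, col] = np.nan

# Complete-case analysis (listwise deletion): keep rows observed on all variables.
cc = df.dropna()
Xc = np.column_stack([np.ones(len(cc)), cc[["x1", "x2"]].to_numpy()])
beta_cc = np.linalg.lstsq(Xc, cc["y"].to_numpy(), rcond=None)[0]

# Available-case analysis (pairwise deletion): estimate each covariance from all
# pairs observed on the two variables involved, then solve the normal equations.
S = df.cov()                                   # pandas uses pairwise deletion of NAs
S_xx = S.loc[["x1", "x2"], ["x1", "x2"]].to_numpy()
S_xy = S.loc[["x1", "x2"], "y"].to_numpy()
beta_ac = np.linalg.solve(S_xx, S_xy)

print("CC slope estimates:", np.round(beta_cc[1:], 3))
print("AC slope estimates:", np.round(beta_ac, 3))
```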
RESUMO
Researchers in statistical shape analysis often analyze outlines of objects. Even though these contours are infinite-dimensional in theory, they must be discretized in practice. When discretizing, it is important to reduce the number of sampling points considerably to lower computational costs, but not to use so few points that the approximation error becomes too large. Unfortunately, determining the minimum number of points needed to approximate the contours sufficiently well is computationally expensive. In this paper, we fit regression models to predict these lower bounds, using characteristics of the contours that are computationally cheap as predictor variables. However, least squares regression is inadequate for this task because it treats overestimation and underestimation equally, whereas underestimation of the lower bounds is far more serious. Instead, to fit the models, we use the LINEX loss function, which allows us to penalize underestimation at an exponential rate while penalizing overestimation only linearly. We present a novel approach for selecting the shape parameter of the loss function and tools for analyzing how well the model fits the data. Through validation methods, we show that the LINEX models work well for reducing underestimation of the lower bounds.
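A minimal sketch of fitting a regression under LINEX loss with a fixed shape parameter; the paper's novel procedure for selecting this parameter is not reproduced, and the sign of the parameter is chosen so that underestimation falls on the exponential branch.

```python
import numpy as np
from scipy.optimize import minimize

def linex_loss(resid, a, b=1.0):
    """LINEX loss b*(exp(a*r) - a*r - 1) on residuals r = prediction - target.
    With a < 0, underestimation (r < 0) is penalized exponentially and
    overestimation roughly linearly, matching the asymmetry described above."""
    return b * (np.exp(a * resid) - a * resid - 1.0)

def linex_regression(X, y, a=-1.0):
    """Linear regression coefficients minimizing total LINEX loss
    (the shape parameter a is fixed here rather than selected)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    obj = lambda beta: np.sum(linex_loss(X1 @ beta - y, a))
    beta0 = np.linalg.lstsq(X1, y, rcond=None)[0]       # least squares warm start
    return minimize(obj, beta0, method="BFGS").x

# Toy usage: predicting a lower bound from cheap contour characteristics
# (synthetic stand-ins for the real predictors).
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, (200, 3))
y = 10 + X @ np.array([5.0, 3.0, 1.0]) + rng.normal(0, 1, 200)
beta_ls = np.linalg.lstsq(np.column_stack([np.ones(200), X]), y, rcond=None)[0]
beta_lx = linex_regression(X, y, a=-1.0)
print(np.round(beta_ls, 2))
print(np.round(beta_lx, 2))   # intercept typically shifts upward to avoid underestimation
```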