Results 1 - 20 of 21
1.
J Appl Stat ; 51(10): 1976-2006, 2024.
Article in English | MEDLINE | ID: mdl-39071252

ABSTRACT

The problems of point estimation and classification under the assumption that the training data follow a Lindley distribution are considered. Bayes estimators are derived for the parameter of the Lindley distribution using Markov chain Monte Carlo (MCMC) and Tierney and Kadane's [Tierney and Kadane, Accurate approximations for posterior moments and marginal densities, J. Amer. Statist. Assoc. 81 (1986), pp. 82-86] methods. In the sequel, we prove that the Bayes estimators using Tierney and Kadane's approximation and Lindley's approximation both converge to the maximum likelihood estimator (MLE) as n → ∞, where n is the sample size. The performances of all the proposed estimators are compared numerically with some existing ones using bias and mean squared error (MSE). Our simulation study shows that the proposed estimators perform better than some of the existing ones. Applying these estimators, we construct several plug-in type classification rules and a rule that uses the likelihood accordance function. The performance of each rule is numerically evaluated using the expected probability of misclassification (EPM). Two real-life examples related to COVID-19 are considered for illustrative purposes.
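The convergence result concerns the MLE of the Lindley parameter, which has a well-known closed form: the score equation reduces to a quadratic whose positive root is the estimator. A minimal sketch (not taken from the paper itself); the sampler uses the standard mixture representation of the Lindley distribution:

```python
import math
import random

def rlindley(n, theta, rng):
    # Lindley(theta) is a mixture: Exp(theta) with probability theta/(1+theta),
    # otherwise Gamma(2, theta), i.e. a sum of two independent Exp(theta) draws.
    draws = []
    for _ in range(n):
        if rng.random() < theta / (1 + theta):
            draws.append(rng.expovariate(theta))
        else:
            draws.append(rng.expovariate(theta) + rng.expovariate(theta))
    return draws

def lindley_mle(xs):
    # The score equation reduces to  xbar*t^2 + (xbar - 1)*t - 2 = 0;
    # the MLE is its positive root.
    xbar = sum(xs) / len(xs)
    return (-(xbar - 1) + math.sqrt((xbar - 1) ** 2 + 8 * xbar)) / (2 * xbar)

rng = random.Random(1)
theta_hat = lindley_mle(rlindley(20000, 1.5, rng))
```

With 20,000 draws the estimate lands close to the true value 1.5, illustrating the consistency that the convergence result builds on.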

2.
J Appl Stat ; 51(9): 1756-1771, 2024.
Article in English | MEDLINE | ID: mdl-38933137

ABSTRACT

In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical to binary outcomes (loss of information) to conduct logistic regression (probabilistic interpretation). We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).
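The abstract does not spell out the combination rule, so the following is only a rough sketch of the general idea under simplifying assumptions: one predictor, a Gaussian error model for the linear part, and an equal-weight average of the two predicted probabilities (the actual cornet package chooses the weighting data-adaptively). All data here are simulated for illustration:

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical data: numeric outcome y, dichotomised as z = 1 if y > t.
rng = random.Random(7)
t = 0.0
data = []
for _ in range(400):
    x = rng.gauss(0, 1)
    y = 1.5 * x + rng.gauss(0, 1)
    data.append((x, y, 1 if y > t else 0))

# Linear regression y ~ x (closed form for a single predictor).
n = len(data)
mx = sum(d[0] for d in data) / n
my = sum(d[1] for d in data) / n
beta = (sum((d[0] - mx) * (d[1] - my) for d in data)
        / sum((d[0] - mx) ** 2 for d in data))
alpha = my - beta * mx
sigma = math.sqrt(sum((d[1] - alpha - beta * d[0]) ** 2 for d in data) / (n - 2))

# Logistic regression z ~ x by plain gradient ascent.
w0 = w1 = 0.0
for _ in range(2000):
    g0 = g1 = 0.0
    for x, _, z in data:
        p = 1 / (1 + math.exp(-(w0 + w1 * x)))
        g0 += z - p
        g1 += (z - p) * x
    w0 += 0.01 * g0 / n
    w1 += 0.01 * g1 / n

def combined_prob(x):
    # Average the logistic probability with the probability implied by the
    # linear model: P(y > t | x) = 1 - Phi((t - yhat) / sigma).
    p_log = 1 / (1 + math.exp(-(w0 + w1 * x)))
    p_lin = 1 - normal_cdf((t - alpha - beta * x) / sigma)
    return 0.5 * (p_log + p_lin)
```

The linear part contributes the information that dichotomisation throws away: how far the numeric outcome is predicted to sit from the threshold.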

3.
J Multivar Anal ; 202, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38433779

ABSTRACT

Network estimation has been a critical component of high-dimensional data analysis and can provide an understanding of the underlying complex dependence structures. Among the existing studies, Gaussian graphical models have been highly popular. However, they still have limitations due to the homogeneous distribution assumption and the fact that they are only applicable to small-scale data. For example, cancers have various levels of unknown heterogeneity, and biological networks, which include thousands of molecular components, often differ across subgroups while also sharing some commonalities. In this article, we propose a new joint estimation approach for multiple networks with unknown sample heterogeneity, by decomposing the Gaussian graphical model (GGM) into a collection of sparse regression problems. A reparameterization technique and a composite minimax concave penalty are introduced to effectively accommodate the specific and common information across the networks of multiple subgroups, making the proposed estimator a significant advance over existing heterogeneity network analyses based directly on the regularized likelihood of the GGM, and giving it scale-invariance, tuning-insensitivity, and convexity of the optimization. The proposed analysis can be effectively realized using parallel computing. The estimation and selection consistency properties are rigorously established. The proposed approach allows the theoretical studies to focus on independent network estimation only and has the significant advantage of being both theoretically and computationally applicable to large-scale data. Extensive numerical experiments with simulated data and the TCGA breast cancer data demonstrate the prominent performance of the proposed approach in both subgroup and network identification.
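The decomposition of a GGM into sparse regressions rests on a standard fact: regressing one Gaussian coordinate on the rest yields zero coefficients exactly where that node has no edges. A minimal neighborhood-regression sketch, using plain least squares in place of the paper's penalized composite criterion, on a simulated three-node chain graph:

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def neighborhood_coefs(X, j):
    # Regress column j on all other columns by ordinary least squares;
    # in a GGM, (near-)zero coefficients correspond to absent edges of node j.
    others = [k for k in range(len(X[0])) if k != j]
    A = [[sum(row[a] * row[b] for row in X) for b in others] for a in others]
    b = [sum(row[a] * row[j] for row in X) for a in others]
    return dict(zip(others, solve(A, b)))

# Chain graph 0 - 1 - 2: node 0 and node 2 are conditionally independent.
rng = random.Random(0)
X = []
for _ in range(5000):
    x1 = rng.gauss(0, 1)
    x2 = 0.8 * x1 + rng.gauss(0, 1)
    x3 = 0.8 * x2 + rng.gauss(0, 1)
    X.append([x1, x2, x3])
coefs = neighborhood_coefs(X, 2)
```

The coefficient on node 0 comes out near zero (no edge), while the coefficient on node 1 recovers the direct dependence.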

4.
J Appl Stat ; 51(3): 407-429, 2024.
Article in English | MEDLINE | ID: mdl-38370271

ABSTRACT

The problem of classification into two inverse Gaussian populations with a common mean and ordered scale-like parameters is considered. Surprisingly, the maximum likelihood estimators (MLEs) of the associated model parameters have not been utilized for classification purposes. Note that the MLEs of the model parameters, including the MLE of the common mean, do not have closed-form expressions. In this paper, several classification rules are proposed that use the MLEs and some plug-in type estimators under order restricted scale-like parameters. In the sequel, the risk values of all the proposed estimators are compared numerically, which shows that the proposed plug-in type restricted MLE performs better than others, including the Graybill-Deal type estimator of the common mean. Further, the proposed classification rules are compared in terms of the expected probability of correct classification (EPC) numerically. It is seen that some of our proposed rules have better performance than the existing ones in most of the parameter space. Two real-life examples are considered for application purposes.

5.
J Appl Stat ; 50(9): 1962-1979, 2023.
Article in English | MEDLINE | ID: mdl-37378266

ABSTRACT

Clustering analysis is a prevalent statistical method which divides populations into several subgroups of similar units. However, most existing clustering methods require complete data. One general method that addresses incomplete data is multiple imputation (MI), which avoids many limitations found in single-imputation-based methods and complete-case analyses. Nevertheless, adopting the MI framework for clustering analysis can be challenging, since each imputed dataset might contain a different number of clusters and there is no unique parameter of interest in clustering analysis. In response to this problem, we have developed MICA: Multiply Imputed Cluster Analysis. MICA is a framework for clustering incomplete data consisting of two clustering stages. We assess the properties of MICA and its superiority over other existing incomplete-data clustering strategies in a simulation study under various data structures. In addition, we demonstrate the usage of MICA by applying it to the Youth Risk Behavior Surveillance System (YRBSS) 2019 data.

6.
Stat Interface ; 16(2): 319-335, 2023.
Article in English | MEDLINE | ID: mdl-37193362

ABSTRACT

This article presents a novel approach to clustering and feature selection for categorical time series via interpretable frequency-domain features. A distance measure is introduced based on the spectral envelope and optimal scalings, which parsimoniously characterize prominent cyclical patterns in categorical time series. Using this distance, partitional clustering algorithms are introduced for accurately clustering categorical time series. These adaptive procedures offer simultaneous feature selection for identifying important features that distinguish clusters and fuzzy membership when time series exhibit similarities to multiple clusters. Clustering consistency of the proposed methods is investigated, and simulation studies are used to demonstrate clustering accuracy with various underlying group structures. The proposed methods are used to cluster sleep stage time series for sleep disorder patients in order to identify particular oscillatory patterns associated with sleep disruption.

7.
J Appl Stat ; 50(7): 1496-1514, 2023.
Article in English | MEDLINE | ID: mdl-37197752

ABSTRACT

Accounting for important interaction effects can improve the prediction of many statistical learning models. Identifying relevant interactions, however, is a challenging issue owing to their ultrahigh-dimensional nature. Interaction screening strategies can alleviate such issues. However, due to the heavier-tailed distributions and complex dependence structure of interaction effects, innovative robust and/or model-free methods for screening interactions are required to better scale the analysis of complex and high-throughput data. In this work, we develop a new model-free interaction screening method, termed the Kendall Interaction Filter (KIF), for classification in high-dimensional settings. The KIF method uses a weighted-sum measure, which compares the overall to the within-class Kendall's τ of pairs of predictors, to select interacting pairs of features. The proposed KIF measure captures interactions relevant to the class response variable, handles continuous, categorical, or mixed continuous-categorical features, and is invariant under monotonic transformations. The KIF measure enjoys the sure screening property in the high-dimensional setting under mild conditions, without imposing sub-exponential moment assumptions on the features' distribution. We illustrate the favorable behavior of the proposed methodology compared to methods in the same category using simulation studies, and we conduct real data analyses to demonstrate its utility.
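One plausible reading of the weighted-sum measure (the exact weighting in the paper may differ) is the class-size-weighted gap between within-class and overall Kendall's τ for a pair of predictors:

```python
def kendall_tau(x, y):
    # Kendall's tau-a via the O(n^2) definition:
    # (concordant pairs - discordant pairs) / total pairs.
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = (x[i] - x[j]) * (y[i] - y[j])
            s += (a > 0) - (a < 0)
    return 2 * s / (n * (n - 1))

def kif_score(x, y, labels):
    # Class-size-weighted sum of |within-class tau - overall tau|:
    # large values flag a pair whose dependence changes with the class.
    overall = kendall_tau(x, y)
    n = len(labels)
    score = 0.0
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        tau_c = kendall_tau([x[i] for i in idx], [y[i] for i in idx])
        score += len(idx) / n * abs(tau_c - overall)
    return score

x = [1, 2, 3, 4, 5] * 2
labels = [0] * 5 + [1] * 5
y_interacting = [1, 2, 3, 4, 5] + [5, 4, 3, 2, 1]  # relation flips with class
y_flat = [1, 2, 3, 4, 5] * 2                        # same relation in both
```

The interacting pair scores near 1 while the class-independent pair scores near 0, which is exactly the ordering a screening filter needs; monotone transformations of x or y leave both scores unchanged.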

8.
J Appl Stat ; 50(4): 909-926, 2023.
Article in English | MEDLINE | ID: mdl-36925906

ABSTRACT

This paper presents a new method, called the functional distributional clustering algorithm (FDCA), that seeks to identify spatially contiguous clusters and incorporate changes in temporal patterns across overcrowded networks. This method is motivated by a graph-based network composed of sensors arranged over space, where the recorded observations for each sensor represent a multi-modal distribution. The proposed method is fully non-parametric and generates clusters within an agglomerative hierarchical clustering approach, based on a measure of distance that defines a cumulative distribution function over temporal changes for different locations in space. Traditional hierarchical clustering algorithms that are spatially adapted do not typically accommodate the temporal characteristics of the underlying data. The effectiveness of the FDCA is illustrated using an application to both empirical and simulated data from about 400 sensors in a 2.5-square-mile network area in downtown San Francisco, California. The results demonstrate the superior ability of the FDCA to identify true clusters compared to functional-only and distributional-only algorithms, and similar performance to a model-based clustering algorithm.

9.
J Appl Stat ; 50(3): 675-690, 2023.
Article in English | MEDLINE | ID: mdl-36819077

ABSTRACT

The current large amounts of data and advanced technologies have produced new types of complex data, such as histogram-valued data. The paper focuses on classification problems when predictors are observed as or aggregated into histograms. Because conventional classification methods take vectors as input, a natural approach converts histograms into vector-valued data using summary values, such as the mean or median. However, this approach forgoes the distributional information available in histograms. To address this issue, we propose a margin-based classifier called support histogram machine (SHM) for histogram-valued data. We adopt the support vector machine framework and the Wasserstein-Kantorovich metric to measure distances between histograms. The proposed optimization problem is solved by a dual approach. We then test the proposed SHM via simulated and real examples and demonstrate its superior performance to summary-value-based methods.
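For histograms on a common bin grid, the first-order Wasserstein-Kantorovich distance reduces to the accumulated absolute difference of the two empirical CDFs. A minimal sketch of this distance only (the SVM layer of SHM is not shown):

```python
def wasserstein_1d(h1, h2, bin_width=1.0):
    # First-order Wasserstein distance between two histograms sharing one
    # bin grid: accumulate the absolute gap between the two CDFs.
    assert len(h1) == len(h2)
    n1, n2 = float(sum(h1)), float(sum(h2))
    c1 = c2 = dist = 0.0
    for a, b in zip(h1, h2):
        c1 += a / n1
        c2 += b / n2
        dist += abs(c1 - c2) * bin_width
    return dist
```

Unlike a mean- or median-summary, this distance separates histograms with equal means but different shapes, which is precisely the distributional information the abstract says summary-value approaches forgo.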

10.
Ann Stat ; 50(1): 487-510, 2022 Feb.
Article in English | MEDLINE | ID: mdl-35813218

ABSTRACT

In long-term follow-up studies, data are often collected on repeated measures of multivariate response variables as well as on time to the occurrence of a certain event. To jointly analyze such longitudinal data and survival time, we propose a general class of semiparametric latent-class models that accommodates a heterogeneous study population with flexible dependence structures between the longitudinal and survival outcomes. We combine nonparametric maximum likelihood estimation with sieve estimation and devise an efficient EM algorithm to implement the proposed approach. We establish the asymptotic properties of the proposed estimators through novel use of modern empirical process theory, sieve estimation theory, and semiparametric efficiency theory. Finally, we demonstrate the advantages of the proposed methods through extensive simulation studies and provide an application to the Atherosclerosis Risk in Communities study.

11.
J Appl Stat ; 49(4): 1018-1032, 2022.
Article in English | MEDLINE | ID: mdl-35707809

ABSTRACT

Student migration is a new form of migration: students migrate to improve their skills and become more valued in the job market. The data concern the migration of Italian Bachelor's graduates who enrolled at the Master's degree level, typically moving from poor to rich areas. This paper investigates the effect of migration and other possible determinants on Master's students' performance. The Clustering of Effects approach for Quantile Regression Coefficients Modelling is used to cluster the effects of some variables on students' performance for three Italian macro-areas. Results show evidence of similarity between Southern and Central students relative to Northern ones.

12.
J Multivar Anal ; 189, 2022 May.
Article in English | MEDLINE | ID: mdl-36817965

ABSTRACT

In biomedical data analysis, clustering is commonly conducted. Biclustering analysis conducts clustering in both the sample and covariate dimensions and can more comprehensively describe data heterogeneity. Most of the existing biclustering analyses consider scalar measurements. In this study, motivated by time-course gene expression data and other examples, we take the "natural next step" and consider the biclustering analysis of functionals, under which, for each covariate of each sample, a function (to be exact, its values at discrete measurement points) is present. We develop a doubly penalized fusion approach, which includes a smoothness penalty for estimating functionals and, more importantly, a fusion penalty for clustering. Statistical properties are rigorously established, giving the proposed approach a solid foundation. We also develop an effective ADMM algorithm and accompanying R code. Numerical analysis, including simulations, comparisons, and the analysis of two time-course gene expression datasets, demonstrates the practical effectiveness of the proposed approach.

13.
J Appl Stat ; 48(10): 1833-1860, 2021.
Article in English | MEDLINE | ID: mdl-35706708

ABSTRACT

We propose a novel method to quantify the similarity between an impression (Q) from an unknown source and a test impression (K) from a known source. Using the property of geometrical congruence in the impressions, the degree of correspondence is quantified using ideas from graph theory and maximum clique (MC). The algorithm uses the x and y coordinates of the edges in the images as the data. We focus on local areas in Q and the corresponding regions in K and extract features for comparison. Using pairs of images with known origin, we train a random forest to classify pairs into mates and non-mates. We collected impressions from 60 pairs of shoes of the same brand and model, worn over six months. Using a different set of very similar shoes, we evaluated the performance of the algorithm in terms of the accuracy with which it correctly classified images into source classes. Using classification error rates and ROC curves, we compare the proposed method to other algorithms in the literature and show that for these data, our method shows good classification performance relative to other methods. The algorithm can be implemented with the R package shoeprintr.

14.
J Multivar Anal ; 175, 2020 Jan.
Article in English | MEDLINE | ID: mdl-32063658

ABSTRACT

We propose a new class of generalized linear mixed models with Gaussian mixture random effects for clustered data. To overcome the weak identifiability issues, we fit the model using a penalized Expectation Maximization (EM) algorithm, and develop sequential locally restricted likelihood ratio tests to determine the number of components in the Gaussian mixture. Our work is motivated by an application to nationwide kidney transplant center evaluation in the United States, where the patient-level post-surgery outcomes are repeated measures of the care quality of the transplant centers. By taking into account patient-level risk factors and modeling the center effects by a finite Gaussian mixture model, the proposed model provides a convenient framework to study the heterogeneity among the transplant centers and controls the false discovery rate when screening for transplant centers with non-standard performance.

15.
J Appl Stat ; 47(10): 1739-1756, 2020.
Article in English | MEDLINE | ID: mdl-35707136

ABSTRACT

We consider the clustering of repeatedly measured 'min-max' type interval-valued data. We treat the data as matrix-variate data and assume the covariance matrix is separable for model-based clustering (M-clustering). The use of a separable covariance matrix introduces several advantages in M-clustering, including requiring fewer samples for a valid procedure. In addition, a numerical study shows that this structured matrix allows us to find the correct number of clusters more accurately compared to other commonly assumed covariance matrices. We apply M-clustering with various covariance structures to cluster the longitudinal blood pressure data from the National Heart, Lung, and Blood Institute Growth and Health Study (NGHS).
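Separability means the full covariance is a Kronecker product of a small across-measurement factor and a small across-component factor, which is where the reduction in required samples comes from. A sketch with hypothetical 2×2 factors:

```python
def kron(A, B):
    # Kronecker product of two small covariance factors; under separability
    # the full covariance of the matrix-variate data is Sigma = A (x) B.
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

A = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical across-measurement factor
B = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical min/max-component factor
S = kron(A, B)                # full 4x4 covariance from 3 + 3 free parameters
```

An unstructured 4×4 covariance has 10 free entries, but the separable model spends only 3 + 3, which is the "fewer samples" advantage the abstract refers to.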

16.
J Appl Stat ; 47(13-15): 2895-2911, 2020.
Article in English | MEDLINE | ID: mdl-35707410

ABSTRACT

Spatial sign and rank-based methods have been studied in the recent literature, especially when the dimension is smaller than the sample size. In this paper, a classification method based on the distribution of rank functions for high-dimensional data is considered, with an extension to functional data. The method is fully nonparametric in nature. The performance of the classification method is illustrated in comparison with some other classifiers using simulated and real data sets. Supporting R code is provided for the computational implementation of the classification method, which will be of use to others.
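The spatial sign, the building block of sign- and rank-based methods, simply projects an observation onto the unit sphere so that only its direction is retained. A one-function sketch:

```python
import math

def spatial_sign(x):
    # Project an observation onto the unit sphere: only the direction is
    # kept, which is what spatial sign- and rank-based methods operate on.
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else [0.0] * len(x)
```

Because magnitude is discarded, the transform is invariant to rescaling the observation, one source of the robustness these methods enjoy.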

17.
J Appl Stat ; 47(13-15): 2328-2353, 2020.
Article in English | MEDLINE | ID: mdl-35707426

ABSTRACT

Correct modelling of the insurance loss distribution is crucial in the insurance industry. This distribution is generally highly positively skewed, unimodal hump-shaped, and heavy right-tailed. Compound models are a profitable way to accommodate situations in which some of the probability mass is shifted to the tails of the distribution. Therefore, in this work, a general approach to compound unimodal hump-shaped distributions with a dichotomous mixing distribution is introduced. A 2-parameter unimodal hump-shaped distribution, defined on a positive support, is considered and reparametrized with respect to the mode and to another parameter related to the distribution's variability. The compounding is performed by scaling the latter parameter by means of a dichotomous mixing distribution that governs the tail behavior of the resulting model. The proposed model can also allow for automatic detection of typical and atypical losses via a simple procedure based on maximum a posteriori probabilities. The unimodal gamma and log-normal are considered as examples of unimodal hump-shaped distributions. The resulting models are first evaluated in a sensitivity study and then fitted to two real insurance loss datasets, along with several well-known competitors. Likelihood-based information criteria and risk measures are used to compare the models.

18.
J Appl Stat ; 47(16): 2941-2960, 2020.
Article in English | MEDLINE | ID: mdl-35707710

ABSTRACT

Variable selection in finite mixture of regression (FMR) models is frequently used in statistical modeling. The majority of applications of variable selection in FMR models use a normal distribution for regression error. Such assumptions are unsuitable for a set of data containing a group or groups of observations with asymmetric behavior. In this paper, we introduce a variable selection procedure for FMR models using the skew-normal distribution. With appropriate choice of the tuning parameters, we establish the theoretical properties of our procedure, including consistency in variable selection and the oracle property in estimation. To estimate the parameters of the model, a modified EM algorithm for numerical computations is developed. The methodology is illustrated through numerical experiments and a real data example.

19.
Ann Stat ; 48(1): 111-137, 2020 Feb.
Article in English | MEDLINE | ID: mdl-35847529

ABSTRACT

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X 1, … , X p ) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
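In a G-block covariance model, two variables are similar when they covary similarly with every third variable. A sketch of a COD-type dissimilarity built directly on that definition (the paper's exact metric and its correction terms are not reproduced here; the covariance matrix below is hypothetical):

```python
def cod_distance(S, a, b):
    # Dissimilarity of variables a and b: the largest difference in how
    # they covary with any third variable c.
    return max(abs(S[a][c] - S[b][c])
               for c in range(len(S)) if c not in (a, b))

# Hypothetical G-block covariance: variables {0, 1} share one latent factor,
# variables {2, 3} share another.
S = [[1.2, 1.0, 0.3, 0.3],
     [1.0, 1.2, 0.3, 0.3],
     [0.3, 0.3, 1.2, 1.0],
     [0.3, 0.3, 1.0, 1.2]]
```

Within-cluster pairs get distance zero, across-cluster pairs a strictly positive distance, so thresholding this quantity recovers the model-defined clusters in the noiseless case.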

20.
Ann Appl Stat ; 13(2): 1103-1127, 2019 Jun.
Article in English | MEDLINE | ID: mdl-33381253

ABSTRACT

The advent of high-throughput sequencing technologies has led to an increasing availability of large multi-tissue data sets which contain gene expression measurements across different tissues and individuals. In this setting, variation in expression levels arises due to contributions specific to genes, tissues, individuals, and interactions thereof. Classical clustering methods are ill-suited to explore these three-way interactions and struggle to fully extract the insights into transcriptome complexity contained in the data. We propose a new statistical method, called MultiCluster, based on semi-nonnegative tensor decomposition which permits the investigation of transcriptome variation across individuals and tissues simultaneously. We further develop a tensor projection procedure which detects covariate-related genes with high power, demonstrating the advantage of tensor-based methods in incorporating information across similar tissues. Through simulation and application to the GTEx RNA-seq data from 53 human tissues, we show that MultiCluster identifies three-way interactions with high accuracy and robustness.
