ABSTRACT
Current high-dimensional linear factor models fail to account for different variable types, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. Overdispersion is nonetheless prevalent in practical applications, particularly in biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. This, however, introduces significant computational challenges, because the nonlinear model now involves two high-dimensional latent random matrices. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. The algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors, and numerical results demonstrate its effectiveness. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two genomics datasets. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
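As a rough illustration of a singular-value-ratio criterion of the kind mentioned above, the Python sketch below picks the factor number that maximizes the ratio of consecutive singular values; this is a generic sketch, not the exact definition implemented in the R package GFM, and the toy data, function name and q_max cap are assumptions for illustration only:

import numpy as np

def ratio_criterion(X, q_max=15):
    # Pick the number of factors as the k maximizing sigma_k / sigma_{k+1},
    # the ratio of consecutive singular values of the centered data matrix.
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)[:q_max + 1]
    return int(np.argmax(s[:-1] / s[1:])) + 1

rng = np.random.default_rng(0)
F, B = rng.normal(size=(200, 3)), rng.normal(size=(50, 3))
X = F @ B.T + rng.normal(scale=0.5, size=(200, 50))   # three-factor toy data
print(ratio_criterion(X))                             # typically prints 3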
Subjects
Algorithms, Computer Simulation, Statistical Models, Single-Cell Analysis, Humans, Single-Cell Analysis/methods, Single-Cell Analysis/statistics & numerical data, Factor Analysis, Nonlinear Dynamics
ABSTRACT
Patients are complex and heterogeneous; clinical data sets are complicated by noise, missing data, and the presence of mixed-type data. Using such data sets requires understanding the high-dimensional "space of patients", composed of all measurements that define all relevant phenotypes. The current state-of-the-art merely defines spatial groupings of patients using cluster analyses. Our goal is to apply topological data analysis (TDA), a new unsupervised technique, to obtain a more complete understanding of patient space. We applied TDA to a space of 266 previously untreated patients with Chronic Lymphocytic Leukemia (CLL), using the "daisy" metric to compute distances between clinical records. We found clear evidence for both loops and voids in the CLL data. To interpret these structures, we developed novel computational and graphical methods. The most persistent loop and the most persistent void can be explained using three dichotomized, prognostically important factors in CLL: IGHV somatic mutation status, beta-2 microglobulin, and Rai stage. In conclusion, patient space turns out to be richer and more complex than current models suggest. TDA could become a powerful tool in a researcher's arsenal for interpreting high-dimensional data by providing novel insights into biological processes and improving our understanding of clinical and biological data sets.
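For readers who want to reproduce the general workflow, the Python sketch below pairs a hand-rolled Gower-style dissimilarity, similar in spirit to what the "daisy" metric computes for mixed records, with persistent homology from the ripser library; it uses assumed toy data and is a minimal illustration, not the authors' pipeline:

import numpy as np
from ripser import ripser  # pip install ripser

def gower_matrix(num, cat):
    # Gower-style dissimilarity: range-scaled absolute differences for
    # numeric columns, simple mismatch for categorical columns, averaged.
    col_range = num.max(axis=0) - num.min(axis=0)
    col_range[col_range == 0] = 1.0
    D = np.zeros((num.shape[0],) * 2)
    for i in range(num.shape[0]):
        d_num = np.abs(num - num[i]) / col_range
        d_cat = (cat != cat[i]).astype(float)
        D[i] = np.hstack([d_num, d_cat]).mean(axis=1)
    return D

rng = np.random.default_rng(1)
num = rng.normal(size=(100, 4))              # toy numeric clinical features
cat = rng.integers(0, 3, size=(100, 2))      # toy categorical features
dgms = ripser(gower_matrix(num, cat), distance_matrix=True, maxdim=2)['dgms']
# dgms[1] holds the persistence of loops (H1); dgms[2] that of voids (H2)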
ABSTRACT
This research extends the partially confirmatory approach to accommodate mixed types of data and missingness in a unified framework that can address a wide range of the confirmatory-exploratory continuum in factor analysis. A mix of Bayesian adaptive and covariance Lasso procedures was developed to estimate model parameters and regularize the loading structure and local dependence simultaneously. Several model variants are offered with different constraints for identification. The less constrained variant can satisfy the sufficient condition required by the more powerful variant, although loading estimates associated with local dependence can be inflated. Parameter recovery was satisfactory, but the information on local dependence was partially lost with categorical data or missingness. A real-life example illustrates how the models can be used to obtain a more discernible loading pattern and to identify items that do not measure what they are supposed to measure. The proposed methodology has been implemented in the R package LAWBL.
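Schematically, and in generic notation that is not taken from the paper, a partially confirmatory model of this kind regularizes a standard factor-analytic structure: $y_i = \Lambda \eta_i + \varepsilon_i$ with $\eta_i \sim N(0, \Phi)$ and $\varepsilon_i \sim N(0, \Psi)$, where the designated loadings are estimated freely, the remaining loadings $\lambda_{jk}$ receive adaptive Lasso (Laplace-type) shrinkage priors, and the off-diagonal elements of $\Psi$, which carry local dependence, are shrunk by a covariance Lasso; the model variants differ in which of these constraints are imposed for identification.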
Subjects
Bayes Theorem, Factor Analysis
ABSTRACT
We aimed to (1) apply cluster analysis techniques to mixed-type data (numerical and categorical) from baseline neuropsychological standard and widely used assessments of patients with acquired brain injury (ABI); (2) apply state-of-the-art cluster validity indexes (CVI) to assess their internal validity; (3) study their external validity considering relevant aspects of ABI rehabilitation such as the functional independence measure (FIM) assessment of activities of daily living; (4) characterize the identified profiles using demographic and clinically relevant variables; and (5) extend the external validation of the obtained clusters to all cognitive rehabilitation tasks executed by the participants in a web-based cognitive rehabilitation platform (GNPT). We analyzed 1,107 patients with ABI, 58.1% traumatic brain injury (TBI), 21.8% stroke and 20.1% other ABIs (e.g., brain tumors, anoxia, infections), who had undergone inpatient GNPT cognitive rehabilitation from September 2008 to January 2021. We applied the k-prototypes algorithm from the clustMixType R package. We optimized seven CVIs and applied bootstrap resampling to assess cluster stability (fpc R package). Post hoc comparisons between clusters were performed using the Wilcoxon rank test, paired t-test or Chi-square test, as appropriate. We identified an optimal three-cluster solution, with strong stability (>0.85) and structure (e.g., Silhouette > 0.60, Gamma > 0.83), characterized by distinctive levels of performance in all neuropsychological tests, demographics, FIM, response to GNPT tasks and test normative data (e.g., the 3 min cut-off in Trail Making Test-B). Cluster 1 was characterized by severe cognitive impairment (N = 254, 22.9%); mean age was 47 years, with 68.5% TBI and 22% stroke patients. Cluster 2 was characterized by mild cognitive impairment (N = 376, 33.9%); mean age was 54 years, with 53.5% stroke and 27% other ABI patients. Cluster 3 showed moderate cognitive impairment (N = 477, 43.2%); mean age was 33 years, with 83% TBI and 14% other ABI patients. Post hoc analysis on cognitive FIM supported significantly higher performance of Cluster 2 vs. Cluster 3 (p < 0.001), Cluster 2 vs. Cluster 1 (p < 0.001) and Cluster 3 vs. Cluster 1 (p < 0.001). All patients executed 286,798 GNPT tasks, with performance significantly higher in Clusters 2 and 3 vs. Cluster 1 (p < 0.001).
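The R workflow described above (clustMixType::kproto for the clustering, fpc for validity and bootstrap stability) has a rough Python analogue in the kmodes package; the minimal sketch below uses assumed toy data and is not the authors' pipeline:

import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

rng = np.random.default_rng(2)
num = rng.normal(size=(300, 3))                                  # e.g. toy test scores
cat = rng.choice(["TBI", "stroke", "other ABI"], size=(300, 1))  # toy etiology labels
X = np.hstack([num.astype(object), cat.astype(object)])

kp = KPrototypes(n_clusters=3, init="Huang", n_init=5, random_state=0)
labels = kp.fit_predict(X, categorical=[3])   # column index 3 is categorical
print(np.bincount(labels))                    # cluster sizes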
ABSTRACT
In modern biomedical research, data often contain a large number of variables of mixed types (continuous, multi-categorical, or binary), but observations are missing for some variables. Imputation is a common solution when the downstream analyses require a complete data matrix. Several imputation methods are available that work under specific distributional assumptions. We propose an improvement over the popular non-parametric nearest neighbor imputation method, which requires no particular assumptions. The proposed method makes practical and effective use of the information on the association among the variables. In particular, we propose a weighted version of the Lq distance for mixed-type data, which uses the information from a subset of important variables only. The performance of the proposed method is investigated using a variety of simulated and real data from different areas of application. The results show that the proposed method yields smaller imputation error and better performance than other approaches. It is also shown that the proposed imputation method works efficiently even when the number of samples is smaller than the number of variables.
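A simplified Python sketch of weighted nearest-neighbour imputation under an Lq distance is given below; it handles numeric variables only and uses uniform weights in the demo, whereas the proposed method also covers categorical variables and derives the weights from the association among variables:

import numpy as np

def weighted_knn_impute(X, weights, q=2, k=5):
    # For each record with missing entries, compute a weighted Lq distance
    # over the variables observed in that record (missing donor entries are
    # simply skipped via nansum), then impute from the k nearest donors.
    X = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        diff = np.abs(X[:, obs] - X[i, obs])
        d = np.nansum(weights[obs] * diff ** q, axis=1) ** (1.0 / q)
        d[i] = np.inf                           # never use the record itself
        donors = np.argsort(d)[:k]
        for j in np.where(miss[i])[0]:
            X[i, j] = np.nanmean(X[donors, j])  # donor average per variable
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
X[rng.random(X.shape) < 0.1] = np.nan           # 10% missing at random (toy)
X_imputed = weighted_knn_impute(X, weights=np.ones(5))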
Subjects
Algorithms, Biomedical Research, Cluster Analysis, Research Design
ABSTRACT
AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying both the fidelity and the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating individual-level distances to the closest record in the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer has indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available open-source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
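As a minimal illustration of the privacy side of such a framework, the Python sketch below computes distance-to-closest-record (DCR) statistics on assumed, already-encoded numeric toy data; the published implementation additionally handles mixed-type encoding and measures fidelity separately via marginal-distribution distances:

import numpy as np

def dcr(synthetic, reference):
    # Distance to closest record: for each synthetic row, the Euclidean
    # distance to its nearest neighbour in the reference set.
    d = np.linalg.norm(synthetic[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(4)
train, holdout = rng.normal(size=(500, 6)), rng.normal(size=(500, 6))
synthetic = rng.normal(size=(500, 6))     # stand-in for a generator's output
# Evidence of generalization: synthetic records should be, on average, no
# closer to the training set than to the holdout set.
print(dcr(synthetic, train).mean(), dcr(synthetic, holdout).mean())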
ABSTRACT
OBJECTIVE: Unsupervised machine learning approaches hold promise for large-scale clinical data. However, the heterogeneity of clinical data raises new methodological challenges in feature selection, in choosing a distance metric that captures biological meaning, and in visualization. We hypothesized that clustering could discover prognostic groups from patients with chronic lymphocytic leukemia, a disease that provides biological validation through well-understood outcomes. METHODS: To address this challenge, we applied k-medoids clustering with 10 distance metrics to 2 experiments ("A" and "B") with mixed clinical features collapsed to binary vectors, and visualized the results with both multidimensional scaling and t-distributed stochastic neighbor embedding. To assess prognostic utility, we performed survival analysis using a Cox proportional hazards model, log-rank test, and Kaplan-Meier curves. RESULTS: In both experiments, survival analysis revealed a statistically significant association between clusters and survival outcomes (A: overall survival, P = .0164; B: time from diagnosis to treatment, P = .0039). Multidimensional scaling separated clusters along a gradient mirroring the order of overall survival. Longer survival was associated with mutated immunoglobulin heavy-chain variable region gene (IGHV) status, absent ZAP-70 expression, female sex, and younger age. CONCLUSIONS: This approach to mixed-type data handling and selection of distance metric captured well-understood, binary, prognostic markers in chronic lymphocytic leukemia (sex, IGHV mutation status, ZAP-70 expression status) with high fidelity.
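A rough Python analogue of this workflow, on assumed toy data (the study itself compared 10 distance metrics and also fitted Cox models and Kaplan-Meier curves), clusters binary clinical vectors with k-medoids and tests the clusters with a log-rank test:

import numpy as np
from sklearn_extra.cluster import KMedoids                   # pip install scikit-learn-extra
from lifelines.statistics import multivariate_logrank_test   # pip install lifelines

rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(266, 12)).astype(float)   # toy binary clinical vectors
D = (X[:, None, :] != X[None, :, :]).mean(axis=2)      # pairwise Hamming distances

labels = KMedoids(n_clusters=3, metric="precomputed", random_state=0).fit_predict(D)

time_months = rng.exponential(scale=60, size=266)      # toy follow-up times
event = rng.integers(0, 2, size=266)                   # 1 = event observed
print(multivariate_logrank_test(time_months, labels, event).p_value)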
Subjects
Immunoglobulin Heavy Chains/genetics, Leukemia, Lymphocytic, Chronic, B-Cell/mortality, Mutation, Unsupervised Machine Learning, ZAP-70 Protein-Tyrosine Kinase/metabolism, Adult, Aged, Aged, 80 and over, Female, Humans, Kaplan-Meier Estimate, Leukemia, Lymphocytic, Chronic, B-Cell/immunology, Leukemia, Lymphocytic, Chronic, B-Cell/metabolism, Male, Middle Aged, Prognosis, Proportional Hazards Models
ABSTRACT
We describe a mixed-effects model for nonnegative continuous cross-sectional data in a two-part modeling framework. A potentially endogenous binary variable is included in the model specification, and association between the outcomes is modeled through a (discrete) latent structure. We show how model parameters can be estimated in a finite mixture context, allowing for skewness, multivariate association between random effects, and endogeneity. The model behavior is investigated through a large-scale simulation experiment. The proposed model is computationally parsimonious and seems to produce acceptable results even if the underlying random effects structure follows a continuous parametric (e.g. Gaussian) distribution. The proposed approach is motivated by the analysis of a sample taken from the Medical Expenditure Panel Survey. The analyzed outcome, ambulatory health expenditure, is a mixture of zeros and continuous values. The effects of socio-demographic characteristics on health expenditure are investigated and, as a by-product of the estimation procedure, two subpopulations (i.e. high and low users) are identified.
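In generic notation (illustrative only, not the paper's exact specification), a two-part model of this kind combines a participation equation and a positive-expenditure equation, e.g. $\mathrm{logit}\,\Pr(Y_i > 0 \mid x_i, b_i) = x_i^\top \beta_1 + \alpha_1 D_i + b_{1i}$ and $\log E(Y_i \mid Y_i > 0, x_i, b_i) = x_i^\top \beta_2 + \alpha_2 D_i + b_{2i}$, where $D_i$ is the potentially endogenous binary variable (itself modeled through a further equation sharing the random effects) and the random effects $(b_{1i}, b_{2i})$ follow a discrete distribution with a finite number of mass points; this finite-mixture structure induces the association between the two parts and yields the high- and low-user subpopulations as a by-product.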