Results 1 - 20 of 1,115
1.
Spectrochim Acta A Mol Biomol Spectrosc ; 324: 124998, 2025 Jan 05.
Article in English | MEDLINE | ID: mdl-39178690

ABSTRACT

Soil potassium is a crucial nutrient for crop growth, and measuring it efficiently is essential for developing rational fertilization plans and optimizing crop growth benefits. At present, data mining based on near-infrared (NIR) spectroscopy analysis has proven to be a powerful tool for real-time monitoring of soil potassium content. However, as technology and instruments improve, the curse of dimensionality becomes more severe. Therefore, it is urgent to develop efficient variable selection methods suitable for NIR spectroscopy analysis. In this study, we proposed a three-step progressive hybrid variable selection strategy that leverages the respective strengths of several high-performance variable selection methods. By sequentially combining synergy interval partial least squares (SiPLS), the random forest variable importance measurement (RF(VIM)), and the improved mean impact value algorithm (IMIV) in a fusion framework, we obtained a method for selecting the variables important for soil potassium, termed SiPLS-RF(VIM)-IMIV. Finally, the optimized variables were fitted into a partial least squares (PLS) model. Experimental results demonstrated that the PLS model embedded with the hybrid strategy effectively improved the prediction performance while reducing the model complexity. The RMSE_T and R_T on the test set were 0.01181% and 0.88246, respectively, better than the RMSE_T and R_T of the full spectrum PLS, SiPLS, and SiPLS-RF(VIM) methods. This study demonstrated that the hybrid strategy, which combines NIR spectroscopy data with the SiPLS-RF(VIM)-IMIV method, could quantitatively analyze soil potassium content levels and potentially solve other issues of data-driven soil dynamic monitoring.

2.
Am Stat ; 78(3): 318-326, 2024.
Article in English | MEDLINE | ID: mdl-39386318

ABSTRACT

Observational studies of treatment effects require adjustment for confounding variables. However, causal inference methods typically cannot deliver perfect adjustment on all measured baseline variables, and there is often ambiguity about which variables should be prioritized. Standard prioritization methods based on treatment imbalance alone neglect variables' relationships with the outcome. We propose the joint variable importance plot to guide variable prioritization for observational studies. Since not all variables are equally relevant to the outcome, the plot adds outcome associations to quantify the potential confounding jointly with the standardized mean difference. To enhance comparisons on the plot between variables with different confounding relationships, we also derive and plot bias curves. Variable prioritization using the plot can produce recommended values for tuning parameters in many existing matching and weighting methods. We showcase the use of the joint variable importance plots in the design of a balance-constrained matched study to evaluate whether taking an antidiabetic medication, glyburide, increases the incidence of C-section delivery among pregnant individuals with gestational diabetes.
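The two quantities that the joint variable importance plot places on its axes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson` and `joint_importance` are hypothetical, and measuring the outcome association among control units only is an assumption made here for the example.

```python
from math import sqrt
from statistics import mean, variance


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


def joint_importance(column, treated, outcome):
    """Return (|standardized mean difference|, |outcome correlation|) for one covariate.

    Assumption: the outcome association is computed among controls only.
    """
    xt = [x for x, t in zip(column, treated) if t]
    xc = [x for x, t in zip(column, treated) if not t]
    yc = [y for y, t in zip(outcome, treated) if not t]
    # Pooled standard deviation across the two treatment groups.
    pooled_sd = sqrt((variance(xt) + variance(xc)) / 2)
    smd = abs(mean(xt) - mean(xc)) / pooled_sd
    return smd, abs(pearson(xc, yc))
```

A covariate that scores high on both values is imbalanced across treatment groups and related to the outcome, so it is a natural candidate for prioritization.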

3.
Biometrics ; 80(4)2024 Oct 03.
Article in English | MEDLINE | ID: mdl-39377518

ABSTRACT

In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates, including sex and dietary intake variables to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.


Subject(s)
Bayes Theorem; Computer Simulation; Gastrointestinal Microbiome; Obesity; Humans; Regression Analysis; Models, Statistical
4.
Sci Total Environ ; 954: 176669, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39362558

ABSTRACT

Recognition of biotic and abiotic factors affecting the biomass of natural mixed forests is of great importance for forest carbon estimation and management. When estimating stand biomass using models, different variable selection methods often yield inconsistent results, and there is a lack of systematic analysis. This study aimed to combine multiple feature selection methods with structural equation modelling (SEM) to identify a set of variables affecting stand biomass more reasonably. Eight methods were applied for feature selection based on data from 286 permanent sample plots in natural coniferous-broadleaved mixed forests in northeast China. These methods included Pearson correlation analysis, two methods derived from principal component analysis (PCA), stepwise regression, redundancy analysis (RDA), generalized additive model (GAM), random forest (RF), and boosted regression tree (BRT). A total of 56 candidate variables were considered, covering stand, biodiversity, climate and soil features. Significant variability was observed in the variables selected; however, 6 variables were consistently identified across all methods: tree species diversity (N_Sp_Div), stand structural diversity (N_Size_Div), nearest taxon index (NRI), community weighted mean based on dry matter mass of leaves (CWM.LDMC), soil pH, and degree-days above 18 °C (DD18). Then, these variables were included in the SEM with stand average age and additive stand density index (aSDI) to explore the direction and magnitude of their impacts on stand biomass. The SEM results showed that aSDI and average age had the greatest positive effects on stand biomass, and structural diversity also had a significant positive effect. DD18 affected stand biomass both directly and indirectly, with a negative total effect. Soil pH indirectly affected stand biomass via aSDI. Our findings demonstrated that combining multiple feature selection methods with SEM was an effective approach for understanding multiple factors affecting stand biomass, and provided valuable insights for forest biomass estimation and carbon management.

5.
Genet Epidemiol ; 2024 Oct 06.
Article in English | MEDLINE | ID: mdl-39370608

ABSTRACT

The main goal of fine-mapping is the identification of relevant genetic variants that have a causal effect on some trait of interest, such as the presence of a disease. From a statistical point of view, fine-mapping can be seen as a variable selection problem. Fine-mapping methods are often challenging to apply because of the presence of linkage disequilibrium (LD), that is, regions of the genome where the variants interrogated have high correlation. Several methods have been proposed to address this issue. Here we explore the 'Sum of Single Effects' (SuSiE) method, applied to real data (summary statistics) from a genome-wide meta-analysis of the autoimmune liver disease primary biliary cholangitis (PBC). Fine-mapping in this data set was previously performed using the FINEMAP program; we compare these previous results with those obtained from SuSiE, which provides an arguably more convenient and principled way of generating 'credible sets', that is, sets of predictors that are correlated with the response variable. This allows us to appropriately acknowledge the uncertainty when selecting the causal effects for the trait. We focus on the results from SuSiE-RSS, which fits the SuSiE model to summary statistics, such as z-scores, along with a correlation matrix. We also compare the SuSiE results to those obtained using a more recently developed method, h2-D2, which uses the same inputs. Overall, we find the results from SuSiE-RSS and, to a lesser extent, h2-D2, to be quite concordant with those previously obtained using FINEMAP. The resulting genes and biological pathways implicated are therefore also similar to those previously obtained, providing valuable confirmation of these previously reported results. Detailed examination of the credible sets identified suggests that, although for the majority of the loci (33 out of 56) the results from SuSiE-RSS seem most plausible, there are some loci (5 out of 56 loci) where the results from h2-D2 seem more compelling. Computer simulations suggest that, overall, SuSiE-RSS generally has slightly higher power, better precision, and better ability to identify the true number of causal variants in a region than h2-D2, although there are some scenarios where the power of h2-D2 is higher. Thus, in real data analysis, the use of complementary approaches such as both SuSiE and h2-D2 is potentially warranted.
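One common way to build a credible set for a single causal effect is to rank variants by posterior probability and keep adding them until a target coverage is reached. The sketch below illustrates that idea only; the function name `credible_set` and its inputs are hypothetical, and this is not the SuSiE, FINEMAP, or h2-D2 code.

```python
def credible_set(probs, coverage=0.95):
    """Smallest set of variant indices whose posterior probabilities sum to >= coverage.

    Assumes `probs` holds, for one causal effect, the posterior probability
    that each variant is the causal one (values summing to about 1).
    """
    # Visit variants from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= coverage:
            break
    return chosen
```

In regions of high LD the probability mass is spread over many correlated variants, so the resulting set is larger, which is how the uncertainty about the causal variant is made visible.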

6.
NeuroRehabilitation ; 55(2): 155-167, 2024.
Article in English | MEDLINE | ID: mdl-39302390

ABSTRACT

BACKGROUND: Hispanics are the largest growing ethnic minority group in the U.S. Despite significant progress in providing norms for this population, updated normative data are essential. OBJECTIVE: To present the methodology for a study generating normative neuropsychological test data for Spanish-speaking adults living in the U.S. using Bayesian inference as a novel approach. METHODS: The sample consisted of 253 healthy adults from eight U.S. regions, with individuals originating from a diverse array of Latin American countries. To participate, individuals must have met the following criteria: were between 18 and 80 years of age, had lived in the U.S. for at least 1 year, self-identified Spanish as their dominant language, had at least one year of formal education, were able to read and write in Spanish at the time of evaluation, scored ≥23 on the Mini-Mental State Examination, <10 on the Patient Health Questionnaire-9, and <10 on the Generalized Anxiety Disorder scale. Participants completed 12 neuropsychological tests. Reliability statistics and norms were calculated for all tests. CONCLUSION: This is the first normative study for Spanish-speaking adults in the U.S. that uses Bayesian linear or generalized linear regression models for generating norms in neuropsychology, implementing sociocultural measures as possible covariates.


Subject(s)
Bayes Theorem; Hispanic or Latino; Neuropsychological Tests; Humans; Adult; Male; Female; Middle Aged; United States; Aged; Neuropsychological Tests/statistics & numerical data; Neuropsychological Tests/standards; Young Adult; Reference Values; Adolescent; Aged, 80 and over; Language; Reproducibility of Results
7.
Entropy (Basel) ; 26(9)2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39330127

ABSTRACT

Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
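The link between the two formulations mentioned in this abstract can be made concrete with the standard LASSO case. This is a textbook illustration rather than a formula taken from the article; the symbols $L(\beta)$ (negative log-likelihood), $\lambda$ (penalty level), and $p$ (number of predictors) are notation assumed here.

```latex
% Frequentist side: unpenalized loss + penalty term
\hat{\beta}_{\text{LASSO}} = \arg\min_{\beta} \Big\{ L(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}

% Bayesian side: likelihood x shrinkage prior (independent Laplace priors)
\pi(\beta \mid y) \;\propto\; \exp\{-L(\beta)\} \times \prod_{j=1}^{p} \frac{\lambda}{2} \exp\{-\lambda |\beta_j|\}

% Taking the negative log of the posterior recovers the penalized objective,
% so the posterior mode coincides with the LASSO estimate:
-\log \pi(\beta \mid y) = L(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| + \text{const}
```

The posterior mode matches the penalized estimate, but the full posterior additionally supplies credible intervals, which is the uncertainty quantification the review focuses on.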

8.
Entropy (Basel) ; 26(9)2024 Sep 19.
Article in English | MEDLINE | ID: mdl-39330134

ABSTRACT

One of the primary issues that arises in statistical modeling pertains to the assessment of the relative importance of each variable in the model. A variety of techniques have been proposed to quantify variable importance for regression models. However, in the context of best subset selection, fewer satisfactory methods are available. With this motivation, we here develop a variable importance measure expressly for this setting. We investigate and illustrate the properties of this measure, introduce algorithms for the efficient computation of its values, and propose a procedure for calculating p-values based on its sampling distributions. We present multiple simulation studies to examine the properties of the proposed methods, along with an application to demonstrate their practical utility.

9.
Spectrochim Acta A Mol Biomol Spectrosc ; 326: 125195, 2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39340947

ABSTRACT

Microplastics, as emerging environmental pollutants, have garnered considerable attention due to their contamination of both the environment and food. Microplastics can infiltrate the human food chain through multiple pathways, potentially posing health risks to humans. Currently, non-destructive testing of microplastics in food is considered challenging. This study aims to investigate the feasibility of employing a portable Raman spectroscopy system for non-destructive detection of microplastic content (polystyrene, PS; polyethylene, PE) in flour. A portable spectrometer was used to collect spectra of flour containing different abundances of microplastics. To enhance the predictive performance of the partial least squares (PLS) model, a mixed variable selection strategy was proposed that combined a wavelength interval selection method (synergy interval partial least squares, siPLS) with wavelength point selection methods (least absolute shrinkage and selection operator, LASSO; multiple feature-spaces ensemble by least absolute shrinkage and selection operator, MFE-LASSO). Four regression models (PLS, siPLS, siPLS-LASSO, siPLS-MFE-LASSO) were developed and compared for detecting PS and PE content in flour. The siPLS-MFE-LASSO model exhibited the best generalization performance in the prediction set (PS: R_P² = 0.9889, RMSEP = 0.0344%; PE: R_P² = 0.9878, RMSEP = 0.0361%). In conclusion, this study has demonstrated the potential of using a portable Raman spectrometer in conjunction with a mixed variable selection algorithm for non-destructive detection of PS and PE content in flour, providing more possibilities for non-destructive detection of microplastic content in food.

10.
Ann Appl Stat ; 18(2): 1360-1377, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39328363

ABSTRACT

Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.
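Sure independence screening, in its basic linear-model form, ranks candidate variables by marginal association with the response and keeps only the top few. The sketch below shows that basic form with Pearson correlation; it is an assumption-laden illustration, since MASH works with survival outcomes and would need a different marginal model. The names `pearson` and `sis` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


def sis(columns, response, keep):
    """Indices of the `keep` columns with the largest absolute marginal correlation."""
    scores = [abs(pearson(col, response)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    # Return the surviving indices in their original column order.
    return sorted(ranked[:keep])
```

Screening reduces thousands of candidate mediators to a manageable set before any joint model is fitted, which is what makes the later estimation step feasible.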

11.
PeerJ ; 12: e18186, 2024.
Article in English | MEDLINE | ID: mdl-39346075

ABSTRACT

Purpose: Timely and accurate monitoring of soil salinity content (SSC) is essential for precise irrigation management of large-scale farmland. Uncrewed aerial vehicle (UAV) low-altitude remote sensing with high spatial and temporal resolution provides a scientific and effective technical means for SSC monitoring. Many existing soil salinity inversion models have been tested with only a single variable selection method or machine learning algorithm, and the influence of combining variable selection methods with machine learning algorithms on the accuracy of soil salinity inversion remains to be studied further. Methods: First, based on UAV multispectral remote sensing data, the spectral reflectance of each sampling point was extracted to construct 30 spectral indexes, and the Pearson correlation coefficient (PCC), gray relational analysis (GRA), variable projection importance (VIP), and support vector machine-recursive feature elimination (SVM-RFE) were used to screen the spectral indexes and select sensitive variables. Subsequently, with the screened and unscreened variables as model inputs, 20 soil salinity inversion models were constructed based on the support vector machine regression (SVM), back propagation neural network (BPNN), extreme learning machine (ELM), and random forest (RF) machine learning algorithms, in order to explore the feasibility of different variable selection methods combined with machine learning algorithms in SSC inversion of crop-covered farmland. The determination coefficient (R²), root mean square error (RMSE) and performance deviation ratio (RPD) were used to evaluate model performance, and the best variable selection method and soil salinity inversion model were determined, taking alfalfa-covered farmland in arid oasis irrigation areas of China as the research object.
Results: Variable selection combined with machine learning algorithms can significantly improve the accuracy of remote sensing inversion of soil salinity. The performance of the models improved markedly with all four variable selection methods, although their applicability varied: the GRA variable selection method is suitable for SVM, BPNN, and ELM modeling, while the PCC method is suitable for RF modeling. The GRA-SVM is the best soil salinity inversion model in alfalfa-covered farmland, with Rv² of 0.8888, RMSEv of 0.1780, and RPD of 1.8115 based on the model verification dataset, and the spatial distribution map of soil salinity reflects the degree of soil salinization in the study area. Conclusion: Based on our findings, variable selection combined with machine learning algorithms is an effective method to improve the accuracy of soil salinity remote sensing inversion, which provides a new approach for timely and accurate acquisition of soil salinity information for crop-covered farmland.
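The three evaluation metrics named in this abstract can be computed as in the sketch below. The definitions used are common conventions and are assumptions here, since the abstract does not state its formulas: R² as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the observed values divided by RMSE. The function name `regression_metrics` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def regression_metrics(observed, predicted):
    """Return R^2, RMSE and RPD for paired observed and predicted values."""
    obs_mean = mean(observed)
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total_ss = sum((o - obs_mean) ** 2 for o in observed)
    rmse = sqrt(residual_ss / len(observed))
    return {
        "r2": 1 - residual_ss / total_ss,
        "rmse": rmse,
        # Assumed convention: RPD = SD of observations / RMSE.
        "rpd": stdev(observed) / rmse,
    }
```

Some studies instead report R² as the squared correlation between observed and predicted values, so reported figures should be compared only under matching definitions.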


Subject(s)
Machine Learning; Medicago sativa; Salinity; Soil; Support Vector Machine; Soil/chemistry; Medicago sativa/growth & development; Algorithms; Remote Sensing Technology/methods; Environmental Monitoring/methods; China; Farms; Neural Networks, Computer
13.
Can J Stat ; 52(3): 900-923, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39319323

ABSTRACT

When analyzing data combined from multiple sources (e.g., hospitals, studies), the heterogeneity across different sources must be accounted for. In this paper, we consider high-dimensional linear regression models for integrative data analysis. We propose a new adaptive clustering penalty (ACP) method to simultaneously select variables and cluster source-specific regression coefficients with sub-homogeneity. We show that the estimator based on the ACP method enjoys a strong oracle property under certain regularity conditions. We also develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) for parameter estimation. We conduct simulation studies to compare the performance of the proposed method to three existing methods (a fused LASSO with adjacent fusion, a pairwise fused LASSO, and a multi-directional shrinkage penalty method). Finally, we apply the proposed method to the multi-center Childhood Adenotonsillectomy Trial to identify sub-homogeneity in the treatment effects across different study sites.



14.
Bioinform Biol Insights ; 18: 11779322241271535, 2024.
Article in English | MEDLINE | ID: mdl-39286768

ABSTRACT

Tumor heterogeneity is a challenge to designing effective and targeted therapies. Glioma-type identification depends on specific molecular and histological features, which are defined by the official World Health Organization (WHO) classification of the central nervous system (CNS). These guidelines are constantly updated to support the diagnosis process, which affects all the successive clinical decisions. In this context, the search for new potential diagnostic and prognostic targets, characteristic of each glioma type, is crucial to support the development of novel therapies. Based on The Cancer Genome Atlas (TCGA) glioma RNA-sequencing data set updated according to the 2016 and 2021 WHO guidelines, we proposed a 2-step variable selection approach for biomarker discovery. Our framework uses the graphical lasso algorithm to estimate sparse networks of genes carrying diagnostic information. These networks are then used as input for a regularized Cox survival regression model, allowing the identification of a smaller subset of genes with prognostic value. In each step, the results derived from the 2016 and 2021 classes were discussed and compared. For both WHO glioma classifications, our analysis identifies potential biomarkers, characteristic of each glioma type. Yet, better results were obtained for the WHO CNS classification in 2021, thereby supporting recent efforts to include molecular data in glioma classification.

15.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39282732

ABSTRACT

We develop a methodology for valid inference after variable selection in logistic regression when the responses are partially observed, that is, when one observes a set of error-prone testing outcomes instead of the true values of the responses. Aiming at selecting important covariates while accounting for missing information in the response data, we apply the expectation-maximization algorithm to compute maximum likelihood estimators subject to LASSO penalization. Subsequent to variable selection, we make inferences on the selected covariate effects by extending post-selection inference methodology based on the polyhedral lemma. Empirical evidence from our extensive simulation study suggests that our post-selection inference results are more reliable than those from naive inference methods that use the same data to perform variable selection and inference without adjusting for variable selection.


Subject(s)
Algorithms; Computer Simulation; Likelihood Functions; Humans; Logistic Models; Data Interpretation, Statistical; Biometry/methods; Models, Statistical
16.
Stat Med ; 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39260448

ABSTRACT

Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the nonrobust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Assoc, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has a computational advantage to locate the posterior modes via soft-thresholding rule guided Expectation-Maximization (EM) steps in the coordinate descent framework, a phenomenon rarely observed for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinomas (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
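The connection between quantile regression and the asymmetric Laplace likelihood that this abstract relies on can be written out with standard formulas. These are textbook definitions, not expressions quoted from the article; the symbols $\tau$ (quantile level), $\mu$ (location), $\sigma$ (scale), and $\lambda$ (threshold) are notation assumed here.

```latex
% Check (quantile) loss at level \tau \in (0, 1)
\rho_\tau(u) = u \,\{\tau - I(u < 0)\}

% Asymmetric Laplace density with location \mu and scale \sigma
f(y \mid \mu, \sigma, \tau) = \frac{\tau (1 - \tau)}{\sigma}
    \exp\!\Big\{ -\rho_\tau\!\Big( \frac{y - \mu}{\sigma} \Big) \Big\}

% Hence, for fixed \sigma, maximizing the ALD likelihood in \mu
% is the same as minimizing the summed check loss:
\arg\max_{\mu} \prod_{i=1}^{n} f(y_i \mid \mu, \sigma, \tau)
    = \arg\min_{\mu} \sum_{i=1}^{n} \rho_\tau(y_i - \mu)

% Soft-thresholding operator used in coordinate-wise LASSO updates
S(z, \lambda) = \operatorname{sign}(z) \, \max(|z| - \lambda, 0)
```

Because the check loss grows only linearly in the residual, large outliers have less influence than under squared error, which is the source of the robustness to heavy-tailed errors.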

17.
Anal Bioanal Chem ; 416(24): 5351-5364, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39096358

ABSTRACT

In this study, a new approach for the selection of informative standardization samples from the original calibration set for the transfer of a calibration model between NIR instruments is proposed and evaluated. First, a calibration model is developed, after variable selection by the Final Complexity Adapted Models (FCAM) method, using the significance of the PLS regression coefficients (FCAM-SIG) as selection criterion. Then, the resulting model is used for the selection of the best fitting subset of calibration samples with optimally predictive ability, called the optimally predictive calibration subset (OPCS). Next, the standardization samples are selected from the OPCS. The spectra on the slave instruments are transferred to corresponding spectra on the master instrument by the widely used Piecewise Direct Standardization (PDS) method. Thereafter, for the test set on the slave instrument, a 3D response surface plot is drawn for the root mean squared error of prediction (RMSEP) as a function of the number of OPCS samples and window sizes used for the PDS method. Finally, the smallest set of calibration samples, in combination with the optimal window size, providing the optimal RMSEP, is selected as standardization set. The proposed OPCS approach for the selection of standardization samples is tested on two real-life NIR data sets providing 13 X-y combinations to model. The results show that the obtained numbers of OPCS-based standardization samples are statistically significantly lower than those obtained with the widely used representative sample selection method of Kennard and Stone, while the predictive performances are similar.
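The Kennard and Stone selection that serves as the baseline in this abstract is usually described as a max-min distance procedure, sketched below. This is a generic illustration with Euclidean distance, not the code used in the study; the function name `kennard_stone` is hypothetical, and it assumes at least two samples and `2 <= k <= len(samples)`.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def kennard_stone(samples, k):
    """Return indices of k samples spread evenly over the data space."""
    n = len(samples)
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Add the sample whose nearest already-selected neighbour is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(dist(samples[i], samples[s]) for s in selected),
        )
        selected.append(nxt)
    return selected
```

The procedure looks only at the spectra and not at model predictions, which is the main contrast with the OPCS approach that picks samples by how well a fitted model predicts them.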

18.
Foods ; 13(15)2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39123554

ABSTRACT

Chlorpyrifos is one of the most widely used broad-spectrum insecticides in agriculture. Given its potential toxicity and residue in food (e.g., tea), establishing a rapid and reliable method for the determination of chlorpyrifos residue is crucial. In this study, a strategy combining surface-enhanced Raman spectroscopy (SERS) and intelligent variable selection models for detecting chlorpyrifos residue in tea was established. First, gold nanostars were fabricated as a SERS sensor for measuring the SERS spectra. Second, the raw SERS spectra were preprocessed to facilitate the quantitative analysis. Third, a partial least squares model and four outstanding intelligent variable selection models, Monte Carlo-based uninformative variable elimination, competitive adaptive reweighted sampling, iteratively retaining informative variables, and variable iterative space shrinkage approach, were developed for detecting chlorpyrifos residue in a comparative study. The repeatability and reproducibility tests demonstrated the excellent stability of the proposed strategy. Furthermore, the sensitivity of the proposed strategy was assessed by estimating limit of detection values of the various models. Finally, two-tailed paired t-tests confirmed that the accuracy of the proposed strategy was equivalent to that of gas chromatography-mass spectrometry. Hence, the proposed method provides a promising strategy for detecting chlorpyrifos residue in tea.

19.
J Am Stat Assoc ; 119(546): 1322-1335, 2024.
Article in English | MEDLINE | ID: mdl-39184838

ABSTRACT

We consider a class of network models, in which the connection probability depends on ultrahigh-dimensional nodal covariates (homophily) and node-specific popularity (degree heterogeneity). A Bayesian method is proposed to select nodal features in both dense and sparse networks under a mild assumption on popularity parameters. The proposed approach is implemented via Gibbs sampling. To alleviate the computational burden for large sparse networks, we further develop a working model in which parameters are updated based on a dense sub-graph at each step. Model selection consistency is established for both models, in the sense that the probability of the true model being selected converges to one asymptotically, even when the dimension grows with the network size at an exponential rate. The performance of the proposed models and estimation procedures is illustrated through Monte Carlo studies and three real-world examples.

20.
J Am Stat Assoc ; 119(545): 66-80, 2024.
Article in English | MEDLINE | ID: mdl-39132605

ABSTRACT

Neural demyelination and brain damage accumulated in white matter appear as hyperintense areas on T2-weighted MRI scans in the form of lesions. Modeling binary images at the population level, where each voxel represents the existence of a lesion, plays an important role in understanding aging and inflammatory diseases. We propose a scalable hierarchical Bayesian spatial model, called BLESS, capable of handling binary responses by placing continuous spike-and-slab mixture priors on spatially-varying parameters and enforcing spatial dependency on the parameter dictating the amount of sparsity within the probability of inclusion. The use of mean-field variational inference with dynamic posterior exploration, which is an annealing-like strategy that improves optimization, allows our method to scale to large sample sizes. Our method also accounts for underestimation of posterior variance due to variational inference by providing an approximate posterior sampling approach based on Bayesian bootstrap ideas and spike-and-slab priors with random shrinkage targets. Besides accurate uncertainty quantification, this approach is capable of producing novel cluster size based imaging statistics, such as credible intervals of cluster size, and measures of reliability of cluster occurrence. Lastly, we validate our results via simulation studies and an application to the UK Biobank, a large-scale lesion mapping study with a sample size of 40,000 subjects.
