Search | VHL Regional Portal

1.

Application of zero-inflated Poisson model with heterogeneous random effects to evaluate the effect of oral health education on pregnant women's dental caries: A longitudinal experimental study.

Ahmadi Gooraji, Somayeh; Zayeri, Farid; Sharifnejad, Yeganeh; Ghorbani, Zahra; Deghatipour, Marzie; Meymeh, Maryam Heydarpour; Baghban, Alireza Akbarzadeh.

Dent Res J (Isfahan) ; 21: 26, 2024.

Article in English | MEDLINE | ID: mdl-39188390

ABSTRACT

Background: Pregnant women have poor knowledge of oral hygiene during pregnancy. One problem with the follow-up of dental caries in this group is zero accumulation in the decayed, missing, and filled teeth (DMFT) index, for which some models must be used to achieve valid results. The studied population may be heterogeneous in longitudinal studies, leading to biased estimates. We aimed to assess the impact of oral health education on dental caries in pregnant women using a suitable model in a longitudinal experimental study with heterogeneous random effects. Materials and Methods: This longitudinal, experimental research was carried out on pregnant women who visited medical centers in Tehran. The educational group (236 cases) received education for three sessions. The control group (200 cases) received only standard training. The DMFT index assessed oral and dental health at baseline, 6 months, and 24 months after delivery. The Chi-square test was used for comparing nominal variables and the Mann-Whitney U test for ordinal variables. The zero-inflated Poisson (ZIP) model was applied under heterogeneous and homogeneous random effects using R 4.2.1, SPSS 26, and SAS 9.4. The level of significance was set at 0.05. Results: Data from 436 women aged 15 years and older were analyzed. Zero accumulation in the DMFT was mainly related to the filled teeth (51%). The heterogeneous ZIP model fitted better to the data. On average, the intervention group exhibited a higher rate of change in filled teeth over time than the control group (P = 0.021). Conclusion: The proposed ZIP model is a suitable model for predicting filled teeth in pregnant women. An educational intervention during pregnancy can improve oral health in the long-term follow-up.

2.

ZIBGLMM: Zero-Inflated Bivariate Generalized Linear Mixed Model for Meta-Analysis with Double-Zero-Event Studies.

Li, Lu; Lin, Lifeng; Cappelleri, Joseph C; Chu, Haitao; Chen, Yong.

medRxiv ; 2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39108504

ABSTRACT

Double-zero-event studies (DZS) pose a challenge for accurately estimating the overall treatment effect in meta-analysis. Current approaches, such as continuity correction or omission of DZS, are commonly employed, yet these ad hoc methods can yield biased conclusions. Although the standard bivariate generalized linear mixed model can accommodate DZS, it fails to address the potential systemic differences between DZS and other studies. In this paper, we propose a zero-inflated bivariate generalized linear mixed model (ZIBGLMM) to tackle this issue. This two-component finite mixture model includes zero-inflation for a subpopulation with negligible or extremely low risk. We develop both frequentist and Bayesian versions of ZIBGLMM and examine its performance in estimating risk ratios (RRs) against the bivariate generalized linear mixed model and conventional two-stage meta-analysis that excludes DZS. Through extensive simulation studies and real-world meta-analysis case studies, we demonstrate that ZIBGLMM outperforms the bivariate generalized linear mixed model and conventional two-stage meta-analysis that excludes DZS in estimating the true effect size with substantially less bias and comparable coverage probability.

3.

Item Response Modeling of Clinical Instruments With Filter Questions: Disentangling Symptom Presence and Severity.

Magnus, Brooke E.

Appl Psychol Meas ; 48(6): 235-256, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39166184

ABSTRACT

Clinical instruments that use a filter/follow-up response format often produce data with excess zeros, especially when administered to nonclinical samples. When the unidimensional graded response model (GRM) is then fit to these data, parameter estimates and scale scores tend to suggest that the instrument measures individual differences only among individuals with severe levels of the psychopathology. In such scenarios, alternative item response models that explicitly account for excess zeros may be more appropriate. The multivariate hurdle graded response model (MH-GRM), which has been previously proposed for handling zero-inflated questionnaire data, includes two latent variables: susceptibility, which underlies responses to the filter question, and severity, which underlies responses to the follow-up question. Using both simulated and empirical data, the current research shows that compared to unidimensional GRMs, the MH-GRM is better able to capture individual differences across a wider range of psychopathology, and that when unidimensional GRMs are fit to data from questionnaires that include filter questions, individual differences at the lower end of the severity continuum largely go unmeasured. Practical implications are discussed.

4.

Model selection of GLMMs in the analysis of count data in single-case studies: A Monte Carlo simulation.

Li, Haoran.

Behav Res Methods ; 2024 Jul 10.

Article in English | MEDLINE | ID: mdl-38987450

ABSTRACT

Generalized linear mixed models (GLMMs) have great potential to deal with count data in single-case experimental designs (SCEDs). However, applied researchers have faced challenges in making various statistical decisions when using such advanced statistical techniques in their own research. This study focused on a critical issue by investigating the selection of an appropriate distribution to handle different types of count data in SCEDs due to overdispersion and/or zero-inflation. To achieve this, I proposed two model selection frameworks, one based on calculating information criteria (AIC and BIC) and another based on utilizing a multistage-model selection procedure. Four data scenarios were simulated including Poisson, negative binominal (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB). The same set of models (i.e., Poisson, NB, ZIP, and ZINB) were fitted for each scenario. In the simulation, I evaluated 10 model selection strategies within the two frameworks by assessing the model selection bias and its consequences on the accuracy of the treatment effect estimates and inferential statistics. Based on the simulation results and previous work, I provide recommendations regarding which model selection methods should be adopted in different scenarios. The implications, limitations, and future research directions are also discussed.

5.

A Marginalized Zero-Inflated Negative Binomial Model for Spatial Data: Modeling COVID-19 Deaths in Georgia.

Mutiso, Fedelis; Pearce, John L; Benjamin-Neelon, Sara E; Mueller, Noel T; Li, Hong; Neelon, Brian.

Biom J ; 66(5): e202300182, 2024 Jul.

Article in English | MEDLINE | ID: mdl-39001709

ABSTRACT

Spatial count data with an abundance of zeros arise commonly in disease mapping studies. Typically, these data are analyzed using zero-inflated models, which comprise a mixture of a point mass at zero and an ordinary count distribution, such as the Poisson or negative binomial. However, due to their mixture representation, conventional zero-inflated models are challenging to explain in practice because the parameter estimates have conditional latent-class interpretations. As an alternative, several authors have proposed marginalized zero-inflated models that simultaneously model the excess zeros and the marginal mean, leading to a parameterization that more closely aligns with ordinary count models. Motivated by a study examining predictors of COVID-19 death rates, we develop a spatiotemporal marginalized zero-inflated negative binomial model that directly models the marginal mean, thus extending marginalized zero-inflated models to the spatial setting. To capture the spatiotemporal heterogeneity in the data, we introduce region-level covariates, smooth temporal effects, and spatially correlated random effects to model both the excess zeros and the marginal mean. For estimation, we adopt a Bayesian approach that combines full-conditional Gibbs sampling and Metropolis-Hastings steps. We investigate features of the model and use the model to identify key predictors of COVID-19 deaths in the US state of Georgia during the 2021 calendar year.

Subject(s)

Bayes Theorem , Biometry , COVID-19 , Models, Statistical , Humans , COVID-19/mortality , COVID-19/epidemiology , Georgia/epidemiology , Biometry/methods , Spatial Analysis , Binomial Distribution

6.

Bayesian semi-parametric inference for clustered recurrent events with zero inflation and a terminal event.

Tian, Xinyuan; Ciarleglio, Maria; Cai, Jiachen; Greene, Erich J; Esserman, Denise; Li, Fan; Zhao, Yize.

J R Stat Soc Ser C Appl Stat ; 73(3): 598-620, 2024 Jun.

Article in English | MEDLINE | ID: mdl-39072299

ABSTRACT

Recurrent events are common in clinical studies and are often subject to terminal events. In pragmatic trials, participants are often nested in clinics and can be susceptible or structurally unsusceptible to the recurrent events. We develop a Bayesian shared random effects model to accommodate this complex data structure. To achieve robustness, we consider the Dirichlet processes to model the residual of the accelerated failure time model for the survival process as well as the cluster-specific shared frailty distribution, along with an efficient sampling algorithm for posterior inference. Our method is applied to a recent cluster randomized trial on fall injury prevention.

7.

Analysis of Microbiome Data.

Peterson, Christine B; Saha, Satabdi; Do, Kim-Anh.

Annu Rev Stat Appl ; 11(1): 483-504, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38962089

ABSTRACT

The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.

8.

A Lindley-binomial model for analyzing the proportions with sparseness and excessive zeros.

Deng, Dianliang; Zhang, Xiaoqing.

J Appl Stat ; 51(9): 1792-1817, 2024.

Article in English | MEDLINE | ID: mdl-38933142

ABSTRACT

Proportional data arise frequently in a wide variety of fields of study. Such data often exhibit extra variation such as over/under dispersion, sparseness and zero inflation. For example, the hepatitis data present both sparseness and zero inflation with 19 contributing non-zero denominators of 5 or less and with 36 having zero seropositive out of 83 annual age groups. The whitefly data consists of 640 observations with 339 zeros (53%), which demonstrates extra zero inflation. The catheter management data involve excessive zeros with over 60% zeros averagely for outcomes of 193 urinary tract infections, 194 outcomes of catheter blockages and 193 outcomes of catheter displacements. However, the existing models cannot always address such features appropriately. In this paper, a new two-parameter probability distribution called Lindley-binomial (LB) distribution is proposed to analyze the proportional data with such features. The probabilistic properties of the distribution such as moment, moment generating function are derived. The Fisher scoring algorithm and EM algorithm are presented for the computation of estimates of parameters in the proposed LB regression model. The issues on goodness of fit for the LB model are discussed. A limited simulation study is also performed to evaluate the performance of derived EM algorithms for the estimation of parameters in the model with/without covariates. The proposed model is illustrated through three aforementioned proportional datasets.

9.

A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data.

Chi, Jinling; Ye, Jimin; Zhou, Ying.

Front Microbiol ; 15: 1394204, 2024.

Article in English | MEDLINE | ID: mdl-38873138

ABSTRACT

Motivation: High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results: We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.

10.

N-mixture models for population estimation: Application in spotted lanternfly egg mass survey.

Liu, Houping; Julian, James T.

Curr Res Insect Sci ; 5: 100078, 2024.

Article in English | MEDLINE | ID: mdl-38576775

ABSTRACT

Population density and structure are critical to nature conservation and pest management. Traditional sampling methods such as capture-mark-recapture and catch-effort can't be used in situations where catching, marking, or removing individuals are not feasible. N-mixture models use repeated count data to estimate population abundance based on detection probability. They are widely adopted in wildlife surveys in recent years to account for imperfect detection. However, its application in entomology is relatively new. In this paper, we describe the general procedures of N-mixture models in population studies from data collection to model fitting and evaluation. Using Lycorma delicatula egg mass survey data at 28 plots in seven sites from the field, we found that detection probability (p) was negatively correlated with tree diameter at breast height (DBH), ranged from 0.516 [95 % CI: 0.470-0.561] to 0.614 [95 % CI: 0.566-0.660] between the 1st and the 3rd sample period. Furthermore, egg mass abundance (λ) was positively associated with basal area (BA) for the sample unit (single tree), with more egg masses on tree of heaven (TOH) trees. More egg masses were also expected on trees of other species in TOH plots. Predicted egg mass density (masses/100 m2) ranged from 5.0 (95 % CI: 3.0-16.0) (Gordon) to 276.9 (95 % CI: 255.0-303.0) (Susquehannock) for TOH plots, and 11.0 (95 % CI: 9.00-15.33) (Gordon) to 228.3 (95 % CI: 209.7-248.3) (Burlington) for nonTOH plots. Site-specific abundance estimates from N-mixture models were generally higher compared to observed maximum counts. N-mixture models could have great potential in insect population surveys in agriculture and forestry in the future.

11.

Diagnostics for regression models with semicontinuous outcomes.

Yang, Lu.

Biometrics ; 80(1)2024 Jan 29.

Article in English | MEDLINE | ID: mdl-38470256

ABSTRACT

Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.

Subject(s)

Health Expenditures

12.

Multilevel modeling in single-case studies with zero-inflated and overdispersed count data.

Li, Haoran; Luo, Wen; Baek, Eunkyeng.

Behav Res Methods ; 56(4): 2765-2781, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38383801

ABSTRACT

Count outcomes are frequently encountered in single-case experimental designs (SCEDs). Generalized linear mixed models (GLMMs) have shown promise in handling overdispersed count data. However, the presence of excessive zeros in the baseline phase of SCEDs introduces a more complex issue known as zero-inflation, often overlooked by researchers. This study aimed to deal with zero-inflated and overdispersed count data within a multiple-baseline design (MBD) in single-case studies. It examined the performance of various GLMMs (Poisson, negative binomial [NB], zero-inflated Poisson [ZIP], and zero-inflated negative binomial [ZINB] models) in estimating treatment effects and generating inferential statistics. Additionally, a real example was used to demonstrate the analysis of zero-inflated and overdispersed count data. The simulation results indicated that the ZINB model provided accurate estimates for treatment effects, while the other three models yielded biased estimates. The inferential statistics obtained from the ZINB model were reliable when the baseline rate was low. However, when the data were overdispersed but not zero-inflated, both the ZINB and ZIP models exhibited poor performance in accurately estimating treatment effects. These findings contribute to our understanding of using GLMMs to handle zero-inflated and overdispersed count data in SCEDs. The implications, limitations, and future research directions are also discussed.

Subject(s)

Single-Case Studies as Topic , Humans , Linear Models , Multilevel Analysis/methods , Data Interpretation, Statistical , Models, Statistical , Poisson Distribution , Computer Simulation , Research Design

13.

Joint modeling the frequency and duration of accelerometer-measured physical activity from a lifestyle intervention trial.

Siddique, Juned; Daniels, Michael J; Inan, Gül; Battalio, Samuel; Spring, Bonnie; Hedeker, Donald.

Stat Med ; 42(28): 5100-5112, 2023 12 10.

Article in English | MEDLINE | ID: mdl-37715594

ABSTRACT

Physical activity (PA) guidelines recommend that PA be accumulated in bouts of 10 minutes or more in duration. Recently, researchers have sought to better understand how participants in PA interventions increase their activity. Participants can increase their daily PA by increasing the number of PA bouts per day while keeping the duration of the bouts constant; they can keep the number of bouts constant but increase the duration of each bout; or participants can increase both the number of bouts and their duration. We propose a novel joint modeling framework for modeling PA bouts and their duration over time. Our joint model is comprised of two sub-models: a mixed-effects Poisson hurdle sub-model for the number of bouts per day and a mixed-effects location scale gamma regression sub-model to characterize the duration of the bouts and their variance. The model allows us to estimate how daily PA bouts and their duration vary together over the course of an intervention and by treatment condition and is specifically designed to capture the unique distributional features of bouted PA as measured by accelerometer: frequent measurements, zero-inflated bouts, and skewed bout durations. We apply our methods to the Make Better Choices study, a longitudinal lifestyle intervention trial to increase PA. We perform a simulation study to evaluate how well our model is able to estimate relationships between outcomes.

Subject(s)

Exercise , Life Style , Humans , Accelerometry/methods , Time Factors , Clinical Trials as Topic

14.

A flexible quasi-likelihood model for microbiome abundance count data.

Shi, Yiming; Li, Huilin; Wang, Chan; Chen, Jun; Jiang, Hongmei; Shih, Ya-Chen T; Zhang, Haixiang; Song, Yizhe; Feng, Yang; Liu, Lei.

Stat Med ; 42(25): 4632-4643, 2023 11 10.

Article in English | MEDLINE | ID: mdl-37607718

ABSTRACT

In this article, we present a flexible model for microbiome count data. We consider a quasi-likelihood framework, in which we do not make any assumptions on the distribution of the microbiome count except that its variance is an unknown but smooth function of the mean. By comparing our model to the negative binomial generalized linear model (GLM) and Poisson GLM in simulation studies, we show that our flexible quasi-likelihood method yields valid inferential results. Using a real microbiome study, we demonstrate the utility of our method by examining the relationship between adenomas and microbiota. We also provide an R package "fql" for the application of our method.

Subject(s)

Microbiota , Models, Statistical , Humans , Likelihood Functions , Computer Simulation , Poisson Distribution

15.

Improving performance of hurdle models using rare-event weighted logistic regression: an application to maternal mortality data.

Awuor Okello, Sharon; Otieno Omondi, Evans; Odhiambo, Collins O.

R Soc Open Sci ; 10(8): 221226, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37621657

ABSTRACT

In this paper, performance of hurdle models in rare events data is improved by modifying their binary component. The rare-event weighted logistic regression model is adopted in place of logistic regression to deal with class imbalance due to rare events. Poisson Hurdle Rare Event Weighted Logistic Regression (REWLR) and Negative Binomial Hurdle (NBH) REWLR are developed as two-part models which use the REWLR model to estimate the probability of a positive count and a Poisson or NB zero-truncated count model to estimate non-zero counts. This research aimed to develop and assess the performance of the Poisson and Negative Binomial (NB) Hurdle Rare Event Weighted Logistic Regression (REWLR) models, applied to simulated data with various degrees of zero inflation and to Nairobi county's maternal mortality data. The study data on maternal mortality were pulled from JPHES. The data contain the number of maternal deaths, which is the outcome variable, and other obstetric and demographic factors recorded in MNCH facilities in Nairobi between October 2021 and January 2022. The models were also fit and evaluated based on simulated data with varying degrees of zero inflation. The obtained results are numerically validated and then discussed from both the mathematical and the maternal mortality perspective. Numerical simulations are also presented to give a more complete representation of the model dynamics. Results obtained suggest that NB Hurdle REWLR is the best performing model for zero inflated count data due to rare events.

16.

A zero-inflated endemic-epidemic model with an application to measles time series in Germany.

Lu, Junyi; Meyer, Sebastian.

Biom J ; 65(8): e2100408, 2023 12.

Article in English | MEDLINE | ID: mdl-37439440

ABSTRACT

Count data with an excess of zeros are often encountered when modeling infectious disease occurrence. The degree of zero inflation can vary over time due to nonepidemic periods as well as by age group or region. A well-established approach to analyze multivariate incidence time series is the endemic-epidemic modeling framework, also known as the HHH approach. However, it assumes Poisson or negative binomial distributions and is thus not tailored to surveillance data with excess zeros. Here, we propose a multivariate zero-inflated endemic-epidemic model with random effects that extends HHH. Parameters of both the zero-inflation probability and the HHH part of this mixture model can be estimated jointly and efficiently via (penalized) maximum likelihood inference using analytical derivatives. We found proper convergence and good coverage of confidence intervals in simulation studies. An application to measles counts in the 16 German states, 2005-2018, showed that zero inflation is more pronounced in the Eastern states characterized by a higher vaccination coverage. Probabilistic forecasts of measles cases improved when accounting for zero inflation. We anticipate zero-inflated HHH models to be a useful extension also for other applications and provide an implementation in an R package.

Subject(s)

Measles , Models, Statistical , Humans , Time Factors , Computer Simulation , Measles/epidemiology , Measles/prevention & control , Germany/epidemiology , Poisson Distribution

17.

Disease mapping for spatially semi-continuous data by estimating equations with application to dengue control.

Lin, Pei-Sheng; Yu, Yih-Jeng; Zhu, Jun.

Stat Med ; 42(20): 3636-3648, 2023 09 10.

Article in English | MEDLINE | ID: mdl-37316997

ABSTRACT

Disease mapping is a research field to estimate spatial pattern of disease risks so that areas with elevated risk levels can be identified. The motivation of this article is from a study of dengue fever infection, which causes seasonal epidemics in almost every summer in Taiwan. For analysis of zero-inflated data with spatial correlation and covariates, current methods would either cause a computational burden or miss associations between zero and non-zero responses. In this article, we develop estimating equations for a mixture regression model that accommodates spatial dependence and zero inflation for study of disease propagation. Asymptotic properties for the proposed estimates are established. A simulation study is conducted to evaluate performance of the mixture estimating equations; and a dengue dataset from southern Taiwan is used to illustrate the proposed method.

Subject(s)

Dengue , Epidemics , Humans , Computer Simulation , Spatial Analysis , Taiwan/epidemiology , Dengue/epidemiology , Dengue/prevention & control , Models, Statistical

18.

Quantification and statistical modeling of droplet-based single-nucleus RNA-sequencing data.

Kuo, Albert; Hansen, Kasper D; Hicks, Stephanie C.

Biostatistics ; 2023 May 31.

Article in English | MEDLINE | ID: mdl-37257175

ABSTRACT

In complex tissues containing cells that are difficult to dissociate, single-nucleus RNA-sequencing (snRNA-seq) has become the preferred experimental technology over single-cell RNA-sequencing (scRNA-seq) to measure gene expression. To accurately model these data in downstream analyses, previous work has shown that droplet-based scRNA-seq data are not zero-inflated, but whether droplet-based snRNA-seq data follow the same probability distributions has not been systematically evaluated. Using pseudonegative control data from nuclei in mouse cortex sequenced with the 10x Genomics Chromium system and mouse kidney sequenced with the DropSeq system, we found that droplet-based snRNA-seq data follow a negative binomial distribution, suggesting that parametric statistical models applied to scRNA-seq are transferable to snRNA-seq. Furthermore, we found that the quantification choices in adapting quantification mapping strategies from scRNA-seq to snRNA-seq can play a significant role in downstream analyses and biological interpretation. In particular, reference transcriptomes that do not include intronic regions result in significantly smaller library sizes and incongruous cell type classifications. We also confirmed the presence of a gene length bias in snRNA-seq data, which we show is present in both exonic and intronic reads, and investigate potential causes for the bias.

19.

A Bayesian zero-inflated Dirichlet-multinomial regression model for multivariate compositional count data.

Koslovsky, Matthew D.

Biometrics ; 79(4): 3239-3251, 2023 12.

Article in English | MEDLINE | ID: mdl-36896642

ABSTRACT

The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.

Subject(s)

Gastrointestinal Microbiome , Microbiota , Humans , Models, Statistical , Bayes Theorem , Poisson Distribution

20.

A general averaging method for count data with overdispersion and/or excess zeros in biomedicine.

Liu, Yin; Zhou, Jianghong; Chen, Zhanshou; Zhang, Xinyu.

Stat Methods Med Res ; 32(5): 904-926, 2023 05.

Article in English | MEDLINE | ID: mdl-36919477

ABSTRACT

With the aim of providing better estimation for count data with overdispersion and/or excess zeros, we develop a novel estimation method-optimal weighting based on cross-validation-for the zero-inflated negative binomial model, where the Poisson, negative binomial, and zero-inflated Poisson models are all included as its special cases. To facilitate the selection of the optimal weight vector, a K-fold cross-validation technique is adopted. Unlike the jackknife model averaging discussed in Hansen and Racine (2012), the proposed method deletes one group of observations rather than only one observation to enhance the computational efficiency. Furthermore, we also theoretically prove the asymptotic optimality of the newly developed optimal weighting based on cross-validation method. Simulation studies and three empirical applications indicate the superiority of the presented optimal weighting based on cross-validation method when compared with the three commonly used information-based model selection methods and their model averaging counterparts.

Subject(s)

Models, Statistical , Research Design , Poisson Distribution , Computer Simulation

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL