Results 1 - 20 of 138
1.
Biostatistics ; 2023 May 31.
Article in English | MEDLINE | ID: mdl-37257175

ABSTRACT

In complex tissues containing cells that are difficult to dissociate, single-nucleus RNA-sequencing (snRNA-seq) has become the preferred experimental technology over single-cell RNA-sequencing (scRNA-seq) to measure gene expression. To accurately model these data in downstream analyses, previous work has shown that droplet-based scRNA-seq data are not zero-inflated, but whether droplet-based snRNA-seq data follow the same probability distributions has not been systematically evaluated. Using pseudonegative control data from nuclei in mouse cortex sequenced with the 10x Genomics Chromium system and mouse kidney sequenced with the DropSeq system, we found that droplet-based snRNA-seq data follow a negative binomial distribution, suggesting that parametric statistical models applied to scRNA-seq are transferable to snRNA-seq. Furthermore, we found that the quantification choices in adapting quantification mapping strategies from scRNA-seq to snRNA-seq can play a significant role in downstream analyses and biological interpretation. In particular, reference transcriptomes that do not include intronic regions result in significantly smaller library sizes and incongruous cell type classifications. We also confirmed the presence of a gene length bias in snRNA-seq data, which we show is present in both exonic and intronic reads, and investigate potential causes for the bias.
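To make the "not zero-inflated" point concrete, here is a minimal sketch of the zero probability implied by a plain negative binomial. It assumes the mean/dispersion parameterization (`mu`, `size`), and the parameter values are hypothetical; none of this is taken from the paper's analysis. It only illustrates that a negative binomial with low mean and strong overdispersion already produces many zeros without any extra zero-inflation component.

```python
import math


def nb_zero_prob(mu: float, size: float) -> float:
    """P(Y = 0) for a negative binomial with mean mu and dispersion parameter size."""
    return (size / (size + mu)) ** size


def poisson_zero_prob(mu: float) -> float:
    """P(Y = 0) for a Poisson with mean mu."""
    return math.exp(-mu)


# Hypothetical low-expression gene: mean 0.5 counts per nucleus, strong overdispersion.
mu, size = 0.5, 0.5
p0_nb = nb_zero_prob(mu, size)
p0_pois = poisson_zero_prob(mu)
print(f"NB zero probability:      {p0_nb:.4f}")
print(f"Poisson zero probability: {p0_pois:.4f}")
```

An observed zero fraction close to `nb_zero_prob` for the fitted parameters is consistent with the negative binomial alone; a clear excess over it would point toward zero-inflation.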

2.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38470256

ABSTRACT

Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.


Subjects
Health Expenditures
3.
Stat Med ; 43(24): 4752-4767, 2024 Oct 30.
Article in English | MEDLINE | ID: mdl-39193779

ABSTRACT

BACKGROUND: Outcome measures that are count variables with excessive zeros are common in health behaviors research. Examples include the number of standard drinks consumed or alcohol-related problems experienced over time. There is a lack of empirical data about the relative performance of prevailing statistical models for assessing the efficacy of interventions when outcomes are zero-inflated, particularly compared with recently developed marginalized count regression approaches for such data. METHODS: The current simulation study examined five commonly used approaches for analyzing count outcomes, including two linear models (with outcomes on raw and log-transformed scales, respectively) and three prevailing count distribution-based models (ie, Poisson, negative binomial, and zero-inflated Poisson (ZIP) models). We also considered the marginalized zero-inflated Poisson (MZIP) model, a novel alternative that estimates the overall effects on the population mean while adjusting for zero-inflation. Motivated by alcohol misuse prevention trials, extensive simulations were conducted to evaluate and compare the statistical power and Type I error rate of the statistical models and approaches across data conditions that varied in sample size (N = 100 to 500), zero rate (0.2 to 0.8), and intervention effect sizes. RESULTS: Under zero-inflation, the Poisson model failed to control the Type I error rate, resulting in higher than expected false positive results. When the intervention effects on the zero (vs. non-zero) and count parts were in the same direction, the MZIP model had the highest statistical power, followed by the linear model with outcomes on the raw scale, negative binomial model, and ZIP model. The performance of the linear model with a log-transformed outcome variable was unsatisfactory.
CONCLUSIONS: The MZIP model demonstrated better statistical properties in detecting true intervention effects and controlling false positive results for zero-inflated count outcomes. This MZIP model may serve as an appealing analytical approach to evaluating overall intervention effects in studies with count outcomes marked by excessive zeros.


Subjects
Computer Simulation , Models, Statistical , Humans , Poisson Distribution , Linear Models , Sample Size , Outcome Assessment, Health Care/statistics & numerical data , Data Interpretation, Statistical , Alcoholism , Alcohol Drinking/epidemiology , Binomial Distribution
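As a rough illustration of the ZIP structure discussed in this abstract, the sketch below writes out the zero-inflated Poisson probability mass function and its overall (marginal) mean, (1 - pi) * lambda. The names `zip_pmf`, `zip_marginal_mean`, `lam` and `pi`, and the numeric values, are assumptions made for illustration; this is not the authors' simulation code. The marginal mean is the quantity that, per the abstract, the MZIP model targets directly, whereas a standard ZIP regression models `lam` and `pi` separately.

```python
import math


def zip_pmf(y: int, lam: float, pi: float) -> float:
    """P(Y = y) under a zero-inflated Poisson.

    With probability pi the outcome is a structural zero;
    otherwise it is drawn from Poisson(lam).
    """
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    return pi * (y == 0) + (1.0 - pi) * poisson


def zip_marginal_mean(lam: float, pi: float) -> float:
    """Overall population mean of a ZIP outcome: (1 - pi) * lam."""
    return (1.0 - pi) * lam


# Hypothetical values, chosen only for illustration.
lam, pi = 3.0, 0.4
total = sum(zip_pmf(y, lam, pi) for y in range(60))      # should be close to 1
mean = sum(y * zip_pmf(y, lam, pi) for y in range(60))   # numeric check of the mean
print(f"sum of pmf = {total:.6f}, mean = {mean:.6f}")
```

Because the marginal mean mixes both parameters, an intervention can change it through the zero part, the count part, or both, which is why effects on the two parts pointing in the same direction matter for power.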
4.
BMC Infect Dis ; 24(1): 1006, 2024 Sep 19.
Article in English | MEDLINE | ID: mdl-39300391

ABSTRACT

BACKGROUND: It is difficult to detect the outbreak of an emerging infectious disease based on the existing surveillance system. Here we investigate the utility of the Baidu Search Index, an indicator of how prominent a keyword is in Baidu's search volume, for early warning and prediction of the epidemic trend of COVID-19. METHODS: The daily number of cases and the Baidu Search Index of 8 keywords (weighted by population) from December 1, 2019 to March 15, 2020 were collected and analyzed with time series and Spearman correlation with different time lags. To predict the daily number of COVID-19 cases using the Baidu Search Index, zero-inflated negative binomial regression was used in phase 1 and a negative binomial regression model was used in phase 2 and phase 3, based on the characteristics of the independent variable. RESULTS: The Baidu Search Index of all keywords in Wuhan was significantly higher than in Hubei (excluding Wuhan) and China (excluding Hubei). Before the causative pathogen was identified, the search volume of "Influenza" and "Pneumonia" in Wuhan increased with the number of new onset cases, with correlation coefficients of 0.69 and 0.59, respectively. After the pathogen was made public but before COVID-19 was classified as a notifiable disease, the search volume of "SARS", "Pneumonia" and "Coronavirus" in all study areas increased with the number of new onset cases, with correlation coefficients of 0.69 to 0.89, while "Influenza" became negatively correlated (rs: -0.56 to -0.64). After COVID-19 was closely monitored, the Baidu Search Index of "COVID-19", "Pneumonia", "Coronavirus", "SARS" and "Mask" could predict the epidemic trend with 15 days, 5 days and 6 days lead time, respectively, in Wuhan, Hubei (excluding Wuhan) and China (excluding Hubei). The predicted number of cases would increase 1.84-fold and 4.81-fold relative to the actual number of cases in Wuhan and Hubei (excluding Wuhan), respectively, from 21 January to 9 February.
CONCLUSION: The Baidu Search Index could be used for early warning and prediction of the epidemic trend of COVID-19, but the search keywords changed across periods. Considering the time lag from onset to diagnosis, especially in areas with shortages of medical resources, internet search data can be a highly effective supplement to the existing surveillance system.


Subjects
COVID-19 , Disease Outbreaks , Epidemiological Monitoring , Models, Statistical , Regression Analysis , Search Engine , Humans , COVID-19/epidemiology , China/epidemiology , Time Factors , SARS-CoV-2/physiology
5.
Biom J ; 66(5): e202300182, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39001709

ABSTRACT

Spatial count data with an abundance of zeros arise commonly in disease mapping studies. Typically, these data are analyzed using zero-inflated models, which comprise a mixture of a point mass at zero and an ordinary count distribution, such as the Poisson or negative binomial. However, due to their mixture representation, conventional zero-inflated models are challenging to explain in practice because the parameter estimates have conditional latent-class interpretations. As an alternative, several authors have proposed marginalized zero-inflated models that simultaneously model the excess zeros and the marginal mean, leading to a parameterization that more closely aligns with ordinary count models. Motivated by a study examining predictors of COVID-19 death rates, we develop a spatiotemporal marginalized zero-inflated negative binomial model that directly models the marginal mean, thus extending marginalized zero-inflated models to the spatial setting. To capture the spatiotemporal heterogeneity in the data, we introduce region-level covariates, smooth temporal effects, and spatially correlated random effects to model both the excess zeros and the marginal mean. For estimation, we adopt a Bayesian approach that combines full-conditional Gibbs sampling and Metropolis-Hastings steps. We investigate features of the model and use the model to identify key predictors of COVID-19 deaths in the US state of Georgia during the 2021 calendar year.


Subjects
Bayes Theorem , Biometry , COVID-19 , Models, Statistical , Humans , COVID-19/mortality , COVID-19/epidemiology , Georgia/epidemiology , Biometry/methods , Spatial Analysis , Binomial Distribution
6.
Behav Res Methods ; 56(4): 2765-2781, 2024 04.
Article in English | MEDLINE | ID: mdl-38383801

ABSTRACT

Count outcomes are frequently encountered in single-case experimental designs (SCEDs). Generalized linear mixed models (GLMMs) have shown promise in handling overdispersed count data. However, the presence of excessive zeros in the baseline phase of SCEDs introduces a more complex issue known as zero-inflation, often overlooked by researchers. This study aimed to deal with zero-inflated and overdispersed count data within a multiple-baseline design (MBD) in single-case studies. It examined the performance of various GLMMs (Poisson, negative binomial [NB], zero-inflated Poisson [ZIP], and zero-inflated negative binomial [ZINB] models) in estimating treatment effects and generating inferential statistics. Additionally, a real example was used to demonstrate the analysis of zero-inflated and overdispersed count data. The simulation results indicated that the ZINB model provided accurate estimates for treatment effects, while the other three models yielded biased estimates. The inferential statistics obtained from the ZINB model were reliable when the baseline rate was low. However, when the data were overdispersed but not zero-inflated, both the ZINB and ZIP models exhibited poor performance in accurately estimating treatment effects. These findings contribute to our understanding of using GLMMs to handle zero-inflated and overdispersed count data in SCEDs. The implications, limitations, and future research directions are also discussed.


Subjects
Single-Case Studies as Topic , Humans , Linear Models , Multilevel Analysis/methods , Data Interpretation, Statistical , Models, Statistical , Poisson Distribution , Computer Simulation , Research Design
7.
Behav Res Methods ; 56(7): 7963-7984, 2024 10.
Article in English | MEDLINE | ID: mdl-38987450

ABSTRACT

Generalized linear mixed models (GLMMs) have great potential to deal with count data in single-case experimental designs (SCEDs). However, applied researchers have faced challenges in making various statistical decisions when using such advanced statistical techniques in their own research. This study focused on a critical issue by investigating the selection of an appropriate distribution to handle different types of count data in SCEDs due to overdispersion and/or zero-inflation. To achieve this, I proposed two model selection frameworks, one based on calculating information criteria (AIC and BIC) and another based on utilizing a multistage-model selection procedure. Four data scenarios were simulated including Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB). The same set of models (i.e., Poisson, NB, ZIP, and ZINB) was fitted for each scenario. In the simulation, I evaluated 10 model selection strategies within the two frameworks by assessing the model selection bias and its consequences on the accuracy of the treatment effect estimates and inferential statistics. Based on the simulation results and previous work, I provide recommendations regarding which model selection methods should be adopted in different scenarios. The implications, limitations, and future research directions are also discussed.


Subjects
Monte Carlo Method , Linear Models , Humans , Single-Case Studies as Topic , Computer Simulation , Data Interpretation, Statistical , Models, Statistical , Poisson Distribution , Research Design
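The information-criterion framework mentioned in this abstract can be sketched with the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The log-likelihoods, parameter counts and sample size below are hypothetical placeholders, not values from the study, and the sketch covers only the basic "pick the smallest criterion" rule rather than the ten strategies the author evaluated.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# Hypothetical (maximized log-likelihood, number of parameters) per candidate model.
fits = {
    "Poisson": (-250.0, 2),
    "NB": (-231.0, 3),
    "ZIP": (-228.5, 3),
    "ZINB": (-227.9, 4),
}
n_obs = 120  # hypothetical number of observations

# Smaller criterion values indicate the preferred model.
best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_obs))
print("AIC picks:", best_aic)
print("BIC picks:", best_bic)
```

BIC penalizes each extra parameter by ln(n) rather than 2, so with more than about 8 observations it favors the simpler model more strongly than AIC does.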
8.
Biostatistics ; 24(1): 161-176, 2022 12 12.
Article in English | MEDLINE | ID: mdl-34520533

ABSTRACT

Single-cell RNA-sequencing (scRNAseq) data contain a high level of noise, especially in the form of zero-inflation, that is, the presence of an excessively large number of zeros. This is largely due to dropout events and amplification biases that occur in the preparation stage of single-cell experiments. Recent scRNAseq experiments have been augmented with unique molecular identifiers (UMI) and External RNA Control Consortium (ERCC) molecules which can be used to account for zero-inflation. However, most of the current methods on graphical models are developed under the assumption of the multivariate Gaussian distribution or its variants, and thus they are not able to adequately account for an excessively large number of zeros in scRNAseq data. In this article, we propose a single-cell latent graphical model (scLGM)-a Bayesian hierarchical model for estimating the conditional dependency network among genes using scRNAseq data. Taking advantage of UMI and ERCC data, scLGM explicitly models the two sources of zero-inflation. Our simulation study and real data analysis demonstrate that the proposed approach outperforms several existing methods.


Subjects
RNA , Single-Cell Analysis , Humans , Sequence Analysis, RNA/methods , Bayes Theorem , RNA/genetics , Computer Simulation
9.
Biostatistics ; 23(1): 50-68, 2022 01 13.
Article in English | MEDLINE | ID: mdl-32282877

ABSTRACT

Joint models for a longitudinal biomarker and a terminal event have gained interest for evaluating cancer clinical trials because the tumor evolution directly reflects the state of the disease. A biomarker characterizing the tumor size evolution over time can be highly informative for assessing treatment options and could be taken into account in addition to the survival time. The biomarker often has a semicontinuous distribution, i.e., it is zero inflated and right skewed. An appropriate model is needed for the longitudinal biomarker as well as an association structure with the survival outcome. In this article, we propose a joint model for a longitudinal semicontinuous biomarker and a survival time. The semicontinuous nature of the longitudinal biomarker is specified by a two-part model, which splits its distribution into a binary outcome (first part) represented by the positive versus zero values and a continuous outcome (second part) with the positive values only. Survival times are modeled with a proportional hazards model for which we propose three association structures with the biomarker. Our simulation studies show some bias can arise in the parameter estimates when the semicontinuous nature of the biomarker is ignored, assuming the true model is a two-part model. An application to advanced metastatic colorectal cancer data from the GERCOR study is performed where our two-part model is compared to one-part joint models. Our results show that treatment arm B (FOLFOX6/FOLFIRI) is associated with higher SLD values over time and its positive association with the terminal event leads to an increased risk of death compared to treatment arm A (FOLFIRI/FOLFOX6).


Subjects
Colorectal Neoplasms , Models, Statistical , Biomarkers , Colorectal Neoplasms/drug therapy , Computer Simulation , Humans , Longitudinal Studies
10.
Biometrics ; 79(4): 3239-3251, 2023 12.
Article in English | MEDLINE | ID: mdl-36896642

ABSTRACT

The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.


Subjects
Gastrointestinal Microbiome , Microbiota , Humans , Models, Statistical , Bayes Theorem , Poisson Distribution
11.
Stat Med ; 42(25): 4632-4643, 2023 11 10.
Article in English | MEDLINE | ID: mdl-37607718

ABSTRACT

In this article, we present a flexible model for microbiome count data. We consider a quasi-likelihood framework, in which we do not make any assumptions on the distribution of the microbiome count except that its variance is an unknown but smooth function of the mean. By comparing our model to the negative binomial generalized linear model (GLM) and Poisson GLM in simulation studies, we show that our flexible quasi-likelihood method yields valid inferential results. Using a real microbiome study, we demonstrate the utility of our method by examining the relationship between adenomas and microbiota. We also provide an R package "fql" for the application of our method.


Subjects
Microbiota , Models, Statistical , Humans , Likelihood Functions , Computer Simulation , Poisson Distribution
12.
Stat Med ; 42(20): 3636-3648, 2023 09 10.
Article in English | MEDLINE | ID: mdl-37316997

ABSTRACT

Disease mapping is a research field to estimate spatial pattern of disease risks so that areas with elevated risk levels can be identified. The motivation of this article is from a study of dengue fever infection, which causes seasonal epidemics in almost every summer in Taiwan. For analysis of zero-inflated data with spatial correlation and covariates, current methods would either cause a computational burden or miss associations between zero and non-zero responses. In this article, we develop estimating equations for a mixture regression model that accommodates spatial dependence and zero inflation for study of disease propagation. Asymptotic properties for the proposed estimates are established. A simulation study is conducted to evaluate performance of the mixture estimating equations; and a dengue dataset from southern Taiwan is used to illustrate the proposed method.


Subjects
Dengue , Epidemics , Humans , Computer Simulation , Spatial Analysis , Taiwan/epidemiology , Dengue/epidemiology , Dengue/prevention & control , Models, Statistical
13.
Stat Med ; 42(28): 5100-5112, 2023 12 10.
Article in English | MEDLINE | ID: mdl-37715594

ABSTRACT

Physical activity (PA) guidelines recommend that PA be accumulated in bouts of 10 minutes or more in duration. Recently, researchers have sought to better understand how participants in PA interventions increase their activity. Participants can increase their daily PA by increasing the number of PA bouts per day while keeping the duration of the bouts constant; they can keep the number of bouts constant but increase the duration of each bout; or participants can increase both the number of bouts and their duration. We propose a novel joint modeling framework for modeling PA bouts and their duration over time. Our joint model is comprised of two sub-models: a mixed-effects Poisson hurdle sub-model for the number of bouts per day and a mixed-effects location scale gamma regression sub-model to characterize the duration of the bouts and their variance. The model allows us to estimate how daily PA bouts and their duration vary together over the course of an intervention and by treatment condition and is specifically designed to capture the unique distributional features of bouted PA as measured by accelerometer: frequent measurements, zero-inflated bouts, and skewed bout durations. We apply our methods to the Make Better Choices study, a longitudinal lifestyle intervention trial to increase PA. We perform a simulation study to evaluate how well our model is able to estimate relationships between outcomes.


Subjects
Exercise , Life Style , Humans , Accelerometry/methods , Time Factors , Clinical Trials as Topic
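The count part of the joint model in this abstract is a Poisson hurdle model. The sketch below shows the textbook hurdle-Poisson probability mass function and mean for a single observation, with no random effects or covariates; the names `hurdle_poisson_pmf`, `p_zero` and `lam`, and the parameter values, are hypothetical and not drawn from the paper. Unlike a zero-inflated model, a hurdle model produces all zeros from the binary part and uses a zero-truncated count distribution for the positive values.

```python
import math


def hurdle_poisson_pmf(y: int, lam: float, p_zero: float) -> float:
    """P(Y = y) under a Poisson hurdle model.

    Zeros occur with probability p_zero; positive counts follow a
    zero-truncated Poisson(lam), scaled by (1 - p_zero).
    """
    if y == 0:
        return p_zero
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    return (1.0 - p_zero) * poisson / (1.0 - math.exp(-lam))


def hurdle_poisson_mean(lam: float, p_zero: float) -> float:
    """Mean of the hurdle model: (1 - p_zero) * lam / (1 - exp(-lam))."""
    return (1.0 - p_zero) * lam / (1.0 - math.exp(-lam))


# Hypothetical values: 35% of days with no activity bout, lam = 2 on active days.
lam, p_zero = 2.0, 0.35
total = sum(hurdle_poisson_pmf(y, lam, p_zero) for y in range(60))
mean = sum(y * hurdle_poisson_pmf(y, lam, p_zero) for y in range(60))
print(f"sum of pmf = {total:.6f}, mean bouts per day = {mean:.6f}")
```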
14.
Biom J ; 65(8): e2100408, 2023 12.
Article in English | MEDLINE | ID: mdl-37439440

ABSTRACT

Count data with an excess of zeros are often encountered when modeling infectious disease occurrence. The degree of zero inflation can vary over time due to nonepidemic periods as well as by age group or region. A well-established approach to analyze multivariate incidence time series is the endemic-epidemic modeling framework, also known as the HHH approach. However, it assumes Poisson or negative binomial distributions and is thus not tailored to surveillance data with excess zeros. Here, we propose a multivariate zero-inflated endemic-epidemic model with random effects that extends HHH. Parameters of both the zero-inflation probability and the HHH part of this mixture model can be estimated jointly and efficiently via (penalized) maximum likelihood inference using analytical derivatives. We found proper convergence and good coverage of confidence intervals in simulation studies. An application to measles counts in the 16 German states, 2005-2018, showed that zero inflation is more pronounced in the Eastern states characterized by a higher vaccination coverage. Probabilistic forecasts of measles cases improved when accounting for zero inflation. We anticipate zero-inflated HHH models to be a useful extension also for other applications and provide an implementation in an R package.


Subjects
Measles , Models, Statistical , Humans , Time Factors , Computer Simulation , Measles/epidemiology , Measles/prevention & control , Germany/epidemiology , Poisson Distribution
15.
Biom J ; 65(4): e2200090, 2023 04.
Article in English | MEDLINE | ID: mdl-36732909

ABSTRACT

Disease mapping models have been widely used to model disease incidence with spatial correlation. In disease mapping models, zero inflation is an important issue, which often occurs in disease incidence datasets with high proportions of zero disease counts. It originates from limited survey coverage or insufficiently advanced testing equipment, which leaves some regions with no observed patients. The excessive zeros recorded in the disease incidence dataset then distort the true distribution of disease incidence and lead to inaccurate estimates. To address this issue, a zero-inflated disease mapping model is developed in this work. In this model, a zero-inflated process using Bernoulli indicators is assumed to characterize whether zero inflation occurs for each region. For regions without zero inflation, a coherent and generative disease mapping model is applied for mapping the spatially correlated disease incidence. Independent spatial random effects are incorporated in both processes to account for the spatial patterns of zero inflation and disease incidence. External covariates are also considered in both processes to better explain the disease count data. To estimate the model, a Markov chain Monte Carlo algorithm is proposed. We evaluate model performance via a variety of simulation experiments. Finally, a Lyme disease dataset of Virginia is analyzed to illustrate the application of the proposed model.


Subjects
Algorithms , Models, Statistical , Humans , Incidence , Poisson Distribution , Computer Simulation , Monte Carlo Method
16.
Ecol Lett ; 25(12): 2739-2752, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36269686

ABSTRACT

Species' responses to broad-scale environmental or spatial gradients are typically unimodal. Current models of species' responses along gradients tend to be overly simplistic (e.g., linear, quadratic or Gaussian GLMs), or are suitably flexible (e.g., splines, GAMs) but lack direct ecologically interpretable parameters. We describe a parametric framework for species-environment non-linear modelling ('senlm'). The framework has two components: (i) a non-linear parametric mathematical function to model the mean species response along a gradient that allows asymmetry, flattening/peakedness or bimodality; and (ii) a statistical error distribution tailored for ecological data types, allowing intrinsic mean-variance relationships and zero-inflation. We demonstrate the utility of this model framework, highlighting the flexibility of a range of possible mean functions and a broad range of potential error distributions, in analyses of fish species' abundances along a depth gradient, and how they change over time and at different latitudes.


Subjects
Environment , Nonlinear Dynamics , Animals , Spatial Analysis , Fishes
17.
Biometrics ; 78(4): 1686-1698, 2022 12.
Article in English | MEDLINE | ID: mdl-34213763

ABSTRACT

Recent studies have suggested that the temporal dynamics of the human microbiome may have associations with human health and disease. An increasing number of longitudinal microbiome studies, which record time to disease onset, aim to identify candidate microbes as biomarkers for prognosis. Owing to the ultra-skewness and sparsity of microbiome proportion (relative abundance) data, directly applying traditional statistical methods may result in substantial power loss or spurious inferences. We propose a novel joint modeling framework [JointMM], which is comprised of two sub-models: a longitudinal sub-model called zero-inflated scaled-beta generalized linear mixed-effects regression to depict the temporal structure of microbial proportions among subjects; and a survival sub-model to characterize the occurrence of an event and its relationship with the longitudinal microbiome proportions. JointMM is specifically designed to handle the zero-inflated and highly skewed longitudinal microbial proportion data and examine whether the temporal pattern of microbial presence and/or the nonzero microbial proportions are associated with differences in the time to an event. The longitudinal sub-model of JointMM also provides the capacity to investigate how the (time-varying) covariates are related to the temporal microbial presence/absence patterns and/or the changing trend in nonzero proportions. Comprehensive simulations and real data analyses are used to assess the statistical efficiency and interpretability of JointMM.


Subjects
Gastrointestinal Microbiome , Microbiota , Humans , Models, Statistical , Linear Models , Longitudinal Studies
18.
Biometrics ; 78(2): 766-776, 2022 06.
Article in English | MEDLINE | ID: mdl-33720414

ABSTRACT

Interactions between biological molecules in a cell are tightly coordinated and often highly dynamic. As a result of these varying signaling activities, changes in gene coexpression patterns could often be observed. The advancements in next-generation sequencing technologies bring new statistical challenges for studying these dynamic changes of gene coexpression. In recent years, methods have been developed to examine genomic information from individual cells. Single-cell RNA sequencing (scRNA-seq) data are count-based, and often exhibit characteristics such as overdispersion and zero inflation. To explore the dynamic dependence structure in scRNA-seq data and other zero-inflated count data, new approaches are needed. In this paper, we consider overdispersion and zero inflation in count outcomes and propose a ZEro-inflated negative binomial dynamic COrrelation model (ZENCO). The observed count data are modeled as a mixture of two components: success amplifications and dropout events in ZENCO. A latent variable is incorporated into ZENCO to model the covariate-dependent correlation structure. We conduct simulation studies to evaluate the performance of our proposed method and to compare it with existing approaches. We also illustrate the implementation of our proposed approach using scRNA-seq data from a study of minimal residual disease in melanoma.


Subjects
High-Throughput Nucleotide Sequencing , Models, Statistical , Computer Simulation , Sequence Analysis, RNA/methods , Exome Sequencing
19.
Stat Med ; 41(18): 3492-3510, 2022 08 15.
Article in English | MEDLINE | ID: mdl-35656596

ABSTRACT

The performance of computational methods and software to identify differentially expressed features in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on their own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows zero probability mass to exceed a traditional Tweedie distribution to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Gene Expression Profiling/methods , Humans , RNA-Seq , Sequence Analysis, RNA , Software
20.
Stat Med ; 41(16): 3180-3198, 2022 07 20.
Article in English | MEDLINE | ID: mdl-35429179

ABSTRACT

In many medical and social science studies, count responses with excess zeros are very common and often the primary outcome of interest. Such count responses are usually generated under some clustered correlation structures due to longitudinal observations of subjects. To model such longitudinal count data with excess zeros, the zero-inflated binomial (ZIB) models for bounded outcomes, and the zero-inflated negative binomial (ZINB) and zero-inflated Poisson (ZIP) models for unbounded outcomes are all popular methods. To alleviate the effects of deviations from model assumptions, a semiparametric (or, distribution-free) weighted generalized estimating equations approach has been proposed to estimate model parameters when data are subject to missingness. In this article, we further explore important covariates for the response variable. Without assumptions on the data distribution, a model selection criterion based on the expected weighted quadratic loss is proposed to select an appropriate subset of covariates, especially when count responses have excess zeros and data are subject to nonmonotone missingness in both responses and covariates. To understand the selection effects of the percentages of excess zeros and missingness, we design various scenarios for covariate selection in the mean model via simulation studies; a real data example regarding the study of cardiovascular disease is also presented for illustration.


Subjects
Cardiovascular Diseases , Models, Statistical , Computer Simulation , Humans , Poisson Distribution , Weight Loss