ABSTRACT
Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness-of-fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This offers an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
Subjects
Binomial Distribution , RNA-Seq/methods , Microbiota , Poisson Distribution , RNA, Ribosomal, 16S/genetics , Regression Analysis
ABSTRACT
In this article we construct three smooth goodness-of-fit tests for testing for the zero-inflated Poisson (ZIP) distribution against general smooth alternatives in the sense of Neyman. We apply our tests to a data set previously claimed to be ZIP distributed, and show that the ZIP is not a good model to describe the data. When the null hypothesis of ZIP is rejected, the individual components of the test statistic, which are directly related to interpretable parameters in a smooth model, may be used to gain insight into an alternative distribution.
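For readers unfamiliar with the null model named above, the following is a minimal sketch of the standard ZIP probability mass function and its first two moments. It is not the authors' smooth test. The function names `zip_pmf` and `zip_mean_var` and the parameter names `lam` (Poisson mean) and `pi` (probability of an extra zero) are hypothetical, chosen only for illustration.

```python
import math


def zip_pmf(y: int, lam: float, pi: float) -> float:
    """P(Y = y) for a zero-inflated Poisson.

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    (Names are illustrative assumptions, not taken from the article.)
    """
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    if y == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean_var(lam: float, pi: float) -> tuple[float, float]:
    """Mean and variance of the ZIP distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


if __name__ == "__main__":
    lam, pi = 2.0, 0.3
    total = sum(zip_pmf(y, lam, pi) for y in range(100))
    mean, var = zip_mean_var(lam, pi)
    print(f"sum of pmf over 0..99: {total:.6f}")
    print(f"mean: {mean:.3f}, variance: {var:.3f}")
```

Whenever `pi` and `lam` are both positive the variance exceeds the mean, so the ZIP model accommodates excess zeros and some overdispersion relative to a plain Poisson. A goodness-of-fit test such as the ones described in the abstract asks whether that is enough to describe a given data set.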