*Stat Methods Med Res ; 33(3): 465-479, 2024 Mar.*

##### ABSTRACT

The weighted sum of binomial proportions and the interaction effect are two important cases of the linear combination of binomial proportions. Existing confidence intervals for these two parameters are approximate. We apply the h-function method to a given approximate interval and obtain an exact interval. The process is repeated multiple times until the final-improved interval (exact) cannot be shortened. In particular, for the weighted sum of two proportions, we derive two final-improved intervals based on the (approximate) adjusted score and fiducial intervals. After comparing several currently used intervals, we recommend these two final-improved intervals for practice. For the weighted sum of three proportions and the interaction effect, the final-improved interval based on the adjusted score interval should be used. Three real datasets are used to detail how the approximate intervals are improved.
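As a point of reference for what an approximate interval for a weighted sum of proportions looks like, the sketch below computes a plain Wald interval for $\sum_i w_i p_i$ from independent binomial samples. This is not the h-function method, nor the adjusted score or fiducial interval from the abstract; the function name and the counts are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval_weighted_sum(successes, trials, weights, level=0.95):
    """Wald (normal-approximation) interval for sum_i w_i * p_i.

    Assumes independent binomial samples; illustrative baseline only.
    """
    p_hat = [x / n for x, n in zip(successes, trials)]
    estimate = sum(w * p for w, p in zip(weights, p_hat))
    # Variance of a weighted sum of independent sample proportions.
    variance = sum(
        w ** 2 * p * (1 - p) / n for w, p, n in zip(weights, p_hat, trials)
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical data: 12/40 and 30/50 successes, equal weights.
lo, hi = wald_interval_weighted_sum([12, 30], [40, 50], [0.5, 0.5])
print(f"95% Wald interval: ({lo:.4f}, {hi:.4f})")
```

Wald intervals of this kind are known to have coverage below the nominal level for small samples or extreme proportions, which is the usual motivation for exact alternatives.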

##### Subjects

Statistical Models, Binomial Distribution, Confidence Intervals

*PLoS Comput Biol ; 20(2): e1011856, 2024 Feb.*

##### ABSTRACT

Outbreaks of emerging and zoonotic infections represent a substantial threat to human health and well-being. These outbreaks tend to be characterised by highly stochastic transmission dynamics with intense variation in transmission potential between cases. The negative binomial distribution is commonly used as a model for transmission in the early stages of an epidemic as it has a natural interpretation as the convolution of a Poisson contact process and a gamma-distributed infectivity. In this study we expand upon the negative binomial model by introducing a beta-Poisson mixture model in which infectious individuals make contacts at the points of a Poisson process and then transmit infection along these contacts with a beta-distributed probability. We show that the negative binomial distribution is a limit case of this model, as is the zero-inflated Poisson distribution obtained by combining a Poisson-distributed contact process with an additional failure probability. We assess the beta-Poisson model's applicability by fitting it to secondary case distributions (the distribution of the number of subsequent cases generated by a single case) estimated from outbreaks covering a range of pathogens and geographical settings. We find that while the beta-Poisson mixture can achieve a closer fit to data than the negative binomial distribution, it is consistently outperformed by the negative binomial in terms of Akaike Information Criterion, making it a suboptimal choice on grounds of parsimony. The beta-Poisson performs similarly to the negative binomial model in its ability to capture features of the secondary case distribution such as overdispersion, prevalence of superspreaders, and the probability of a case generating zero subsequent cases. Despite this possible shortcoming, the beta-Poisson distribution may still be of interest in the context of intervention modelling since its structure allows for the simulation of measures which change contact structures while leaving individual-level infectivity unchanged, and vice-versa.
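To make the negative binomial baseline concrete, the following sketch evaluates a negative binomial secondary case distribution parameterised by its mean and a dispersion parameter `k`, and checks the closed-form probability that a case generates zero subsequent cases, $(1 + \text{mean}/k)^{-k}$. The parameter values are hypothetical and are not estimates from the study; the beta-Poisson model itself is not implemented here.

```python
from math import exp, lgamma, log


def nb_pmf(x, mean, k):
    """Negative binomial pmf with the given mean and dispersion k.

    The variance is mean + mean**2 / k, so small k means strong overdispersion.
    """
    log_p = (
        lgamma(x + k) - lgamma(k) - lgamma(x + 1)
        + k * log(k / (k + mean))
        + x * log(mean / (k + mean))
    )
    return exp(log_p)


# Hypothetical values: mean of 2 secondary cases, dispersion k = 0.2.
mean, k = 2.0, 0.2
p_zero = nb_pmf(0, mean, k)
total = sum(nb_pmf(x, mean, k) for x in range(5000))
mean_check = sum(x * nb_pmf(x, mean, k) for x in range(5000))
print(f"P(no secondary cases) = {p_zero:.3f}")
```

With a small `k`, most cases generate no onward transmission while a few generate many, which is the superspreading pattern the abstract refers to.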

##### Subjects

Disease Outbreaks, Statistical Models, Humans, Computer Simulation, Poisson Distribution, Binomial Distribution

*JASA Express Lett ; 4(2), 2024 Feb 01.*

##### ABSTRACT

Partial credit scoring for speech recognition tasks can improve measurement precision. However, assessing the magnitude of this improvement with partial credit scoring is challenging because meaningful speech contains contextual cues, which create correlations between the probabilities of correctly identifying each token in a stimulus. Here, beta-binomial distributions were used to estimate recognition accuracy and intraclass correlation for phonemes in words and words in sentences in listeners with cochlear implants (N = 20). Estimates demonstrated substantial intraclass correlation in recognition accuracy within stimuli. These correlations were invariant across individuals. Intraclass correlations should be addressed in power analysis of partial credit scoring.
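The effect of intraclass correlation on partial credit scores can be sketched with a beta-binomial distribution parameterised by mean accuracy `p` and intraclass correlation `rho`. Under this standard parameterisation the variance of the number of correct tokens is the binomial variance multiplied by $1 + (n-1)\rho$. The values below are hypothetical and are not the estimates reported in the study.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, n, p, rho):
    """Beta-binomial pmf with mean proportion p and intraclass correlation rho."""
    s = (1 - rho) / rho          # a + b, since rho = 1 / (a + b + 1)
    a, b = p * s, (1 - p) * s
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))


# Hypothetical example: 5 words per sentence, 70% accuracy, rho = 0.3.
n, p, rho = 5, 0.7, 0.3
pmf = [beta_binomial_pmf(x, n, p, rho) for x in range(n + 1)]
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))
inflation = var / (n * p * (1 - p))   # ratio to the independent binomial variance
print(f"variance inflation = {inflation:.2f}")
```

In this example the variance is more than double what independent tokens would give, which is why a power analysis that ignores the correlation would overstate the gain from partial credit scoring.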

##### Subjects

Cochlear Implantation, Cochlear Implants, Speech Perception, Humans, Binomial Distribution, Speech

*Pharmacoepidemiol Drug Saf ; 33(2): e5750, 2024 Feb.*

##### ABSTRACT

PURPOSE: Outcome variables that are assumed to follow a negative binomial distribution are frequently used in both clinical and epidemiological studies. Epidemiological studies, particularly those performed by pharmaceutical companies, often aim to describe a population rather than compare treatments. Such descriptive studies are often analysed using confidence intervals. While precision calculations and sample size calculations are not always performed in these settings, they have the important role of setting expectations of what results the study may generate. Current methods for precision calculations for the negative binomial rate are based on plugging parameter values into the confidence interval formulae. This method has the downside of ignoring the randomness of the confidence interval limits. To enable better practice for precision calculations, methods are needed that address the randomness. METHODS: Using the well-known delta method, we develop a method for calculating the precision probability, that is, the probability of achieving a certain width. We assess the performance of the method in smaller samples through simulations. RESULTS: The method for the precision probability performs well in small to medium sample sizes, and the usefulness of the method is demonstrated through an example. CONCLUSIONS: We have developed a simple method for calculating the precision probability for negative binomial rates. This method can be used when planning epidemiological studies in, for example, asthma, while correctly taking the randomness of confidence intervals into account.

##### Subjects

Statistical Models, Humans, Sample Size, Probability, Binomial Distribution, Confidence Intervals

*Pharm Stat ; 23(1): 46-59, 2024.*

##### ABSTRACT

Count outcomes are collected in clinical trials for new drug development in several therapeutic areas and the event rate is commonly used as a single primary endpoint. Count outcomes whose variance exceeds the mean are termed overdispersed; such outcomes are therefore commonly assumed to follow a negative binomial distribution. However, in clinical trials for treating asthma and chronic obstructive pulmonary disease (COPD), a regulatory agency has suggested that a continuous endpoint related to lung function must be evaluated as a primary endpoint in addition to the event rate. The two co-primary endpoints that need to be evaluated include overdispersed count and continuous outcomes. Some researchers have proposed sample size calculation methods in the context of co-primary endpoints for various outcome types. However, methodologies for sample size calculation in trials with two co-primary endpoints, including overdispersed count and continuous outcomes, required when planning clinical trials for treating asthma and COPD, remain to be proposed. In this study, we aimed to develop a hypothesis-testing method and a corresponding sample size calculation method with two co-primary endpoints including overdispersed count and continuous outcomes. In a simulation, we demonstrated that the proposed sample size calculation method has adequate power accuracy. In addition, we illustrated an application of the proposed sample size calculation method to a placebo-controlled Phase 3 trial for patients with COPD.

##### Subjects

Asthma, Chronic Obstructive Pulmonary Disease, Humans, Sample Size, Asthma/drug therapy, Chronic Obstructive Pulmonary Disease/diagnosis, Chronic Obstructive Pulmonary Disease/drug therapy, Binomial Distribution, Computer Simulation

*Stat Med ; 43(1): 125-140, 2024 01 15.*

##### ABSTRACT

Timeline followback (TLFB) is often used in addiction research to monitor recent substance use, such as the number of abstinent days in the past week. TLFB data usually take the form of binomial counts that exhibit overdispersion and zero inflation. Motivated by a 12-week randomized trial evaluating the efficacy of varenicline tartrate for smoking cessation among adolescents, we propose a Bayesian zero-inflated beta-binomial model for the analysis of longitudinal, bounded TLFB data. The model comprises a mixture of a point mass that accounts for zero inflation and a beta-binomial distribution for the number of days abstinent in the past week. Because treatment effects appear to level off during the study, we introduce random changepoints for each study group to reflect group-specific changes in treatment efficacy over time. The model also includes fixed and random effects that capture group- and subject-level slopes before and after the changepoints. Using the model, we can accurately estimate the mean trend for each study group, test whether the groups experience changepoints simultaneously, and identify critical windows of treatment efficacy. For posterior computation, we propose an efficient Markov chain Monte Carlo algorithm that relies on easily sampled Gibbs and Metropolis-Hastings steps. Our application shows that the varenicline group has a short-term positive effect on abstinence that tapers off after week 9.

##### Subjects

Statistical Models, Substance-Related Disorders, Adolescent, Humans, Bayes Theorem, Binomial Distribution, Algorithms

*BMC Bioinformatics ; 24(1): 314, 2023 Aug 18.*

##### ABSTRACT

Existing methods for generating synthetic genotype data are ill-suited for replicating the effects of assortative mating (AM). We propose rb_dplr, a novel and computationally efficient algorithm for generating high-dimensional binary random variates that effectively recapitulates AM-induced genetic architectures using the Bahadur order-2 approximation of the multivariate Bernoulli distribution. The rBahadur R library is available through the Comprehensive R Archive Network at https://CRAN.R-project.org/package=rBahadur .

##### Subjects

Algorithms, Cell Communication, Binomial Distribution, Computer Simulation, Genotype

*BMC Bioinformatics ; 24(1): 187, 2023 May 08.*

##### ABSTRACT

BACKGROUND: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. RESULTS: We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison where we show that state-of-the-art methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice. CONCLUSIONS: With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS .

##### Subjects

Algorithms, Breast, Male, Humans, Mutation, Binomial Distribution, Computer Simulation

*Stat Methods Med Res ; 32(7): 1300-1317, 2023 07.*

##### ABSTRACT

The zero-inflated negative binomial distribution has been widely used for count data analyses in various biomedical settings due to its capacity of modeling excess zeros and overdispersion. When there are correlated count variables, a bivariate model is essential for understanding their full distributional features. Examples include measuring correlation of two genes in sparse single-cell RNA sequencing data and modeling dental caries count indices on two different tooth surface types. For these purposes, we develop a richly parametrized bivariate zero-inflated negative binomial model that has a simple latent variable framework and eight free parameters with intuitive interpretations. In the scRNA-seq data example, the correlation is estimated after adjusting for the effects of dropout events represented by excess zeros. In the dental caries data, we analyze how the treatment with Xylitol lozenges affects the marginal mean and other patterns of response manifested in the two dental caries traits. An R package "bzinb" is available on Comprehensive R Archive Network.

##### Subjects

Dental Caries, Humans, Statistical Models, Binomial Distribution, Data Analysis, Poisson Distribution

*Stat Methods Med Res ; 32(3): 474-492, 2023 03.*

##### ABSTRACT

Changes in cognitive function over time are of interest in ageing research, and a joint model is constructed to investigate them. Generally, cognitive function is measured through more than one test, and the test scores are integers. The aim is to investigate two test scores and use an extension of a bivariate binomial distribution to define a new joint model. This bivariate distribution models the correlation between the two test scores. To deal with attrition due to death, the Weibull hazard model and the Gompertz hazard model are used. A shared random-effects model is constructed, and the random effects are assumed to follow a bivariate normal distribution. It is shown how to incorporate random effects that link the bivariate longitudinal model and the survival model. The joint model is applied to the English Longitudinal Study of Ageing data.

##### Subjects

Cognition, Statistical Models, Longitudinal Studies, Proportional Hazards Models, Binomial Distribution

*Int J Biostat ; 19(1): 21-38, 2023 05 01.*

##### ABSTRACT

Meta-analysis of binary outcome data often faces a situation where studies with a rare event are part of the set of studies to be considered. These studies have such low event counts that no events occur in one or both of the groups to be compared. This raises the issue of how to validly estimate the summary risk or rate ratio across studies. A preferred choice is the Mantel-Haenszel estimator, which is still defined in the presence of zero studies unless all studies have zeros in one of the groups to be compared. For this situation, a modified Mantel-Haenszel estimator is suggested and shown to perform well by means of simulation work. Also, confidence interval estimation is discussed and evaluated in a simulation study. In a second part, heterogeneity of relative risk across studies is investigated with a new chi-square type statistic which is based on a conditional binomial distribution where the conditioning is on the event margin for each study. This is necessary as the conventional Q-statistic is undefined when zero studies occur. The null-distribution of the proposed Q-statistic is obtained by means of a parametric bootstrap as a chi-square approximation is not valid for rare events meta-analysis, as bootstrapping of the null-distribution shows. In addition, for the effect heterogeneity situation, confidence interval estimation is considered using a nonparametric bootstrap procedure. The proposed techniques are illustrated using three meta-analytic data sets.
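The behaviour of the classical Mantel-Haenszel risk ratio with zero-event studies can be seen in a short sketch. It implements only the standard estimator, not the modified estimator proposed in the paper (whose form is not given in the abstract); the function name and the study counts are hypothetical.

```python
def mantel_haenszel_rr(studies):
    """Classical Mantel-Haenszel risk ratio.

    Each study is (events_treated, n_treated, events_control, n_control).
    Returns None when the denominator is zero, i.e. when no study has
    any event in the control group.
    """
    numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in studies)
    denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in studies)
    return numerator / denominator if denominator > 0 else None


# Hypothetical rare-event data, including single-zero and double-zero studies.
studies = [(1, 100, 2, 100), (0, 50, 1, 50), (0, 80, 0, 80)]
rr = mantel_haenszel_rr(studies)
print(f"Mantel-Haenszel RR = {rr:.3f}")

# With zero control events in every study the estimator is undefined.
undefined_case = mantel_haenszel_rr([(1, 10, 0, 10)])
```

Because each study contributes to the numerator and denominator sums separately, a study with zero events in one or both arms does not break the estimator on its own; it fails only when a whole sum is zero.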

##### Subjects

Risk, Odds Ratio, Computer Simulation, Binomial Distribution

*Biom J ; 65(2): e2200073, 2023 02.*

##### ABSTRACT

Common count distributions, such as the Poisson (binomial) distribution for unbounded (bounded) counts considered here, can be characterized by appropriate Stein identities. These identities, in turn, might be utilized to define a corresponding goodness-of-fit (GoF) test, the test statistic of which involves the computation of weighted means for a user-selected weight function f. Here, the choice of f should be made with respect to the relevant alternative scenario, as it will have great impact on the GoF-test's performance. We derive the asymptotics of both the Poisson and binomial Stein-type GoF-statistic for general count distributions (we also briefly consider the negative-binomial case), such that the asymptotic power is easily computed for arbitrary alternatives. This allows for an efficient implementation of optimal Stein tests, that is, tests which are most powerful within a given class $\mathcal{F}$ of weight functions. The performance and application of the optimal Stein-type GoF-tests are investigated by simulations and several medical data examples.
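For orientation, the textbook forms of the two Stein identities are given below; the paper may state them in a different but equivalent way. Here $f$ is any bounded weight function on the non-negative integers, $\lambda$ is the Poisson mean, and $n$ and $p$ are the binomial parameters.

```latex
% Poisson Stein identity: X ~ Poi(lambda) if and only if, for all bounded f,
\mathbb{E}\bigl[X\, f(X)\bigr] \;=\; \lambda\, \mathbb{E}\bigl[f(X+1)\bigr].

% Binomial Stein identity: X ~ Bin(n, p) if and only if, for all bounded f,
(1-p)\, \mathbb{E}\bigl[X\, f(X)\bigr] \;=\; p\, \mathbb{E}\bigl[(n-X)\, f(X+1)\bigr].
```

A goodness-of-fit statistic of the kind described can then compare sample versions of the two sides, for example the ratio of the sample mean of $X f(X)$ to $\bar{X}$ times the sample mean of $f(X+1)$ in the Poisson case, which should be close to 1 under the null hypothesis.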

##### Subjects

Statistical Models, Binomial Distribution

*Braz. j. biol ; 83: 1-8, 2023. map, tab, graf*

##### ABSTRACT

The intertidal rocky shores of continental Chile have high species diversity, mainly in northern Chile (18-27° S), and one of the most widespread species is the gastropod Echinolittorina peruviana (Lamarck, 1822). The aim of the present study is to provide a first characterization of the spatial distribution of E. peruviana along the rocky shore of the town of Antofagasta in northern Chile. Individuals were counted at nine sites, whose spectral properties were also determined using remote sensing techniques (LANDSAT ETM+). The results revealed that sites without marked human intervention have higher abundance than sites located in the town. An aggregated pattern was found at all studied sites, and a negative binomial distribution was found at six of them. The low density at sites with human intervention is supported when the spectral properties of the sites are included. These results agree with similar results for rocky shores in northern and southern Chile.

##### Subjects

Animals, Marine Environment, Coasts, Gastropoda/growth & development, Remote Sensing Technology, Binomial Distribution

*J Acoust Soc Am ; 152(3): 1404, 2022 09.*

##### ABSTRACT

Speech-recognition tests are a routine component of the clinical hearing evaluation. The most common type of test uses recorded monosyllabic words presented in quiet. The interpretation of test scores relies on an understanding of the variance of repeated tests. Confidence intervals are useful for determining if two scores are significantly different or if the difference is due to the variability of test scores. Because the response to each test item is binary, either correct or incorrect, the binomial distribution has been used to estimate confidence intervals. This method requires that test scores be independent. If the scores are not independent, the binomial distribution will not accurately estimate the variance of repeated scores. A previously published dataset with repeated scores from normal-hearing and hearing-impaired listeners was used to derive confidence intervals from actual test scores in contrast to the predicted confidence intervals in earlier reports. This analysis indicates that confidence intervals predicted by the binomial distribution substantially overestimate the variance of repeated scores resulting in erroneously broad confidence intervals. High correlations were found for repeated scores, indicating that scores are not independent. The interdependence of repeated scores invalidates confidence intervals predicted by the binomial distribution. Confidence intervals and confidence levels for repeated measures were determined empirically from measured test scores to assist in interpreting differences between repeat scores.
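For readers unfamiliar with the binomial approach that the abstract critiques, the sketch below computes a Wilson score interval for a single word-recognition score. It treats the test items as independent Bernoulli trials, which is exactly the assumption under discussion. The Wilson form is one common choice and is not necessarily the interval used in the earlier reports; the score of 40 out of 50 is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, items, level=0.95):
    """Wilson score interval for a binomial proportion (independent items assumed)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = correct / items
    denom = 1 + z ** 2 / items
    centre = (p_hat + z ** 2 / (2 * items)) / denom
    half_width = (z / denom) * sqrt(
        p_hat * (1 - p_hat) / items + z ** 2 / (4 * items ** 2)
    )
    return centre - half_width, centre + half_width


# Hypothetical score: 40 of 50 monosyllabic words repeated correctly.
lo, hi = wilson_interval(40, 50)
print(f"95% interval for the true score: ({lo:.3f}, {hi:.3f})")
```

The empirical intervals described in the abstract replace this model-based width with one derived from observed repeated scores.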

##### Subjects

Sensorineural Hearing Loss, Speech Perception, Binomial Distribution, Confidence Intervals, Humans, Speech, Speech Discrimination Tests/methods, Speech Perception/physiology, Speech Reception Threshold Test

*PLoS One ; 17(9): e0264246, 2022.*

##### ABSTRACT

RNA-seq is a high-throughput sequencing technology widely used for gene transcript discovery and quantification under different biological or biomedical conditions. A fundamental research question in most RNA-seq experiments is the identification of differentially expressed genes among experimental conditions or sample groups. Numerous statistical methods for RNA-seq differential analysis have been proposed since the emergence of the RNA-seq assay. To evaluate popular differential analysis methods used in the open source R and Bioconductor packages, we conducted multiple simulation studies to compare the performance of eight RNA-seq differential analysis methods used in RNA-seq data analysis (edgeR, DESeq, DESeq2, baySeq, EBSeq, NOISeq, SAMSeq, Voom). The comparisons were across different scenarios with either equal or unequal library sizes, different distribution assumptions and sample sizes. We measured performance using false discovery rate (FDR) control, power, and stability. No significant differences were observed for FDR control, power, or stability across methods, whether with equal or unequal library sizes. For RNA-seq count data with negative binomial distribution, when sample size is 3 in each group, EBSeq performed better than the other methods as indicated by FDR control, power, and stability. When sample sizes increase to 6 or 12 in each group, DESeq2 performed slightly better than other methods. All methods have improved performance when sample size increases to 12 in each group except DESeq. For RNA-seq count data with log-normal distribution, both DESeq and DESeq2 methods performed better than other methods in terms of FDR control, power, and stability across all sample sizes. Real RNA-seq experimental data were also used to compare the total number of discoveries and stability of discoveries for each method. 
For RNA-seq data analysis, the EBSeq method is recommended for studies with sample size as small as 3 in each group, and the DESeq2 method is recommended for sample size of 6 or higher in each group when the data follow the negative binomial distribution. Both DESeq and DESeq2 methods are recommended when the data follow the log-normal distribution.

##### Subjects

High-Throughput Nucleotide Sequencing, Binomial Distribution, High-Throughput Nucleotide Sequencing/methods, RNA-Seq, Sample Size, RNA Sequence Analysis/methods

*BMC Med Res Methodol ; 22(1): 211, 2022 08 04.*

##### ABSTRACT

BACKGROUND: Hospital length of stay (LOS) is a key indicator of hospital care management efficiency, cost of care, and hospital planning. Hospital LOS is often used as a measure of a post-medical procedure outcome, as a guide to the benefit of a treatment of interest, or as an important risk factor for adverse events. Therefore, understanding hospital LOS variability is always an important healthcare focus. Hospital LOS data can be treated as count data, with discrete and non-negative values, typically right skewed, and often exhibiting excessive zeros. In this study, we compared the performance of the Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) regression models using simulated and empirical data. METHODS: Data were generated under different simulation scenarios with varying sample sizes, proportions of zeros, and levels of overdispersion. Analysis of hospital LOS was conducted using empirical data from the Medical Information Mart for Intensive Care database. RESULTS: Results showed that Poisson and ZIP models performed poorly in overdispersed data. ZIP outperformed the rest of the regression models when the overdispersion was due to zero-inflation only. NB and ZINB regression models faced substantial convergence issues when incorrectly used to model equidispersed data. The NB model provided the best fit in overdispersed data and outperformed the ZINB model in many simulation scenarios with combinations of zero-inflation and overdispersion, regardless of the sample size. In the empirical data analysis, we demonstrated that fitting incorrect models to overdispersed data led to incorrect regression coefficient estimates and overstated significance of some of the predictors. CONCLUSIONS: Based on this study, we recommend that researchers consider ZIP models for count data with zero-inflation only and NB models for overdispersed data or data with combinations of zero-inflation and overdispersion. If the researcher believes there are two different data generating mechanisms producing zeros, then the ZINB regression model may provide greater flexibility when modeling the zero-inflation and overdispersion.
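The statement that zero-inflation alone produces overdispersion can be checked with a short sketch of the zero-inflated Poisson (ZIP) distribution. With Poisson mean `lam` and structural-zero probability `pi`, the ZIP mean is $(1-\pi)\lambda$ and the variance is $(1-\pi)\lambda(1+\pi\lambda)$, which exceeds the mean whenever $\pi > 0$. The parameter values are hypothetical and no regression component is included.

```python
from math import exp, factorial


def zip_pmf(x, lam, pi):
    """Zero-inflated Poisson pmf: a point mass at zero mixed with Poisson(lam)."""
    poisson = exp(-lam) * lam ** x / factorial(x)
    return pi * (x == 0) + (1 - pi) * poisson


# Hypothetical values: Poisson mean 3 days, 40% structural zeros.
lam, pi = 3.0, 0.4
support = range(60)  # the tail beyond 60 is negligible for lam = 3
mean = sum(x * zip_pmf(x, lam, pi) for x in support)
var = sum((x - mean) ** 2 * zip_pmf(x, lam, pi) for x in support)
print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

A plain Poisson model fitted to such data would force the variance to equal the mean, which is one way to see why it performs poorly in the simulations described.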

##### Subjects

Hospitals, Statistical Models, Binomial Distribution, Humans, Length of Stay, Poisson Distribution

*Biom J ; 64(5): 912-933, 2022 06.*

##### ABSTRACT

The identification and treatment of "one-inflation" in estimating the size of an elusive population has received increasing attention in capture-recapture literature in recent years. The phenomenon occurs when the number of units captured exactly once clearly exceeds the expectation under a baseline count distribution. Ignoring one-inflation has serious consequences for estimation of the population size, which can be drastically overestimated. In this paper we propose a Bayesian approach for Poisson, geometric, and negative binomial one-inflated count distributions. Posterior inference for population size will be obtained applying a Gibbs sampler approach. We also provide a Bayesian approach to model selection. We illustrate the proposed methodology with simulated and real data and propose a new application in official statistics to estimate the number of people implicated in the exploitation of prostitution in Italy.

##### Subjects

Statistical Models, Bayes Theorem, Binomial Distribution, Humans, Poisson Distribution, Population Density

*Stat Med ; 41(12): 2191-2204, 2022 05 30.*

##### ABSTRACT

Cluster randomized trials (CRT) have been widely employed in medical and public health research. Many clinical count outcomes, such as the number of falls in nursing homes, exhibit excessive zero values. In the presence of zero inflation, traditional power analysis methods for count data based on Poisson or negative binomial distribution may be inadequate. In this study, we present a sample size method for CRTs with zero-inflated count outcomes. It is developed based on GEE regression directly modeling the marginal mean of a zero-inflated Poisson outcome, which avoids the challenge of testing two intervention effects under traditional modeling approaches. A closed-form sample size formula is derived which properly accounts for zero inflation, ICCs due to clustering, unbalanced randomization, and variability in cluster size. Robust approaches, including t-distribution-based approximation and Jackknife re-sampling variance estimator, are employed to enhance trial properties under small sample sizes. Extensive simulations are conducted to evaluate the performance of the proposed method. An application example is presented in a real clinical trial setting.

##### Subjects

Statistical Models, Binomial Distribution, Cluster Analysis, Computer Simulation, Humans, Poisson Distribution, Randomized Controlled Trials as Topic, Sample Size

*BMC Med Res Methodol ; 22(1): 32, 2022 01 30.*

##### ABSTRACT

BACKGROUND: We consider cluster size data of SARS-CoV-2 transmissions for a number of different settings from recently published data. The statistical characteristics of superspreading events are commonly described by fitting a negative binomial distribution to secondary infection and cluster size data as an alternative to the Poisson distribution as it is a longer tailed distribution, with emphasis given to the value of the extra parameter which allows the variance to be greater than the mean. Here we investigate whether other long tailed distributions from more general extended Poisson process modelling can better describe the distribution of cluster sizes for SARS-CoV-2 transmissions. METHODS: We use the extended Poisson process modelling (EPPM) approach with nested sets of models that include the Poisson and negative binomial distributions to assess the adequacy of models based on these standard distributions for the data considered. RESULTS: We confirm the inadequacy of the Poisson distribution in most cases, and demonstrate the inadequacy of the negative binomial distribution in some cases. CONCLUSIONS: The probability of a superspreading event may be underestimated by use of the negative binomial distribution as much larger tail probabilities are indicated by EPPM distributions than negative binomial alternatives. We show that the large shared accommodation, meal and work settings, of the settings considered, have the potential for more severe superspreading events than would be predicted by a negative binomial distribution. Therefore public health efforts to prevent transmission in such settings should be prioritised.

##### Subjects

COVID-19, Pandemics, Binomial Distribution, Humans, Poisson Distribution, SARS-CoV-2

*BMC Bioinformatics ; 23(1): 2, 2022 Jan 04.*

##### ABSTRACT

Cellular heterogeneity underlies cancer evolution and metastasis. Advances in single-cell technologies such as single-cell RNA sequencing and mass cytometry have enabled interrogation of cell type-specific expression profiles and abundance across heterogeneous cancer samples obtained from clinical trials and preclinical studies. However, challenges remain in determining sample sizes needed for ascertaining changes in cell type abundances in a controlled study. To address this statistical challenge, we have developed a new approach, named Sensei, to determine the number of samples and the number of cells that are required to ascertain such changes between two groups of samples in single-cell studies. Sensei expands the t-test and models the cell abundances using a beta-binomial distribution. We evaluate the mathematical accuracy of Sensei and provide practical guidelines on over 20 cell types in over 30 cancer types based on knowledge acquired from the cancer cell atlas (TCGA) and prior single-cell studies. We provide a web application to enable user-friendly study design via https://kchen-lab.github.io/sensei/table_beta.html .