Results 1 - 7 of 7
1.
J Am Stat Assoc ; 119(545): 332-342, 2024.
Article in English | MEDLINE | ID: mdl-38660582

ABSTRACT

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
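
The inflation described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' selective inference procedure; it only reproduces the problem. It assumes one-dimensional Gaussian data with no true group structure, a hand-written 2-means clustering, and a large-sample z-test standing in for the classical test. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def two_means_1d(x, max_iter=100):
    """Lloyd's algorithm with k=2 on one-dimensional data; returns a 0/1 label per point."""
    lo, hi = min(x), max(x)  # start the centers at the extremes so neither cluster is empty
    labels = []
    for _ in range(max_iter):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in x]
        new_lo = mean([v for v, lab in zip(x, labels) if lab == 0])
        new_hi = mean([v for v, lab in zip(x, labels) if lab == 1])
        if new_lo == lo and new_hi == hi:  # centers stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels


def z_test_p_value(a, b):
    """Two-sided large-sample z-test for a difference in means (unequal variances)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(cluster_first, n_sims=200, n=100, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Every observation comes from the same N(0, 1): there is no true difference in means.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if cluster_first:
            labels = two_means_1d(x)            # groups chosen by looking at the data
        else:
            labels = [i % 2 for i in range(n)]  # groups fixed before seeing the data
        a = [v for v, lab in zip(x, labels) if lab == 0]
        b = [v for v, lab in zip(x, labels) if lab == 1]
        if z_test_p_value(a, b) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("groups fixed a priori:", rejection_rate(cluster_first=False))
    print("groups from clustering:", rejection_rate(cluster_first=True))
```

With groups fixed in advance, the rejection rate should stay near the nominal level; with groups defined by clustering the same data, the naive test rejects on almost every null data set.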

2.
ArXiv ; 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-38076519

ABSTRACT

For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a single feature between a pair of clusters obtained using hierarchical or k-means clustering. The test based on the proposed p-value controls the selective Type I error rate in finite samples and can be efficiently computed. We further illustrate the validity and power of our proposal in simulation and demonstrate its use on single-cell RNA-sequencing data.

3.
bioRxiv ; 2023 Jun 09.
Article in English | MEDLINE | ID: mdl-37333112

ABSTRACT

Whole-chromosome aneuploidy and large segmental amplifications can have devastating effects in multicellular organisms, from developmental disorders and miscarriage to cancer. Aneuploidy in single-celled organisms such as yeast also results in proliferative defects and reduced viability. Yet, paradoxically, copy number variants (CNVs) are routinely observed in laboratory evolution experiments with microbes grown in stressful conditions. The defects associated with aneuploidy are often attributed to the imbalance of many differentially expressed genes on the affected chromosomes, with many genes each contributing incremental effects. An alternate hypothesis is that a small number of individual genes are large-effect 'drivers' of these fitness changes when present in an altered copy number. To test these two views, we have employed a collection of strains bearing large chromosomal amplifications that we previously assayed in nutrient-limited chemostat competitions. In this study, we focus on conditions known to be poorly tolerated by aneuploid yeast: high temperature, treatment with the Hsp90 inhibitor radicicol, and growth in extended stationary phase. To identify potential genes with a large impact on fitness, we fit a piecewise constant model to fitness data across chromosome arms, filtering breakpoints in this model by magnitude to focus on regions with a large impact on fitness in each condition. While fitness generally decreased as the length of the amplification increased, we were able to identify 91 candidate regions that disproportionately impacted fitness when amplified. Consistent with our previous work with this strain collection, nearly all candidate regions were condition specific, with only five regions impacting fitness in multiple conditions.

4.
Biostatistics ; 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-36511385

ABSTRACT

In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
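
The abstract does not spell out the construction, so the following is only a sketch of the idea as it is usually understood: under a Poisson model, binomially thinning each count yields a training part and a test part that are independent of each other. The tuning parameter `eps`, the rate `5.0`, and all function names are assumptions made for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates such as the one used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def count_split(count, eps, rng):
    """Assign each of the `count` units to the training part with probability eps.

    If count ~ Poisson(lam), then train ~ Poisson(eps * lam) and
    test ~ Poisson((1 - eps) * lam), and the two parts are independent.
    """
    train = sum(1 for _ in range(count) if rng.random() < eps)
    return train, count - train


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    rng = random.Random(1)
    counts = [sample_poisson(5.0, rng) for _ in range(20000)]
    pairs = [count_split(c, 0.5, rng) for c in counts]
    train = [p[0] for p in pairs]
    test = [p[1] for p in pairs]
    print("mean of training counts:", sum(train) / len(train))
    print("correlation between training and test counts:", pearson(train, test))
```

In this setup the latent variable would be estimated from the training counts only, and the per-gene tests would be run on the test counts. If the counts are not Poisson, the two parts are generally not independent, which is why the abstract states the Poisson assumption.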

5.
Biometrics ; 78(3): 1018-1030, 2022 09.
Article in English | MEDLINE | ID: mdl-33792914

ABSTRACT

In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to cocomplex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.


Subjects
Algorithms, Proteins
6.
Article in English | MEDLINE | ID: mdl-38481523

ABSTRACT

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.

7.
Biostatistics ; 21(4): 692-708, 2020 10 01.
Article in English | MEDLINE | ID: mdl-30753304

ABSTRACT

In the Pioneer 100 (P100) Wellness Project, multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this article, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data).


Subjects
Algorithms, Proteomics, Cluster Analysis, Humans
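
The question posed in this abstract (are the clusterings in two data views dependent or independent?) can be made concrete with a much simpler stand-in. The sketch below is not the test developed in the paper. It assumes each view has already been clustered separately, treats the resulting labels as given, and runs a permutation test on the chi-squared statistic of their contingency table. All names are hypothetical.

```python
import random
from collections import Counter


def chi2_statistic(a, b):
    """Pearson chi-squared statistic for the contingency table of two label vectors."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    stat = 0.0
    for i in count_a:
        for j in count_b:
            expected = count_a[i] * count_b[j] / n
            stat += (count_ab[(i, j)] - expected) ** 2 / expected
    return stat


def permutation_p_value(a, b, n_perm=999, seed=0):
    """P-value for the null hypothesis that the two label vectors are independent.

    Shuffling one label vector breaks any link between the views while keeping
    the cluster sizes in each view fixed.
    """
    rng = random.Random(seed)
    observed = chi2_statistic(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi2_statistic(a, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


if __name__ == "__main__":
    # Hypothetical labels for 60 participants in two views.
    view1 = [i % 3 for i in range(60)]
    view2_same = list(view1)                 # identical clustering in both views
    view2_other = [i // 20 for i in range(60)]  # unrelated to view1 by construction
    print("identical clusterings:", permutation_p_value(view1, view2_same))
    print("unrelated clusterings:", permutation_p_value(view1, view2_other))
```

A small p-value suggests the cluster memberships in the two views are associated. The labels for each view must come from clustering that view alone, otherwise the permutation null is not appropriate.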