Results 1 - 20 of 80
1.
J Am Stat Assoc ; 119(545): 332-342, 2024.
Article in English | MEDLINE | ID: mdl-38660582

ABSTRACT

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
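The inflation described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it draws one-dimensional data with no true group structure, forms two "clusters" by splitting at the median (a simplified stand-in for hierarchical clustering, chosen here only to keep the example short), and applies a naive large-sample z-test. The function names and default settings are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, variance


def naive_p_value(group_a, group_b):
    """Two-sided large-sample z-test p-value for a difference in means."""
    se = (variance(group_a) / len(group_a) + variance(group_b) / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(n_sims=200, n=50, alpha=0.05, seed=0):
    """Fraction of null data sets in which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Every observation comes from the same N(0, 1) distribution,
        # so there is no true difference in means between any groups.
        x = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        # Assumption: a median split stands in for the clustering step.
        low, high = x[: n // 2], x[n // 2:]
        if naive_p_value(low, high) < alpha:
            rejections += 1
    return rejections / n_sims
```

Calling `rejection_rate()` returns a value far above the nominal level of 0.05, because the groups were chosen from the same data to be as different as possible.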

2.
PLoS Comput Biol ; 19(10): e1011509, 2023 10.
Article in English | MEDLINE | ID: mdl-37824442

ABSTRACT

A major goal of computational neuroscience is to build accurate models of the activity of neurons that can be used to interpret their function in circuits. Here, we explore using functional cell types to refine single-cell models by grouping them into functionally relevant classes. Formally, we define a hierarchical generative model for cell types, single-cell parameters, and neural responses, and then derive an expectation-maximization algorithm with variational inference that maximizes the likelihood of the neural recordings. We apply this "simultaneous" method to estimate cell types and fit single-cell models from simulated data, and find that it accurately recovers the ground truth parameters. We then apply our approach to in vitro neural recordings from neurons in mouse primary visual cortex, and find that it yields improved prediction of single-cell activity. We demonstrate that the discovered cell-type clusters are well separated and generalizable, and thus amenable to interpretation. We then compare discovered cluster memberships with locational, morphological, and transcriptomic data. Our findings reveal the potential to improve models of neural responses by explicitly allowing for shared functional properties across neurons.


Subjects
Algorithms , Neurons , Mice , Animals , Computer Simulation , Neurons/physiology , Probability , Models, Neurological , Action Potentials/physiology
3.
bioRxiv ; 2023 Jun 09.
Article in English | MEDLINE | ID: mdl-37333112

ABSTRACT

Whole-chromosome aneuploidy and large segmental amplifications can have devastating effects in multicellular organisms, from developmental disorders and miscarriage to cancer. Aneuploidy in single-celled organisms such as yeast also results in proliferative defects and reduced viability. Yet, paradoxically, CNVs are routinely observed in laboratory evolution experiments with microbes grown in stressful conditions. The defects associated with aneuploidy are often attributed to the imbalance of many differentially expressed genes on the affected chromosomes, with many genes each contributing incremental effects. An alternate hypothesis is that a small number of individual genes are large-effect 'drivers' of these fitness changes when present in an altered copy number. To test these two views, we have employed a collection of strains bearing large chromosomal amplifications that we previously assayed in nutrient-limited chemostat competitions. In this study, we focus on conditions known to be poorly tolerated by aneuploid yeast: high temperature, treatment with the Hsp90 inhibitor radicicol, and growth in extended stationary phase. To identify potential genes with a large impact on fitness, we fit a piecewise constant model to fitness data across chromosome arms, filtering breakpoints in this model by magnitude to focus on regions with a large impact on fitness in each condition. While fitness generally decreased as the length of the amplification increased, we were able to identify 91 candidate regions that disproportionately impacted fitness when amplified. Consistent with our previous work with this strain collection, nearly all candidate regions were condition specific, with only five regions impacting fitness in multiple conditions.

4.
bioRxiv ; 2023 Mar 01.
Article in English | MEDLINE | ID: mdl-36909648

ABSTRACT

A major goal of computational neuroscience is to build accurate models of the activity of neurons that can be used to interpret their function in circuits. Here, we explore using functional cell types to refine single-cell models by grouping them into functionally relevant classes. Formally, we define a hierarchical generative model for cell types, single-cell parameters, and neural responses, and then derive an expectation-maximization algorithm with variational inference that maximizes the likelihood of the neural recordings. We apply this "simultaneous" method to estimate cell types and fit single-cell models from simulated data, and find that it accurately recovers the ground truth parameters. We then apply our approach to in vitro neural recordings from neurons in mouse primary visual cortex, and find that it yields improved prediction of single-cell activity. We demonstrate that the discovered cell-type clusters are well separated and generalizable, and thus amenable to interpretation. We then compare discovered cluster memberships with locational, morphological, and transcriptomic data. Our findings reveal the potential to improve models of neural responses by explicitly allowing for shared functional properties across neurons.

5.
bioRxiv ; 2023 Mar 20.
Article in English | MEDLINE | ID: mdl-36993278

ABSTRACT

Material- and cell-based technologies such as engineered tissues hold great promise as human therapies. Yet, the development of many of these technologies becomes stalled at the stage of pre-clinical animal studies due to the tedious and low-throughput nature of in vivo implantation experiments. We introduce a 'plug and play' in vivo screening array platform called Highly Parallel Tissue Grafting (HPTG). HPTG enables parallelized in vivo screening of 43 three-dimensional microtissues within a single 3D printed device. Using HPTG, we screen microtissue formations with varying cellular and material components and identify formulations that support vascular self-assembly, integration and tissue function. Our studies highlight the importance of combinatorial studies that vary cellular and material formulation variables concomitantly, by revealing that inclusion of stromal cells can "rescue" vascular self-assembly in a manner that is material-dependent. HPTG provides a route for accelerating pre-clinical progress for diverse medical applications including tissue therapy, cancer biomedicine, and regenerative medicine.

6.
Biostatistics ; 24(2): 481-501, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34654923

ABSTRACT

In recent years, a number of methods have been proposed to estimate the times at which a neuron spikes on the basis of calcium imaging data. However, quantifying the uncertainty associated with these estimated spikes remains an open problem. We consider a simple and well-studied model for calcium imaging data, which states that calcium decays exponentially in the absence of a spike, and instantaneously increases when a spike occurs. We wish to test the null hypothesis that the neuron did not spike (i.e., that there was no increase in calcium) at a particular timepoint at which a spike was estimated. In this setting, classical hypothesis tests lead to inflated Type I error, because the spike was estimated on the same data used for testing. To overcome this problem, we propose a selective inference approach. We describe an efficient algorithm to compute finite-sample p-values that control selective Type I error, and confidence intervals with correct selective coverage, for spikes estimated using a recent proposal from the literature. We apply our proposal in simulation and on calcium imaging data from the spikefinder challenge.
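The calcium model summarized in this abstract (exponential decay between spikes, an instantaneous jump at each spike) can be sketched as a first-order recursion. The decay rate `gamma`, the starting value `c0`, and the function name are assumptions made for illustration; the abstract does not give the notation, and observation noise is omitted.

```python
def simulate_calcium(spikes, gamma=0.9, c0=0.0):
    """Noise-free calcium trace: c_t = gamma * c_{t-1} + s_t.

    `spikes` holds the jump size at each timepoint (0 when no spike occurs);
    `gamma` in (0, 1) is the assumed per-step decay rate.
    """
    calcium = []
    previous = c0
    for s in spikes:
        # Decay the previous concentration, then add any instantaneous jump.
        current = gamma * previous + s
        calcium.append(current)
        previous = current
    return calcium
```

For example, `simulate_calcium([0, 1, 0, 0], gamma=0.5)` returns `[0.0, 1.0, 0.5, 0.25]`: the trace jumps at the spike and then halves at every step. Under such a model, testing for a spike at a timepoint amounts to testing whether the jump size there is zero.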


Subjects
Calcium , Diagnostic Imaging , Humans , Uncertainty , Action Potentials/physiology , Computer Simulation , Algorithms
7.
J Am Stat Assoc ; 118(544): 2383-2393, 2023.
Article in English | MEDLINE | ID: mdl-38283734

ABSTRACT

We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of multipliers algorithm. We establish nonasymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded (1+δ)-th moment with δ∈(0,1), the rate of convergence is a function of δ, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. We illustrate the performance of the proposed method via extensive numerical studies and a data application. Supplementary materials for this article are available online.

8.
J Mach Learn Res ; 24, 2023 May.
Article in English | MEDLINE | ID: mdl-38264325

ABSTRACT

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.

9.
J Comput Graph Stat ; 32(2): 577-587, 2023.
Article in English | MEDLINE | ID: mdl-38250478

ABSTRACT

The graph fused lasso, which includes as a special case the one-dimensional fused lasso, is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fused lasso. A naive procedure such as a z-test for a difference in means will not control the selective Type I error, since the hypothesis that we are testing is itself a function of the data. In this work, we propose a new test for this task that controls the selective Type I error, and conditions on less information than existing approaches, leading to substantially higher power. We illustrate our approach in simulation and on datasets of drug overdose death rates and teenage birth rates in the contiguous United States. Our approach yields more discoveries on both datasets. Supplementary materials for this article are available online.

10.
Biostatistics ; 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-36511385

ABSTRACT

In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
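A minimal sketch of the splitting step follows, assuming it is carried out by binomial thinning: if a count X is Poisson with mean λ and X_train given X is Binomial(X, ε), then X_train and X_test = X − X_train are independent Poisson variables with means ελ and (1 − ε)λ. The function name, the `eps` parameter, and the seeding are illustrative assumptions and do not reflect the interface of any particular package.

```python
import random


def count_split(counts, eps=0.5, seed=0):
    """Split each count into a training part and a test part by binomial thinning."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    train, test = [], []
    for x in counts:
        # Each unit of the count goes to the training part independently
        # with probability eps, so kept | x ~ Binomial(x, eps).
        kept = sum(1 for _ in range(x) if rng.random() < eps)
        train.append(kept)
        test.append(x - kept)
    return train, test
```

In use, the latent variable (for example, cell type or pseudotime) would be estimated from the training counts only, and each gene would then be tested against it using the test counts only. The independence of the two parts relies on the Poisson assumption named in the abstract.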

11.
J R Stat Soc Series B Stat Methodol ; 84(4): 1082-1104, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36419504

ABSTRACT

While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, ℓ0 segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/.

12.
Biometrics ; 78(3): 1018-1030, 2022 09.
Article in English | MEDLINE | ID: mdl-33792914

ABSTRACT

In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to cocomplex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.


Subjects
Algorithms , Proteins
13.
Article in English | MEDLINE | ID: mdl-38481523

ABSTRACT

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.

14.
Stat ; 11(1), 2022 Dec.
Article in English | MEDLINE | ID: mdl-38250253

ABSTRACT

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
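The weighted false discovery proportion described in this abstract, the fraction of the selected features' total cost that is spent on unimportant features, can be written down directly. The sketch below uses hypothetical names (`selected`, `important`, `cost`) and treats an empty or zero-cost selection as wasting nothing, which is a convention assumed here rather than taken from the abstract.

```python
def weighted_fdp(selected, important, cost):
    """Fraction of the selected features' total cost spent on unimportant ones."""
    total = sum(cost[j] for j in selected)
    if total == 0:
        # Assumed convention: nothing selected (or zero cost) wastes nothing.
        return 0.0
    wasted = sum(cost[j] for j in selected if j not in important)
    return wasted / total
```

With costs `{"a": 1, "b": 4, "c": 5}`, selecting all three features when only `a` and `b` are important gives a weighted proportion of 0.5, whereas the unweighted false discovery proportion would be 1/3. This is the sense in which wrongly including an expensive feature is penalized more heavily than wrongly including a cheap one.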

15.
PLoS One ; 16(6): e0252345, 2021.
Article in English | MEDLINE | ID: mdl-34086726

ABSTRACT

Calcium imaging has led to discoveries about neural correlates of behavior in subcortical neurons, including dopamine (DA) neurons. However, spike inference methods have not been tested in most populations of subcortical neurons. To address this gap, we simultaneously performed calcium imaging and electrophysiology in DA neurons in brain slices and applied a recently developed spike inference algorithm to the GCaMP fluorescence. This revealed that individual spikes can be inferred accurately in this population. Next, we inferred spikes in vivo from calcium imaging from these neurons during Pavlovian conditioning, as well as during navigation in virtual reality. In both cases, we quantitatively recapitulated previous in vivo electrophysiological observations. Our work provides a validated approach to infer spikes from calcium imaging in DA neurons and implies that aspects of both tonic and phasic spike patterns can be recovered.


Subjects
Calcium/metabolism , Dopamine/metabolism , Dopaminergic Neurons/metabolism , Action Potentials/physiology , Algorithms , Animals , Brain/metabolism , Calcium Signaling/physiology , Conditioning, Classical/physiology , Electrophysiological Phenomena/physiology , Mice
17.
Stat Sci ; 36(4): 562-577, 2021 Nov.
Article in English | MEDLINE | ID: mdl-37860618

ABSTRACT

A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.

18.
Comput Med Imaging Graph ; 87: 101832, 2021 01.
Article in English | MEDLINE | ID: mdl-33302246

ABSTRACT

BACKGROUND: Pathologists analyze biopsy material at both the cellular and structural level to determine diagnosis and cancer stage. Mitotic figures are surrogate biomarkers of cellular proliferation that can provide prognostic information; thus, their precise detection is an important factor for clinical care. Convolutional Neural Networks (CNNs) have shown remarkable performance on several recognition tasks. Utilizing CNNs for mitosis classification may help pathologists improve detection accuracy. METHODS: We studied two state-of-the-art CNN-based models, ESPNet and DenseNet, for mitosis classification on six whole slide images of skin biopsies and compared their quantitative performance in terms of sensitivity, specificity, and F-score. We used raw RGB images of mitosis and non-mitosis samples with their corresponding labels as training input. In order to compare with other work, we studied the performance of these classifiers and two other architectures, ResNet and ShuffleNet, on the publicly available MITOS breast biopsy dataset and compared the performance of all four in terms of precision, recall, and F-score (which are standard for this data set), architecture, training time and inference time. RESULTS: The ESPNet and DenseNet results on our primary melanoma dataset had a sensitivity of 0.976 and 0.968, and a specificity of 0.987 and 0.995, respectively, with F-scores of 0.968 and 0.976, respectively. On the MITOS dataset, ESPNet and DenseNet showed a sensitivity of 0.866 and 0.916, and a specificity of 0.973 and 0.980, respectively. The MITOS results using DenseNet had a precision of 0.939, recall of 0.916, and F-score of 0.927. The best published result on MITOS (Saha et al. 2018) reported precision of 0.92, recall of 0.88, and F-score of 0.90. In our architecture comparisons on MITOS, we found that DenseNet beats the others in terms of F-Score (DenseNet 0.927, ESPNet 0.890, ResNet 0.865, ShuffleNet 0.847) and especially Recall (DenseNet 0.916, ESPNet 0.866, ResNet 0.807, ShuffleNet 0.753), while ResNet and ESPNet have much faster inference times (ResNet 6 s, ESPNet 8 s, DenseNet 31 s). ResNet is faster than ESPNet, but ESPNet has a higher F-Score and Recall than ResNet, making it a good compromise solution. CONCLUSION: We studied several state-of-the-art CNNs for detecting mitotic figures in whole slide biopsy images. We evaluated two CNNs on a melanoma cancer dataset and then compared four CNNs on a public breast cancer data set, using the same methodology on both. Our methodology and architecture for mitosis finding in both melanoma and breast cancer whole slide images has been thoroughly tested and is likely to be useful for finding mitoses in any whole slide biopsy images.


Subjects
Breast Neoplasms , Machine Learning , Female , Humans , Mitosis , Neural Networks, Computer
19.
Ann Appl Stat ; 14(1): 94-115, 2020 Mar.
Article in English | MEDLINE | ID: mdl-32983313

ABSTRACT

Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
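As a rough illustration of the kind of model this abstract describes, the sketch below evaluates a beta-binomial probability mass function parameterized by a mean relative abundance `mu` and an overdispersion `phi`. This parameterization (with `mu = α/(α+β)` and `phi = 1/(α+β+1)`) is an assumption chosen for illustration; the abstract does not state the authors' parameterization, and the link to covariates is omitted.

```python
import math


def beta_binomial_pmf(k, n, mu, phi):
    """P(K = k) for a beta-binomial with n trials, mean proportion mu, overdispersion phi.

    Assumed parameterization: alpha = mu * (1 - phi) / phi and
    beta = (1 - mu) * (1 - phi) / phi, with 0 < mu < 1 and 0 < phi < 1.
    """
    alpha = mu * (1.0 - phi) / phi
    beta = (1.0 - mu) * (1.0 - phi) / phi
    # log of the binomial coefficient C(n, k)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # log B(k + alpha, n - k + beta)
    log_beta_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
                    - math.lgamma(n + alpha + beta))
    # log B(alpha, beta)
    log_beta_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp(log_choose + log_beta_num - log_beta_den)
```

Under this parameterization the count has mean `n * mu` and variance `n * mu * (1 - mu) * (1 + (n - 1) * phi)`, so `phi` controls the extra variability beyond a binomial. Letting both `mu` and `phi` depend on covariates is what would allow separate tests for differential abundance and differential variability.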

20.
Biometrika ; 107(2): 293-310, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32454528

ABSTRACT

The fused lasso, also known as total-variation denoising, is a locally adaptive function estimator over a regular grid of design points. In this article, we extend the fused lasso to settings in which the points do not occur on a regular grid, leading to a method for nonparametric regression. This approach, which we call the K-nearest-neighbours fused lasso, involves computing the K-nearest-neighbours graph of the design points and then performing the fused lasso over this graph. We show that this procedure has a number of theoretical advantages over competing methods: specifically, it inherits local adaptivity from its connection to the fused lasso, and it inherits manifold adaptivity from its connection to the K-nearest-neighbours approach. In a simulation study and an application to flu data, we show that excellent results are obtained. For completeness, we also study an estimator that makes use of an ε-graph rather than a K-nearest-neighbours graph and contrast it with the K-nearest-neighbours fused lasso.
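The first step of the approach in this abstract, building a nearest-neighbours graph over the design points, can be sketched as follows, together with the fused lasso penalty evaluated on that graph. This is an illustrative sketch, not the authors' implementation: the function names, the parameter `k`, the symmetrization rule (an edge is kept if either point is among the other's `k` nearest), and the brute-force search are all assumptions, and no solver for the fused lasso problem itself is included.

```python
import math


def knn_graph(points, k):
    """Undirected edges (i, j), i < j, linking each point to its k nearest neighbours."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    edges = set()
    for i in range(n):
        # Brute-force search: sort the other points by Euclidean distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def fused_penalty(theta, edges):
    """Fused lasso penalty: sum of |theta_i - theta_j| over the graph's edges."""
    return sum(abs(theta[i] - theta[j]) for i, j in edges)
```

For the four one-dimensional points `[(0.0,), (1.0,), (10.0,), (11.0,)]` with `k=1`, the graph consists of the edges `(0, 1)` and `(2, 3)`, and any fit that is constant within each pair incurs zero penalty. A fused lasso estimate would trade this penalty off against squared error to the observed responses.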
