Results 1 - 20 of 81
1.
Biostatistics ; 24(2): 481-501, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34654923

ABSTRACT

In recent years, a number of methods have been proposed to estimate the times at which a neuron spikes on the basis of calcium imaging data. However, quantifying the uncertainty associated with these estimated spikes remains an open problem. We consider a simple and well-studied model for calcium imaging data, which states that calcium decays exponentially in the absence of a spike, and instantaneously increases when a spike occurs. We wish to test the null hypothesis that the neuron did not spike (i.e., that there was no increase in calcium) at a particular timepoint at which a spike was estimated. In this setting, classical hypothesis tests lead to inflated Type I error, because the spike was estimated on the same data used for testing. To overcome this problem, we propose a selective inference approach. We describe an efficient algorithm to compute finite-sample p-values that control selective Type I error, and confidence intervals with correct selective coverage, for spikes estimated using a recent proposal from the literature. We apply our proposal in simulation and on calcium imaging data from the spikefinder challenge.
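
The calcium model described in this abstract can be illustrated with a short simulation. This is only a sketch: the recursion c_t = gamma * c_{t-1} + s_t, the Gaussian observation noise, and every name and default below (`simulate_calcium`, `gamma`, `spike_size`, `noise_sd`) are assumptions made for illustration, not the authors' code.

```python
import random

def simulate_calcium(n, gamma, spike_times, spike_size=1.0, c0=1.0,
                     noise_sd=0.0, seed=0):
    """Simulate calcium c_t and a noisy fluorescence trace y_t = c_t + noise."""
    rng = random.Random(seed)
    spikes = set(spike_times)
    calcium, fluorescence = [], []
    c = c0
    for t in range(n):
        if t > 0:
            c = gamma * c            # exponential decay in the absence of a spike
            if t in spikes:
                c += spike_size      # instantaneous increase when a spike occurs
        calcium.append(c)
        fluorescence.append(c + rng.gauss(0.0, noise_sd))
    return calcium, fluorescence

# Hypothetical example: one spike at t = 4, no observation noise.
calcium, fluorescence = simulate_calcium(n=8, gamma=0.5, spike_times=[4])
```

Under this sketch, the null hypothesis in the abstract corresponds to there being no jump in `calcium` at the timepoint where a spike was estimated.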


Subjects
Calcium; Diagnostic Imaging; Humans; Uncertainty; Action Potentials/physiology; Computer Simulation; Algorithms
2.
PLoS Comput Biol ; 19(10): e1011509, 2023 10.
Article in English | MEDLINE | ID: mdl-37824442

ABSTRACT

A major goal of computational neuroscience is to build accurate models of the activity of neurons that can be used to interpret their function in circuits. Here, we explore using functional cell types to refine single-cell models by grouping them into functionally relevant classes. Formally, we define a hierarchical generative model for cell types, single-cell parameters, and neural responses, and then derive an expectation-maximization algorithm with variational inference that maximizes the likelihood of the neural recordings. We apply this "simultaneous" method to estimate cell types and fit single-cell models from simulated data, and find that it accurately recovers the ground truth parameters. We then apply our approach to in vitro neural recordings from neurons in mouse primary visual cortex, and find that it yields improved prediction of single-cell activity. We demonstrate that the discovered cell-type clusters are well separated and generalizable, and thus amenable to interpretation. We then compare discovered cluster memberships with locational, morphological, and transcriptomic data. Our findings reveal the potential to improve models of neural responses by explicitly allowing for shared functional properties across neurons.


Subjects
Algorithms; Neurons; Mice; Animals; Computer Simulation; Neurons/physiology; Probability; Models, Neurological; Action Potentials/physiology
3.
Biostatistics ; 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-36511385

ABSTRACT

In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
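
A minimal sketch of the splitting idea, assuming it amounts to binomial thinning of each count: if X is Poisson, then X_train ~ Binomial(X, epsilon) and X_test = X - X_train are independent Poisson variables, so one part can be used to estimate the latent variable and the other to test genes. The function name `count_split`, the parameter `epsilon`, and the toy matrix are hypothetical.

```python
import random

def count_split(counts, epsilon=0.5, seed=0):
    """Thin each count X into X_train ~ Binomial(X, epsilon) and X_test = X - X_train."""
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Each of the x unit counts goes to the training part with probability epsilon.
        train_row = [sum(rng.random() < epsilon for _ in range(x)) for x in row]
        train.append(train_row)
        test.append([x - xt for x, xt in zip(row, train_row)])
    return train, test

counts = [[3, 0, 7], [1, 12, 0]]   # hypothetical cells-by-genes count matrix
train, test = count_split(counts)
```

The two parts always sum back to the original counts; the independence between them relies on the Poisson assumption stated in the abstract.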

4.
Biometrics ; 78(3): 1018-1030, 2022 09.
Article in English | MEDLINE | ID: mdl-33792914

ABSTRACT

In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to cocomplex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.


Subjects
Algorithms; Proteins
5.
J R Stat Soc Series B Stat Methodol ; 84(4): 1082-1104, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36419504

ABSTRACT

While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, ℓ0 segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/.
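
For readers unfamiliar with binary segmentation, the estimation step mentioned above can be sketched as follows. This shows only how changepoints might be estimated with a standard CUSUM statistic; it does not implement the post-detection inference the paper proposes, and the function names and threshold are assumptions.

```python
import math

def best_split(y):
    """Return (tau, stat): the split after index tau with the largest absolute CUSUM."""
    n = len(y)
    total = sum(y)
    best_tau, best_stat = None, -1.0
    left = 0.0
    for tau in range(n - 1):           # left part is y[0..tau], right part is the rest
        left += y[tau]
        n_left = tau + 1
        n_right = n - n_left
        mean_diff = left / n_left - (total - left) / n_right
        stat = abs(mean_diff) * math.sqrt(n_left * n_right / n)
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat

def binary_segmentation(y, threshold, offset=0):
    """Recursively split a segment while its best CUSUM statistic exceeds threshold."""
    if len(y) < 2:
        return []
    tau, stat = best_split(y)
    if stat <= threshold:
        return []
    left = binary_segmentation(y[:tau + 1], threshold, offset)
    right = binary_segmentation(y[tau + 1:], threshold, offset + tau + 1)
    return left + [offset + tau] + right

# Hypothetical noiseless series with mean changes after indices 4 and 9.
y = [0.0] * 5 + [3.0] * 5 + [1.0] * 5
changepoints = binary_segmentation(y, threshold=1.0)
```

Each returned index is the last position of a segment. Testing for a change in mean around such an index with a classical test would reuse the data that selected it, which is the problem the abstract addresses.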

6.
Biostatistics ; 21(4): 692-708, 2020 10 01.
Article in English | MEDLINE | ID: mdl-30753304

ABSTRACT

In the Pioneer 100 (P100) Wellness Project, multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this article, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data).


Subjects
Algorithms; Proteomics; Cluster Analysis; Humans
7.
Biostatistics ; 21(4): 709-726, 2020 10 01.
Article in English | MEDLINE | ID: mdl-30753436

ABSTRACT

Calcium imaging data promises to transform the field of neuroscience by making it possible to record from large populations of neurons simultaneously. However, determining the exact moment in time at which a neuron spikes, from a calcium imaging data set, amounts to a non-trivial deconvolution problem which is of critical importance for downstream analyses. While a number of formulations have been proposed for this task in the recent literature, in this article, we focus on a formulation recently proposed in Jewell and Witten (2018. Exact spike train inference via ℓ0 optimization. The Annals of Applied Statistics 12(4), 2457-2482) that can accurately estimate not just the spike rate, but also the specific times at which the neuron spikes. We develop a much faster algorithm that can be used to deconvolve a fluorescence trace of 100 000 timesteps in less than a second. Furthermore, we present a modification to this algorithm that precludes the possibility of a "negative spike". We demonstrate the performance of this algorithm for spike deconvolution on calcium imaging datasets that were recently released as part of the spikefinder challenge (http://spikefinder.codeneuro.org/). The algorithm presented in this article was used in the Allen Institute for Brain Science's "platform paper" to decode neural activity from the Allen Brain Observatory; this is the main scientific paper in which their data resource is presented. Our C++ implementation, along with R and Python wrappers, is publicly available. R code is available on CRAN and GitHub, and Python wrappers are available on GitHub; see https://github.com/jewellsean/FastLZeroSpikeInference.
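
To make the ℓ0 formulation concrete, here is a sketch of a plain quadratic-time dynamic program. It assumes the objective 0.5 * sum((y_t - c_t)^2) + lam * #{t : c_t != gamma * c_{t-1}}. It is not the faster algorithm developed in the paper, it does not rule out "negative spikes", and the function `l0_spike_times` is hypothetical rather than part of FastLZeroSpikeInference.

```python
def l0_spike_times(y, gamma, lam):
    """Return the indices t at which a new exponentially decaying segment starts."""
    n = len(y)
    # best[s] is the optimal cost of y[0..s-1]; best[0] = -lam so that the
    # first segment is not charged a penalty.
    best = [0.0] * (n + 1)
    best[0] = -lam
    last_start = [0] * (n + 1)
    for s in range(1, n + 1):
        best[s] = float("inf")
        sum_sq = weighted = norm = 0.0
        # Grow the final segment y[a..s-1] backwards, one start index a at a time.
        for a in range(s - 1, -1, -1):
            sum_sq += y[a] ** 2
            weighted = y[a] + gamma * weighted   # sum over t of y_t * gamma^(t - a)
            norm = 1.0 + gamma ** 2 * norm       # sum over t of gamma^(2 * (t - a))
            # Least-squares cost of fitting c_t = c_a * gamma^(t - a) on the segment.
            segment_cost = 0.5 * (sum_sq - weighted ** 2 / norm)
            total = best[a] + segment_cost + lam
            if total < best[s]:
                best[s], last_start[s] = total, a
    # Backtrack; index 0 is the start of the trace, not a spike.
    spikes, s = [], n
    while s > 0:
        a = last_start[s]
        if a > 0:
            spikes.append(a)
        s = a
    return sorted(spikes)

# Hypothetical noiseless trace with jumps at t = 10 and t = 20.
gamma = 0.9
y, c = [], 1.0
for t in range(30):
    if t > 0:
        c = gamma * c + (1.0 if t == 10 else 1.5 if t == 20 else 0.0)
    y.append(c)
spikes = l0_spike_times(y, gamma, lam=0.1)
```

The double loop makes the cost grow quadratically in the trace length, which is why a faster method matters for traces of 100 000 timesteps.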


Subjects
Calcium; Neurons; Algorithms; Brain/diagnostic imaging; Diagnostic Imaging; Humans
8.
Stat Sci ; 36(4): 562-577, 2021 Nov.
Article in English | MEDLINE | ID: mdl-37860618

ABSTRACT

A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.

9.
Nucleic Acids Res ; 47(D1): D886-D894, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30371827

ABSTRACT

Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.


Subjects
Databases, Nucleic Acid; Genetic Variation; Genome, Human; Humans; Machine Learning; Molecular Sequence Annotation
10.
Hum Brain Mapp ; 41(10): 2553-2566, 2020 07.
Article in English | MEDLINE | ID: mdl-32216125

ABSTRACT

Brain networks are increasingly characterized at different scales, including summary statistics, community connectivity, and individual edges. While research relating brain networks to behavioral measurements has yielded many insights into brain-phenotype relationships, common analytical approaches only consider network information at a single scale. Here, we designed, implemented, and deployed Multi-Scale Network Regression (MSNR), a penalized multivariate approach for modeling brain networks that explicitly respects both edge- and community-level information by assuming a low rank and sparse structure, both encouraging less complex and more interpretable modeling. Capitalizing on a large neuroimaging cohort (n = 1,051), we demonstrate that MSNR recapitulates interpretable and statistically significant connectivity patterns associated with brain development, sex differences, and motion-related artifacts. Compared to single-scale methods, MSNR achieves a balance between prediction performance and model complexity, with improved interpretability. Together, by jointly exploiting both edge- and community-level information, MSNR has the potential to yield novel insights into brain-behavior relationships.


Subjects
Brain/physiology; Connectome/methods; Magnetic Resonance Imaging/methods; Models, Statistical; Nerve Net/physiology; Adolescent; Brain/diagnostic imaging; Cross-Sectional Studies; Female; Humans; Individuality; Male; Nerve Net/diagnostic imaging; Phenotype; Regression Analysis; Sex Characteristics
11.
Genome Res ; 27(1): 38-52, 2017 01.
Article in English | MEDLINE | ID: mdl-27831498

ABSTRACT

Candidate enhancers can be identified on the basis of chromatin modifications, the binding of chromatin modifiers and transcription factors and cofactors, or chromatin accessibility. However, validating such candidates as bona fide enhancers requires functional characterization, typically achieved through reporter assays that test whether a sequence can increase expression of a transcriptional reporter via a minimal promoter. A longstanding concern is that reporter assays are mainly implemented on episomes, which are thought to lack physiological chromatin. However, the magnitude and determinants of differences in cis-regulation for regulatory sequences residing in episomes versus chromosomes remain almost completely unknown. To address this systematically, we developed and applied a novel lentivirus-based massively parallel reporter assay (lentiMPRA) to directly compare the functional activities of 2236 candidate liver enhancers in an episomal versus a chromosomally integrated context. We find that the activities of chromosomally integrated sequences are substantially different from the activities of the identical sequences assayed on episomes, and furthermore are correlated with different subsets of ENCODE annotations. The results of chromosomally based reporter assays are also more reproducible and more strongly predictable by both ENCODE annotations and sequence-based models. With a linear model that combines chromatin annotations and sequence information, we achieve a Pearson's R² of 0.362 for predicting the results of chromosomally integrated reporter assays. This level of prediction is better than with either chromatin annotations or sequence information alone and also outperforms predictive models of episomal assays. Our results have broad implications for how cis-regulatory elements are identified, prioritized and functionally validated.


Subjects
Chromatin/genetics; Enhancer Elements, Genetic/genetics; Gene Expression Regulation/genetics; Plasmids/genetics; Chromatin Assembly and Disassembly/genetics; Chromosomes/genetics; Genes, Reporter; High-Throughput Nucleotide Sequencing; Humans; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors
12.
Stat Med ; 38(4): 583-600, 2019 02 20.
Article in English | MEDLINE | ID: mdl-30010200

ABSTRACT

In this paper, we consider fitting a flexible and interpretable additive regression model in a data-rich setting. We wish to avoid pre-specifying the functional form of the conditional association between each covariate and the response, while still retaining interpretability of the fitted functions. A number of recent proposals in the literature for nonparametric additive modeling are data adaptive, in the sense that they can adjust the level of flexibility in the functional fits to the data at hand. For instance, the sparse additive model makes it possible to adaptively determine which features should be included in the fitted model, the sparse partially linear additive model allows each feature in the fitted model to take either a linear or a nonlinear functional form, and the recent fused lasso additive model and additive trend filtering proposals allow the knots in each nonlinear function fit to be selected from the data. In this paper, we combine the strengths of each of these recent proposals into a single proposal that uses the data to determine which features to include in the model, whether to model each feature linearly or nonlinearly, and what form to use for the nonlinear functions. We establish connections between our approach and recent proposals from the literature, and we demonstrate its strengths in a simulation study.


Subjects
Data Interpretation, Statistical; Models, Statistical; Humans; Nonlinear Dynamics; Regression Analysis; Statistics, Nonparametric
13.
Biostatistics ; 18(1): 147-164, 2017 01.
Article in English | MEDLINE | ID: mdl-27496912

ABSTRACT

Genomic phenotypes, such as DNA methylation and chromatin accessibility, can be used to characterize the transcriptional and regulatory activity of DNA within a cell. Recent technological advances have made it possible to measure such phenotypes very densely. This density often results in spatial structure, in the sense that measurements at nearby sites are very similar. In this article, we consider the task of comparing genomic phenotypes across experimental conditions, cell types, or disease subgroups. We propose a new method, Joint Adaptive Differential Estimation (JADE), which leverages the spatial structure inherent to genomic phenotypes. JADE simultaneously estimates smooth underlying group average genomic phenotype profiles and detects regions in which the average profile differs between groups. We evaluate JADE's performance in several biologically plausible simulation settings. We also consider an application to the detection of regions with differential methylation between mature skeletal muscle cells, myotubes, and myoblasts.


Subjects
DNA Methylation/genetics; Genome/genetics; Models, Genetic; Models, Statistical; Phenotype; Humans; Muscle Fibers, Skeletal/metabolism; Myoblasts, Skeletal/metabolism
14.
Biostatistics ; 17(4): 677-91, 2016 10.
Article in English | MEDLINE | ID: mdl-27044327

ABSTRACT

In a multivariate setting, we consider the task of identifying features whose correlations with the other features differ across conditions. Such correlation shifts may occur independently of mean shifts, or differences in the means of the individual features across conditions. Previous approaches for detecting correlation shifts consider features simultaneously, by computing a correlation-based test statistic for each feature. However, since correlations involve two features, such approaches do not lend themselves to identifying which feature is the culprit. In this article, we instead consider a serial testing approach, by comparing columns of the sample correlation matrix across two conditions, and removing one feature at a time. Our method provides a novel perspective and favorable empirical results compared with competing approaches.


Subjects
Biostatistics/methods; Data Interpretation, Statistical; Models, Theoretical; Research Design; Humans
16.
PLoS Comput Biol ; 10(7): e1003703, 2014 Jul.
Article in English | MEDLINE | ID: mdl-25010360

ABSTRACT

Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger numbers of tumors should cast light on the clonal evolution of cancers in space and time.


Subjects
Breast Neoplasms/classification; Breast Neoplasms/genetics; Computational Biology/methods; Algorithms; Breast Neoplasms/metabolism; Computer Simulation; Female; Genotype; Humans; Phylogeny
17.
Comput Stat Data Anal ; 85: 23-36, 2015 May.
Article in English | MEDLINE | ID: mdl-25642008

ABSTRACT

The task of estimating a Gaussian graphical model in the high-dimensional setting is considered. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to a lasso penalty, is a well-studied approach for this task. A surprising connection between the graphical lasso and hierarchical clustering is introduced: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) a penalized log likelihood is maximized on the subset of variables within each connected component. Thus, the graphical lasso determines the connected components of the estimated network via single linkage clustering. The single linkage clustering is known to perform poorly in certain finite-sample settings. Therefore, the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster, is proposed. Model selection consistency for this technique is established, and its improved performance relative to the graphical lasso is demonstrated in a simulation study, as well as in applications to a university webpage data set and a gene expression data set.
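
Step (1) of the two-step procedure can be sketched with a small union-find routine. The sketch assumes that the connected components are obtained by linking variables i and j whenever the absolute sample covariance |S[i][j]| exceeds the tuning parameter, which is the same as cutting a single linkage dendrogram built on the similarities |S[i][j]| at that level. The function name and the toy matrix are hypothetical.

```python
def connected_components(S, lam):
    """Group variables joined by a chain of entries with |S[i][j]| > lam."""
    p = len(S)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(S[i][j]) > lam:
                parent[find(i)] = find(j)   # merge the two components

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 4-variable sample covariance matrix.
S = [
    [1.0, 0.6, 0.0, 0.1],
    [0.6, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
components = connected_components(S, lam=0.3)
```

Variables 0 and 2 end up in the same component through variable 1 even though their own covariance is zero. This chaining is characteristic of single linkage, and the cluster graphical lasso replaces this step with an alternative clustering method.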

18.
Nucleic Acids Res ; 40(9): 3849-55, 2012 May.
Article in English | MEDLINE | ID: mdl-22266657

ABSTRACT

A growing body of experimental evidence supports the hypothesis that the 3D structure of chromatin in the nucleus is closely linked to important functional processes, including DNA replication and gene regulation. In support of this hypothesis, several research groups have examined sets of functionally associated genomic loci, with the aim of determining whether those loci are statistically significantly colocalized. This work presents a critical assessment of two previously reported analyses, both of which used genome-wide DNA-DNA interaction data from the yeast Saccharomyces cerevisiae, and both of which rely upon a simple notion of the statistical significance of colocalization. We show that these previous analyses rely upon a faulty assumption, and we propose a correct non-parametric resampling approach to the same problem. Applying this approach to the same data set does not support the hypothesis that transcriptionally coregulated genes tend to colocalize, but strongly supports the colocalization of centromeres, and provides some evidence of colocalization of origins of early DNA replication, chromosomal breakpoints and transfer RNAs.


Subjects
Genome Components; Genomics/methods; Saccharomyces cerevisiae/genetics; Data Interpretation, Statistical; Gene Expression Regulation, Fungal; Genes, Fungal; Genome, Fungal; Statistics, Nonparametric; Transcription, Genetic
19.
J Am Stat Assoc ; 119(545): 332-342, 2024.
Article in English | MEDLINE | ID: mdl-38660582

ABSTRACT

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.

20.
bioRxiv ; 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39005417

ABSTRACT

The central amygdala (CeA) has emerged as an important brain region for regulating both negative (fear and anxiety) and positive (reward) affective behaviors. The CeA has been proposed to encode affective information in the form of valence (whether the stimulus is good or bad) or salience (how significant the stimulus is), but the extent to which these two types of stimulus representation occur in the CeA is not known. Here, we used single-cell calcium imaging in mice during appetitive and aversive conditioning and found that the majority of CeA neurons (∼65%) encode the valence of the unconditioned stimulus (US), with a smaller subset of cells (∼15%) encoding the salience of the US. Valence and salience encoding of the conditioned stimulus (CS) was also observed, albeit to a lesser extent. These findings show that the CeA is a site of convergence for encoding oppositely valenced US information.
