Results 1 - 11 of 11
1.
Proc Natl Acad Sci U S A ; 119(34): e2205518119, 2022 08 23.
Article in English | MEDLINE | ID: mdl-35969737

ABSTRACT

Testing the significance of predictors in a regression model is one of the most important topics in statistics. This problem is especially difficult without any parametric assumptions on the data. This paper aims to test the null hypothesis that given confounding variables Z, X does not significantly contribute to the prediction of Y under the model-free setting, where X and Z are possibly high dimensional. We propose a general framework that first fits nonparametric machine learning regression algorithms on [Formula: see text] and [Formula: see text], then compares the prediction power of the two models. The proposed method allows us to leverage the strength of the most powerful regression algorithms developed in the modern machine learning community. The P value for the test can be easily obtained by permutation. In simulations, we find that the proposed method is more powerful compared to existing methods. The proposed method allows us to draw biologically meaningful conclusions from two gene expression data analyses without strong distributional assumptions: 1) testing the prediction power of sequencing RNA for the proteins in cellular indexing of transcriptomes and epitopes by sequencing data and 2) identification of spatially variable genes in spatially resolved transcriptomics data.


Subjects
Genomics, Machine Learning, Algorithms, Regression Analysis, Transcriptome
2.
Biostatistics ; 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-36511385

ABSTRACT

In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
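The splitting step described in this abstract can be illustrated with a minimal sketch. It assumes the standard binomial-thinning construction: if a count is Poisson with mean λ, drawing a Binomial(count, eps) training part leaves a test part such that the two parts are independent Poisson variables with means eps·λ and (1 − eps)·λ. The function name `count_split`, the parameter `eps`, and the toy data are illustrative assumptions, not the authors' software.

```python
import numpy as np


def count_split(counts, eps=0.5, seed=None):
    """Split integer counts into train/test parts by binomial thinning.

    Hypothetical helper for illustration; assumes Poisson-distributed counts.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be strictly between 0 and 1")
    counts = np.asarray(counts)
    if not np.issubdtype(counts.dtype, np.integer) or (counts < 0).any():
        raise ValueError("counts must be non-negative integers")
    rng = np.random.default_rng(seed)
    # Each observed count is thinned independently with probability eps.
    train = rng.binomial(counts, eps)
    # The remainder forms the test part, so train + test == counts.
    test = counts - train
    return train, test


# Toy cells-by-genes matrix (simulated, not real data).
toy_rng = np.random.default_rng(0)
counts = toy_rng.poisson(lam=5.0, size=(200, 30))
train, test = count_split(counts, eps=0.5, seed=1)
```

In this sketch the latent variable (for example, clusters or pseudotime) would be estimated from `train` only, and each gene would then be tested against it using `test` only.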

3.
J Equine Sci ; 34(2): 21-27, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37405066

ABSTRACT

Gene doping, which is prohibited in horseracing and equestrian sports, can be performed by introducing exogenous genes, known as transgenes, into the bodies of postnatal animals. To detect exogenous genes, a method utilizing quantitative polymerase chain reaction (qPCR) with a hydrolysis probe was developed to test whole blood and plasma samples, thereby protecting the fairness of competition and the rights of stakeholders in horseracing and equestrian sports. Therefore, we aimed to develop sample storage methods suitable for A and B samples in gene doping tests using blood. For sample A, sufficient qPCR detection was demonstrated after refrigeration for 1 to 2 weeks post collection. For sample B, the following procedures were confirmed to be suitable for storage: 1) centrifugation after sample receipt, 2) frozen storage, 3) natural thawing at room temperature, and 4) centrifugation without mixing blood cell components. Our results indicated that long-term cryopreservation yielded good plasma components from frozen blood samples even though it destroyed blood cells, indicating its applicability to the gene doping test using sample B, which can be stored for later use. Sample storage procedures are as important as detection methods in doping tests. Therefore, the series of procedures that we evaluated in this study will contribute to the efficient performance of gene doping tests through qPCR using blood samples.

4.
Behav Res Methods ; 54(6): 2665-2677, 2022 12.
Article in English | MEDLINE | ID: mdl-34918226

ABSTRACT

Nowadays, exploratory and confirmatory factor analyses are two important consecutive steps in an overall analysis process. The overall analysis should start with an exploratory factor analysis that explores the data and establishes a hypothesis for the factor model in the population. Then, the analysis process should be continued with a confirmatory factor analysis to assess whether the hypothesis proposed in the exploratory step is plausible in the population. To carry out the analysis, researchers usually collect a single sample, and then split it into two halves. As no specific splitting methods have been proposed to date in the context of factor analysis, researchers use a random split approach. In this paper we propose a method to split samples into equivalent subsamples similar to one that has already been proposed in the context of multivariate regression analysis. The method was tested in simulation studies and in real datasets.


Subjects
Factor Analysis, Humans
5.
Am J Epidemiol ; 190(8): 1483-1487, 2021 08 01.
Article in English | MEDLINE | ID: mdl-33751059

ABSTRACT

In this issue of the Journal, Mooney et al. (Am J Epidemiol. 2021;190(8):1476-1482) discuss machine learning as a tool for causal research in the style of Internet headlines. Here we comment by adapting famous literary quotations, including the one in our title (from "Sonnet 43" by Elizabeth Barrett Browning (Sonnets From the Portuguese, Adelaide Hanscom Leeson, 1850)). We emphasize that any use of machine learning to answer causal questions must be founded on a formal framework for both causal and statistical inference. We illustrate the pitfalls that can occur without such a foundation. We conclude with some practical recommendations for integrating machine learning into causal analyses in a principled way and highlight important areas of ongoing work.


Subjects
Love, Machine Learning, Causality, Humans
6.
Stat Med ; 40(22): 4872-4889, 2021 09 30.
Article in English | MEDLINE | ID: mdl-34121214

ABSTRACT

Clinical trials require substantial effort and time to complete, and regulatory agencies may require two successful efficacy trials before approving a new drug. One way to improve the chance of follow-up success is to identify a subpopulation among whom treatment effects are estimated to be beneficial, and to enroll future studies from this subpopulation. In this article we study confirmable responder class (CRC) learning, where the objective is to learn in a random half of the dataset (training set) a subpopulation among whom the predicted conditional average treatment effect (CATE) suggests clinically meaningful benefit, with maximum power of being confirmed via hypothesis test in the other half (test set). We studied a set of CRC learners across simulated datasets in which either all patients benefited, or only some did. Performance metrics included the rate of confirmation in the test set, and the classification accuracy of the CRC compared with the group with true CATE>0. In trials where all patients benefited, only two learners were able to consistently identify most of the population as the CRC. One of these also performed especially well when only some patients benefited, having relatively high confirmation rates, and showing robustness to the dimension of the covariate vector and population characteristics. This learner is based on cross-validation, and is a possible avenue for further development of model selection criteria for CRC learning. We also show that the performance of all methods can be improved by using both halves of the original dataset as training and test sets in turn.


Subjects
Learning, Machine Learning, Clinical Trials as Topic, Humans, Patient Selection
7.
J Econom ; 215(1): 118-130, 2020 Mar.
Article in English | MEDLINE | ID: mdl-32773919

ABSTRACT

This paper develops a new estimation procedure for an ultrahigh dimensional sparse precision matrix, the inverse of the covariance matrix. Regularization methods have been proposed for sparse precision matrix estimation, but they may not perform well with ultrahigh dimensional data due to spurious correlation. We propose a refitted cross validation (RCV) method for sparse precision matrix estimation based on its Cholesky decomposition, which does not require the Gaussian assumption. The proposed RCV procedure can be easily implemented with existing software for ultrahigh dimensional linear regression. We establish the consistency of the proposed RCV estimation and show that the rate of convergence of the RCV estimation without assuming banded structure is the same as that of those assuming the banded structure in Bickel and Levina (2008b). Monte Carlo studies were conducted to assess the finite sample performance of the RCV estimation. Our numerical comparison shows that the RCV estimation outperforms the existing ones in various scenarios. We further apply the RCV estimation to an empirical analysis of asset allocation.
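The link between a Cholesky decomposition of the precision matrix and linear regression can be sketched as follows. This is only the generic modified Cholesky construction, in which each variable is regressed on the variables before it and the precision matrix is assembled as T' D^{-1} T from the regression coefficients (T) and residual variances (D). It uses ordinary least squares on the full sample, so it assumes fewer variables than observations and omits the refitted cross validation step (variable selection on one half of the data, refitting on the other). The function name `cholesky_precision` and the toy data are illustrative assumptions.

```python
import numpy as np


def cholesky_precision(X):
    """Precision matrix via sequential regressions (modified Cholesky).

    Hypothetical helper for illustration; plain OLS, no selection or refitting.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # center each column
    T = np.eye(p)            # unit lower-triangular coefficient matrix
    d = np.empty(p)          # residual variances
    d[0] = Xc[:, 0] @ Xc[:, 0] / n
    for j in range(1, p):
        # Regress variable j on all preceding variables.
        coef, *_ = np.linalg.lstsq(Xc[:, :j], Xc[:, j], rcond=None)
        resid = Xc[:, j] - Xc[:, :j] @ coef
        T[j, :j] = -coef
        d[j] = resid @ resid / n
    # Omega = T' D^{-1} T
    return T.T @ np.diag(1.0 / d) @ T


# Toy data (simulated): 500 observations of 5 correlated variables.
toy_rng = np.random.default_rng(0)
X = toy_rng.normal(size=(500, 5))
X[:, 1] += 0.5 * X[:, 0]
omega = cholesky_precision(X)
```

With full-sample least squares this reproduces the inverse of the sample covariance matrix (with divisor n). A sparse estimator would replace each regression with a selection-and-refit step.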

8.
Genes (Basel) ; 13(12)2022 12 01.
Article in English | MEDLINE | ID: mdl-36553532

ABSTRACT

The advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; most hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains a main challenge. Although variability in results may be introduced at various stages, e.g., alignment, summarisation or detection of differential expression, one source of variability was systematically omitted: the sequencing design, which propagates through analyses and may introduce an additional layer of technical variation. We illustrate qualitative and quantitative differences arising from splitting samples across lanes on bulk and single-cell sequencing. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and the peaks' properties. At the single-cell level, we concentrate on identifying cell subpopulations. We rely on markers used for assigning cell identities; both smartSeq and 10× data are presented. The observed reduction in the number of unique sequenced fragments limits the level of detail on which the different prediction approaches depend. Furthermore, the sequencing stochasticity adds in a weighting bias corroborated with variable sequencing depths and (yet unexplained) sequencing bias. Subsequently, we observe an overall reduction in sequencing complexity and a distortion in the biological signal across technologies, experimental contexts, organisms and tissues.


Subjects
High-Throughput Nucleotide Sequencing, Reproducibility of Results, High-Throughput Nucleotide Sequencing/methods
9.
J Am Stat Assoc ; 116(534): 746-755, 2021.
Article in English | MEDLINE | ID: mdl-36776718

ABSTRACT

Different from traditional intra-subject analysis, the goal of inter-subject analysis (ISA) is to explore the dependency structure between different subjects with the intra-subject dependency as nuisance. ISA has important applications in neuroscience to study the functional connectivity between brain regions under natural stimuli. We propose a modeling framework for ISA that is based on Gaussian graphical models, under which ISA can be converted to the problem of estimation and inference of a partial Gaussian graphical model. The main statistical challenge is that we do not impose sparsity constraints on the whole precision matrix and we only assume the inter-subject part is sparse. For estimation, we propose to estimate an alternative parameter to get around the nonsparse issue and it can achieve asymptotic consistency even if the intra-subject dependency is dense. For inference, we propose an "untangle and chord" procedure to de-bias our estimator. It is valid without the sparsity assumption on the inverse Hessian of the log-likelihood function. This inferential method is general and can be applied to many other statistical problems, thus it is of independent theoretical interest. Numerical experiments on both simulated and brain imaging data validate our methods and theory. Supplementary materials for this article are available online.

10.
J R Stat Soc Series B Stat Methodol ; 78(1): 193-210, 2016 01.
Article in English | MEDLINE | ID: mdl-27656104

ABSTRACT

We consider a high dimensional regression model with a possible change point due to a covariate threshold and develop the lasso estimator of regression coefficients as well as the threshold parameter. Our lasso estimator not only selects covariates but also selects a model between linear and threshold regression models. Under a sparsity assumption, we derive non-asymptotic oracle inequalities for both the prediction risk and the ℓ1 estimation loss for regression coefficients. Since the lasso estimator selects variables simultaneously, we show that oracle inequalities can be established without pretesting the existence of the threshold effect. Furthermore, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor that is nearly n^{-1} even when the number of regressors can be much larger than the sample size n. We illustrate the usefulness of our proposed estimation method via Monte Carlo simulations and an application to real data.

11.
J Multivar Anal ; 126: 153-166, 2014 Apr 01.
Article in English | MEDLINE | ID: mdl-25076800

ABSTRACT

Many multiple testing procedures make use of the p-values from the individual pairs of hypothesis tests, and are valid if the p-value statistics are independent and uniformly distributed under the null hypotheses. However, it has recently been shown that these types of multiple testing procedures are inefficient since such p-values do not depend upon all of the available data. This paper provides tools for constructing compound p-value statistics, which are those that depend upon all of the available data, but still satisfy the conditions of independence and uniformity under the null hypotheses. Several examples are provided, including a class of compound p-value statistics for testing location shifts. It is demonstrated, both analytically and through simulations, that multiple testing procedures tend to reject more false null hypotheses when applied to these compound p-values rather than the usual p-values, and at the same time still guarantee the desired type I error rate control. The compound p-values are used to analyze a real microarray data set and allow for more rejected null hypotheses.
