Results 1 - 20 of 20
1.
Ann Appl Probab ; 32(4): 2967-3003, 2022 Aug.
Article in English | MEDLINE | ID: mdl-36034074

ABSTRACT

We study the sample covariance matrix for real-valued data with general population covariance, as well as MANOVA-type covariance estimators in variance components models under null hypotheses of global sphericity. In the limit as matrix dimensions increase proportionally, the asymptotic spectra of such estimators may have multiple disjoint intervals of support, possibly intersecting the negative half line. We show that the distribution of the extremal eigenvalue at each regular edge of the support has a GOE Tracy-Widom limit. Our proof extends a comparison argument of Ji Oon Lee and Kevin Schnelli, replacing a continuous Green function flow by a discrete Lindeberg swapping scheme.

2.
Stat Sin ; 31(2): 571-601, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33833489

ABSTRACT

Sample correlation matrices are widely used, but for high-dimensional data little is known about their spectral properties beyond "null models", which assume the data have independent coordinates. In the class of spiked models, we apply random matrix theory to derive asymptotic first-order and distributional results for both leading eigenvalues and eigenvectors of sample correlation matrices, assuming a high-dimensional regime in which the ratio p/n, of number of variables p to sample size n, converges to a positive constant. While the first-order spectral properties of sample correlation matrices match those of sample covariance matrices, their asymptotic distributions can differ significantly. Indeed, the correlation-based fluctuations of both sample eigenvalues and eigenvectors are often remarkably smaller than those of their sample covariance counterparts.

3.
Ann Stat ; 47(5): 2855-2886, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31462837

ABSTRACT

We study the spectra of MANOVA estimators for variance component covariance matrices in multivariate random effects models. When the dimensionality of the observations is large and comparable to the number of realizations of each random effect, we show that the empirical spectra of such estimators are well-approximated by deterministic laws. The Stieltjes transforms of these laws are characterized by systems of fixed-point equations, which are numerically solvable by a simple iterative procedure. Our proof uses operator-valued free probability theory, and we establish a general asymptotic freeness result for families of rectangular orthogonally-invariant random matrices, which is of independent interest. Our work is motivated in part by the estimation of components of covariance between multiple phenotypic traits in quantitative genetics, and we specialize our results to common experimental designs that arise in this application.
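
The kind of iterative fixed-point procedure mentioned above can be illustrated on a simpler, classical case. The sketch below does not implement the paper's system of equations for MANOVA estimators. It assumes instead the Marchenko-Pastur (Silverstein) equation for an ordinary sample covariance matrix with aspect ratio γ = p/n and a discrete population spectrum, v(z) = -1 / (z - γ Σ_k w_k t_k / (1 + t_k v(z))), where v is the companion Stieltjes transform. The function name `companion_stieltjes` and its parameters are hypothetical.

```python
def companion_stieltjes(z, gamma, pop_eigs, weights, tol=1e-12, max_iter=10_000):
    """Solve the Silverstein fixed-point equation by plain iteration.

    Illustrative assumption: population spectrum given as atoms `pop_eigs`
    with probabilities `weights`; `z` must have positive imaginary part.
    """
    if z.imag <= 0:
        raise ValueError("z must lie in the upper half-plane")
    v = -1.0 / z  # large-|z| asymptotic, already in the upper half-plane
    for _ in range(max_iter):
        # Weighted sum over the population eigenvalues.
        s = sum(w * t / (1.0 + t * v) for t, w in zip(pop_eigs, weights))
        v_new = -1.0 / (z - gamma * s)
        if abs(v_new - v) < tol:
            return v_new
        v = v_new
    return v  # best available iterate if the tolerance was not reached


# Example: identity population covariance, gamma = 0.5, evaluated at z = 1 + i.
z = complex(1.0, 1.0)
v = companion_stieltjes(z, 0.5, [1.0], [1.0])
```

Each update maps the upper half-plane into itself, which is why the iteration is started at -1/z. The spectral density can then be approximated from the imaginary part of the transform at points z close to the real axis, where convergence is typically slower.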

4.
Proc IEEE Inst Electr Electron Eng ; 106(8): 1277-1292, 2018 Aug.
Article in English | MEDLINE | ID: mdl-30287970

ABSTRACT

When the data are high dimensional, widely used multivariate statistical methods such as principal component analysis can behave in unexpected ways. In settings where the dimension of the observations is comparable to the sample size, upward bias in sample eigenvalues and inconsistency of sample eigenvectors are among the most notable phenomena that appear. These phenomena, and the limiting behavior of the rescaled extreme sample eigenvalues, have recently been investigated in detail under the spiked covariance model. The behavior of the bulk of the sample eigenvalues under weak distributional assumptions on the observations has been described. These results have been exploited to develop new estimation and hypothesis testing methods for the population covariance matrix. Furthermore, partly in response to these phenomena, alternative classes of estimation procedures have been developed by exploiting sparsity of the eigenvectors or the covariance matrix. This paper gives an orientation to these areas.

5.
Ann Stat ; 46(4): 1742-1778, 2018 Aug.
Article in English | MEDLINE | ID: mdl-30258255

ABSTRACT

We show that in a common high-dimensional covariance model, the choice of loss function has a profound effect on optimal estimation. In an asymptotic framework based on the Spiked Covariance model and use of orthogonally invariant estimators, we show that optimal estimation of the population covariance matrix boils down to design of an optimal shrinker η that acts elementwise on the sample eigenvalues. Indeed, to each loss function there corresponds a unique admissible eigenvalue shrinker η* dominating all other shrinkers. The shape of the optimal shrinker is determined by the choice of loss function and, crucially, by inconsistency of both eigenvalues and eigenvectors of the sample covariance matrix. Details of these phenomena and closed form formulas for the optimal eigenvalue shrinkers are worked out for a menagerie of 26 loss functions for covariance estimation found in the literature, including the Stein, Entropy, Divergence, Fréchet, Bhattacharya/Matusita, Frobenius Norm, Operator Norm, Nuclear Norm and Condition Number losses.
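
To make the idea of a shrinker acting on each sample eigenvalue concrete, here is a minimal sketch of one simple choice. It assumes the standard spiked-model relation for unit noise variance, in which a population eigenvalue ℓ above 1 + √γ produces a limiting sample eigenvalue λ = ℓ(1 + γ/(ℓ - 1)). The function inverts that relation to remove the upward bias. The name `debias_eigenvalue`, and the choice to map everything at or below the bulk edge to 1, are assumptions of this sketch rather than formulas quoted from the abstract.

```python
import math


def debias_eigenvalue(lam, gamma):
    """Map a sample eigenvalue to the population value that would produce it.

    Assumes unit noise variance and aspect ratio gamma = p/n.
    Sample eigenvalues at or below the bulk edge carry no recoverable
    signal in this model, so they are set to 1 (sketch assumption).
    """
    bulk_edge = (1.0 + math.sqrt(gamma)) ** 2
    if lam <= bulk_edge:
        return 1.0
    # Solve ell^2 - (lam + 1 - gamma) * ell + lam = 0 for the larger root.
    b = lam + 1.0 - gamma
    return 0.5 * (b + math.sqrt(b * b - 4.0 * lam))


def shrink_spectrum(sample_eigs, gamma):
    """Apply the shrinker elementwise to a list of sample eigenvalues."""
    return [debias_eigenvalue(lam, gamma) for lam in sample_eigs]
```

For instance, with γ = 0.5 a population eigenvalue of 3 gives a limiting sample eigenvalue of 3.75, and `debias_eigenvalue(3.75, 0.5)` maps it back to 3. Shrinkers tuned to a specific loss would, in general, also account for eigenvector inconsistency, which this sketch ignores.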

6.
Stat Sin ; 28(4): 2541-2564, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30886511

ABSTRACT

We study improved approximations to the distribution of the largest eigenvalue ℓ̂ of the sample covariance matrix of n zero-mean Gaussian observations in dimension p + 1. We assume that one population principal component has variance ℓ > 1 and the remaining 'noise' components have common variance 1. In the high-dimensional limit p/n → γ > 0, we study Edgeworth corrections to the limiting Gaussian distribution of ℓ̂ in the supercritical case ℓ > 1 + √γ. The skewness correction involves a quadratic polynomial, as in classical settings, but the coefficients reflect the high-dimensional structure. The methods involve Edgeworth expansions for sums of independent non-identically distributed variates obtained by conditioning on the sample noise eigenvalues, and the limiting bulk properties and fluctuations of these noise eigenvalues.

7.
Ann Stat ; 43(3): 937-961, 2015.
Article in English | MEDLINE | ID: mdl-26448678

ABSTRACT

We consider estimating the predictive density under Kullback-Leibler loss in an ℓ₀ sparse Gaussian sequence model. Explicit expressions of the first order minimax risk along with its exact constant, asymptotically least favorable priors and optimal predictive density estimates are derived. Compared to the sparse recovery results involving point estimation of the normal mean, new decision theoretic phenomena are seen. Suboptimal performance of the class of plug-in density estimates reflects the predictive nature of the problem and optimal strategies need diversification of the future risk. We find that minimax optimal strategies lie outside the Gaussian family but can be constructed with threshold predictive density estimates. Novel minimax techniques involving simultaneous calibration of the sparsity adjustment and the risk diversification mechanisms are used to design optimal predictive density estimates.

8.
Proc Natl Acad Sci U S A ; 109(50): 20667-72, 2012 Dec 11.
Article in English | MEDLINE | ID: mdl-23188796

ABSTRACT

"Bulk" measurements of antiviral innate immune responses from pooled cells yield averaged signals and do not reveal underlying signaling heterogeneity in infected and bystander single cells. We examined such heterogeneity in the small intestine during rotavirus (RV) infection. Murine RV EW robustly activated type I IFNs and several antiviral genes (IFN-stimulated genes) in the intestine by bulk analysis, the source of induced IFNs primarily being hematopoietic cells. Flow cytometry and microfluidics-based single-cell multiplex RT-PCR allowed dissection of IFN responses in single RV-infected and bystander intestinal epithelial cells (IECs). EW replicates in IEC subsets differing in their basal type I IFN transcription and induces IRF3-dependent and IRF3-augmented transcription, but not NF-κB-dependent or type I IFN transcripts. Bystander cells did not display enhanced type I IFN transcription but had elevated levels of certain IFN-stimulated genes, presumably in response to exogenous IFNs secreted from immune cells. Comparison of IRF3 and NF-κB induction in STAT1(-/-) mice revealed that murine but not simian RRV mediated accumulation of IκB-α protein and decreased transcription of NF-κB-dependent genes. RRV replication was significantly rescued in IFN types I and II, as well as STAT1 (IFN types I, II, and III) deficient mice in contrast to EW, which was only modestly sensitive to IFNs I and II. Resolution of "averaged" innate immune responses in single IECs thus revealed unexpected heterogeneity in both the induction and subversion of early host antiviral immunity, which modulated host range.


Subjects
Intestinal Mucosa/immunology , Intestinal Mucosa/virology , Rotavirus Infections/immunology , Rotavirus Infections/virology , Animals , Immunity, Innate/genetics , Interferon Regulatory Factor-3/immunology , Interferon Type I/biosynthesis , Intestinal Mucosa/metabolism , Intestine, Small/immunology , Intestine, Small/metabolism , Intestine, Small/virology , Mice , Mice, 129 Strain , Receptors, Interferon/metabolism , Rotavirus/immunology , Rotavirus/pathogenicity , Rotavirus Infections/genetics , Rotavirus Infections/metabolism , STAT1 Transcription Factor/metabolism
9.
Biostatistics ; 13(3): 523-38, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22003245

ABSTRACT

We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.


Subjects
Data Interpretation, Statistical , Models, Statistical , RNA, Messenger/genetics , Sequence Analysis, DNA/methods , Humans , RNA, Messenger/chemistry , Reverse Transcriptase Polymerase Chain Reaction
10.
Ann Stat ; 41(3): 1055-1084, 2013 Jun.
Article in English | MEDLINE | ID: mdl-25324581

ABSTRACT

We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the ℓ₂ loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.

11.
Ann Appl Probab ; 22(5): 1962-1988, 2012.
Article in English | MEDLINE | ID: mdl-23667298

ABSTRACT

We study the rate of convergence for the largest eigenvalue distributions in the Gaussian unitary and orthogonal ensembles to their Tracy-Widom limits. We show that one can achieve an O(N^{-2/3}) rate with particular choices of the centering and scaling constants. The arguments here also shed light on more complicated cases of Laguerre and Jacobi ensembles, in both unitary and orthogonal versions. Numerical work shows that the suggested constants yield reasonable approximations even for surprisingly small values of N.

12.
Ann Stat ; 36(6): 2638, 2008 Dec 01.
Article in English | MEDLINE | ID: mdl-20157626

ABSTRACT

Let A and B be independent, central Wishart matrices in p variables with common covariance and having m and n degrees of freedom, respectively. The distribution of the largest eigenvalue of (A + B)^{-1}B has numerous applications in multivariate statistics, but is difficult to calculate exactly. Suppose that m and n grow in proportion to p. We show that after centering and scaling, the distribution is approximated to second-order, O(p^{-2/3}), by the Tracy-Widom law. The results are obtained for both complex and then real-valued data by using methods of random matrix theory to study the largest eigenvalue of the Jacobi unitary and orthogonal ensembles. Asymptotic approximations of Jacobi polynomials near the largest zero play a central role.

13.
Aust N Z J Stat ; 60(1): 65-74, 2018 Mar.
Article in English | MEDLINE | ID: mdl-30140164

ABSTRACT

Consider the classical Gaussian unitary ensemble of size N and the real white Wishart ensemble with N variables and n degrees of freedom. In the limits of large N and n, with positive ratio γ in the Wishart case, the expected number of eigenvalues that exit the upper bulk edge is less than one, approaching 0.031 and 0.170 respectively, the latter number being independent of γ. These statements are consequences of quantitative bounds on tail sums of eigenvalues outside the bulk which are established here for applications in high dimensional covariance matrix estimation.

14.
Stat ; 3(1): 240-249, 2014 Jan 01.
Article in English | MEDLINE | ID: mdl-25221357

ABSTRACT

The classical methods of multivariate analysis are based on the eigenvalues of one or two sample covariance matrices. In many applications of these methods, for example to high dimensional data, it is natural to consider alternative hypotheses which are a low rank departure from the null hypothesis. For rank one alternatives, this note provides a representation for the joint eigenvalue density in terms of a single contour integral. This will be of use for deriving approximate distributions for likelihood ratios and 'linear' statistics used in testing.

15.
Inst Math Stat Collect ; 6: 87-98, 2010.
Article in English | MEDLINE | ID: mdl-21113327

ABSTRACT

In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.

16.
Ann Appl Stat ; 3(4): 1616-1633, 2009.
Article in English | MEDLINE | ID: mdl-20526465

ABSTRACT

The greatest root distribution occurs everywhere in classical multivariate analysis, but even under the null hypothesis the exact distribution has required extensive tables or special purpose software. We describe a simple approximation, based on the Tracy-Widom distribution, that in many cases can be used instead of tables or software, at least for initial screening. The quality of approximation is studied, and its use illustrated in a variety of settings.

17.
Philos Trans A Math Phys Eng Sci ; 367(1906): 4237-53, 2009 Nov 13.
Article in English | MEDLINE | ID: mdl-19805443

ABSTRACT

Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.


Subjects
Statistics as Topic/methods , Bayes Theorem , Linear Models
18.
J Am Stat Assoc ; 104(486): 682-693, 2009 Jun 01.
Article in English | MEDLINE | ID: mdl-20617121

ABSTRACT

Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n→0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ⪢ n.
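
The two-step idea described above (keep the coordinates with the largest sample variances, then run PCA on that subset) can be sketched as follows. This is a minimal illustration using only the standard library, not the authors' algorithm: it assumes the data are already expressed in a basis where the signal is sparse, it keeps a fixed number `k` of coordinates rather than using a data-driven rule, and it finds the leading eigenvector by power iteration. All function names are hypothetical.

```python
import random


def sample_variances(X):
    """Per-coordinate sample variances of an n-by-p data matrix (list of rows)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [sum((row[j] - means[j]) ** 2 for row in X) / (n - 1) for j in range(p)]


def leading_eigenvector(S, iters=500):
    """Power iteration for the top eigenvector of a symmetric PSD matrix."""
    k = len(S)
    v = [1.0 / k ** 0.5] * k
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def selected_pca(X, k):
    """Keep the k highest-variance coordinates, then do PCA on them only."""
    n, p = len(X), len(X[0])
    var = sample_variances(X)
    keep = sorted(sorted(range(p), key=lambda j: var[j], reverse=True)[:k])
    means = [sum(row[j] for row in X) / n for j in keep]
    Z = [[row[j] - m for j, m in zip(keep, means)] for row in X]
    # Sample covariance matrix restricted to the selected coordinates.
    S = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(k)] for a in range(k)]
    v = leading_eigenvector(S)
    full = [0.0] * p  # unselected coordinates are estimated as zero
    for j, x in zip(keep, v):
        full[j] = x
    return keep, full


def simulate_spiked(n, p, support, strength, seed=0):
    """Hypothetical test data: unit noise plus one sparse principal component."""
    rng = random.Random(seed)
    amp = (strength / len(support)) ** 0.5
    X = []
    for _ in range(n):
        score = rng.gauss(0.0, 1.0)
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in support:
            row[j] += amp * score
        X.append(row)
    return X
```

A call such as `selected_pca(simulate_spiked(50, 200, [0, 1, 2, 3, 4], 50.0), k=5)` returns the selected coordinate indices and an estimate of the leading component that is zero outside them. In practice a numerical library would replace the hand-written linear algebra.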

19.
Stat Med ; 21(19): 2879-88, 2002 Oct 15.
Article in English | MEDLINE | ID: mdl-12325104

ABSTRACT

Economic endpoints have been increasingly included in long-term clinical trials, but they pose several methodologic challenges, including how best to collect, describe, analyse and interpret medical cost data. Cost of care can be measured by converting billed charges, performing detailed micro-costing studies, or by measuring use of key resources and assigning cost weights to each resource. The latter method is most commonly used, with cost weights based either on empirical regression models or administratively determined reimbursement rates. In long-term studies, monetary units should be adjusted to reflect cost inflation and discounting. The temporal pattern of accumulating costs can be described using a modification of the Kaplan-Meier curve. Regression analyses to evaluate factors associated with cost are best performed on the log of costs due to their typically skewed distribution. Cost-effectiveness analysis attempts to measure the value of a new therapy by calculating the difference in cost between the new therapy and the standard therapy, divided by the difference in benefit between the new therapy and the standard therapy. The cost-effectiveness ratio based on the results of a randomized trial may change substantially with longer follow-up intervals, particularly for therapies that are initially expensive but eventually improve survival. A model that projects long-term patterns of cost and survival expected beyond the end of completed follow-up can provide an important perspective in the setting of limited trial duration.
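
The cost-effectiveness ratio defined above, together with discounting of costs over time, can be written out as a short sketch. The abstract gives no formulas for discounting, so the present-value calculation here is the standard one and is an assumption of this example, as are the function names, the annual discount rate, and all numbers in the usage lines.

```python
def present_value(yearly_amounts, rate):
    """Discount a stream of yearly amounts to year 0.

    yearly_amounts[t] is the amount incurred in year t; year 0 is undiscounted.
    """
    return sum(a / (1.0 + rate) ** t for t, a in enumerate(yearly_amounts))


def cost_effectiveness_ratio(cost_new, cost_std, benefit_new, benefit_std):
    """Difference in cost divided by difference in benefit (new vs standard)."""
    delta_benefit = benefit_new - benefit_std
    if delta_benefit == 0:
        raise ValueError("ratio is undefined when the benefits are equal")
    return (cost_new - cost_std) / delta_benefit


# Hypothetical example: costs over three years at a 3% discount rate,
# benefits measured in life-years.
cost_new = present_value([25000.0, 3000.0, 3000.0], 0.03)
cost_std = present_value([10000.0, 6000.0, 6000.0], 0.03)
ratio = cost_effectiveness_ratio(cost_new, cost_std, 5.5, 5.0)
```

Extending the yearly streams shows how the ratio can shift with longer follow-up: a therapy that is expensive at the start but cheaper in later years sees its cost difference shrink as more years are included.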


Subjects
Models, Economic , Randomized Controlled Trials as Topic/economics , Angioplasty, Balloon, Coronary/economics , Cardiovascular Diseases/economics , Cardiovascular Diseases/surgery , Coronary Artery Bypass/economics , Cost-Benefit Analysis , Data Interpretation, Statistical , Humans , Longitudinal Studies , Quality-Adjusted Life Years
20.
J Urol ; 167(1): 103-11, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11743285

ABSTRACT

PURPOSE: Serum prostate specific antigen (PSA) is widely used as a guide to initiate prostatic biopsies and to follow men older than 50 years with and without prostate cancer. However, benign prostatic hyperplasia (BPH) is a common cause of serum PSA values between 2 and 10 ng./ml. A better understanding of the relationships among serum PSA, prostate cancer and BPH is important. MATERIALS AND METHODS: A total of 875 men underwent radical prostatectomy at our institution between December 1984 and January 1997. Of these men 784 had a serum PSA of 2 to 22 ng./ml., including 579 with the largest cancer located in the peripheral zone of the prostate. Of the 579 men 406 had serum PSA followups for greater than 3 years after radical prostatectomy. We examined Pearson correlations (R²) between preoperative serum PSA and the volume of Gleason grades 4/5 and 3 to 1 cancer in 784 men, separating peripheral zone from transition zone cancers. We used broken line regression with break points of 7 and 9 ng./ml. preoperative PSA to summarize the relationship of each PSA doubling to 5 different morphological variables in 579 men with peripheral zone cancer. A 9 ng./ml. break point was used for prostate weight. Trend summaries with a local regression line for the relationships between 6 morphological variables and PSA were superimposed on full scatterplots of the 579 men with PSA less than 22 ng./ml. Cox proportional hazard models were used to examine 5-year PSA failure-free probabilities based on 406 men with minimal PSA followups greater than 3 years at break points of 7 to 9 ng./ml. PSA. RESULTS: Pearson correlation between cancer volume and preoperative serum PSA in 875 men was weak (R² = 0.27) and driven by large cancers with serum PSA greater than 22 ng./ml. For peripheral zone cancer the overall R² × 100 for 641 men with low and high grade cancer was 10% and only 3% for low grade cancer, that is, almost no PSA produced by these peripheral zone cancers enters the serum. All morphological variables changed at rates of doubtful medical significance below a PSA of 7 to 9 ng./ml. but at rates that were significantly worse above 9 ng./ml. R² for these relationships was never greater than 15%. Large individual morphological variations at all levels of PSA emphasize the serious limitation of PSA as a predictor of prostate cancer morphology. Below 9 ng./ml. prostate weight increased by 21% for each doubling of PSA but above 9 ng./ml. the increase was only 4.8%. CONCLUSIONS: Preoperative serum PSA has a clinically useless relationship with cancer volume and grade in radical prostatectomy specimens, and a limited relationship with PSA cure rates at preoperative serum PSA levels of 2 to 9 ng./ml. Trend summaries for prostate weight on broken line regression showed that below 9 ng./ml. BPH is a strong contender for the cause of PSA elevation, constituting the primary cause of the overdiagnosis of prostate cancer.


Subjects
Biomarkers, Tumor/blood , Prostate-Specific Antigen/blood , Prostatectomy , Prostatic Neoplasms/diagnosis , Age Factors , Follow-Up Studies , Humans , Male , Middle Aged , Proportional Hazards Models , Prostatic Hyperplasia/blood , Prostatic Neoplasms/pathology , Prostatic Neoplasms/surgery