ABSTRACT
A novel feature screening method is proposed to examine the correlation between latent responses and potential predictors in ultrahigh-dimensional data analysis. First, a confirmatory factor analysis (CFA) model is used to characterize latent responses through multiple observed variables. The expectation-maximization algorithm is employed to estimate the parameters in the CFA model. Second, R-Vector (RV) correlation is used to measure the dependence between the multivariate latent responses and covariates of interest. Third, a feature screening procedure is proposed on the basis of an unbiased estimator of the RV coefficient. The sure screening property of the proposed screening procedure is established under certain mild conditions. Monte Carlo simulations are conducted to assess the finite-sample performance of the feature screening procedure. The proposed method is applied to an investigation of the relationship between psychological well-being and the human genome.
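The screening idea in this abstract can be illustrated with a minimal sketch, under loud assumptions: it uses the classical (biased) sample RV coefficient rather than the unbiased estimator the authors rely on, and it treats the multivariate response `Y` as observed instead of estimating latent responses from a CFA model. The function names `rv_coefficient` and `screen_by_rv` are hypothetical.

```python
import numpy as np

def rv_coefficient(X, Y):
    """Classical (biased) sample RV coefficient between two blocks sharing rows."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Accept a single predictor passed as a 1-D vector.
    X = X.reshape(X.shape[0], -1)
    Y = Y.reshape(Y.shape[0], -1)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # tr(Sxy Syx) is the squared Frobenius norm of the cross-product matrix;
    # the 1/(n-1) scaling cancels between numerator and denominator.
    num = np.sum((Xc.T @ Yc) ** 2)
    den = np.sqrt(np.sum((Xc.T @ Xc) ** 2) * np.sum((Yc.T @ Yc) ** 2))
    return num / den

def screen_by_rv(X, Y, d):
    """Rank each column of X by its RV coefficient with Y and keep the top d."""
    scores = np.array([rv_coefficient(X[:, j], Y) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:d]
    return keep, scores
```

With many predictors and few samples, the biased estimator used here tends to overstate weak associations, which is presumably one motivation for the unbiased estimator mentioned in the abstract.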
Subjects
Algorithms, Human Genome, Humans, Monte Carlo Method, Factor Analysis
ABSTRACT
BACKGROUND: With the advance of high-throughput sequencing, high-dimensional data are generated. Detecting dependence/correlation between these datasets is becoming one of the most important issues in multi-dimensional data integration and co-expression network construction. RNA-sequencing data are widely used to construct gene regulatory networks. Such networks could be more accurate when methylation data, copy number aberration data and other types of data are introduced. Consequently, a general index for detecting relationships between high-dimensional data is indispensable. RESULTS: We proposed a Kernel-Based RV-coefficient, named KBRV, for testing both linear and nonlinear correlation between two matrices by introducing kernel functions into RV2 (the modified RV-coefficient). Permutation tests and other validation methods were used on simulated data to test the significance and rationality of KBRV. To demonstrate the advantages of KBRV in constructing gene regulatory networks, we applied this index to real datasets (ovarian cancer datasets and exon-level RNA-Seq data in human myeloid differentiation) to illustrate its superiority over vector correlation. CONCLUSIONS: We concluded that KBRV is an efficient index for detecting both linear and nonlinear relationships in high-dimensional data. The correlation method for high-dimensional data has possible applications in the construction of gene regulatory networks.
Subjects
Gene Regulatory Networks, Ovarian Neoplasms, Female, Genomics, High-Throughput Nucleotide Sequencing, Humans, Ovarian Neoplasms/genetics, RNA Sequence Analysis
ABSTRACT
BACKGROUND/AIMS: Alzheimer's disease (AD) is a chronic neurodegenerative disease that causes memory loss and a decline in cognitive abilities. AD is the sixth leading cause of death in the USA, affecting an estimated 5 million Americans. To assess the association between multiple genetic variants and multiple measurements of structural changes in the brain, a recent study of AD used a multivariate measure of linear dependence, the RV coefficient. The authors decomposed the RV coefficient into contributions from individual variants and displayed these contributions graphically. METHODS: We investigate the properties of such a "contribution plot" in terms of an underlying linear model, and discuss shrinkage estimation of the components of the plot when the correlation signal may be sparse. RESULTS: The contribution plot is applied to simulated data and to genomic and brain imaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). CONCLUSIONS: The contribution plot with shrinkage estimation can reveal truly associated explanatory variables.
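One way to picture a per-variant decomposition of the RV coefficient is sketched below. It relies on the fact that the RV numerator, tr(XX'YY') for column-centered blocks, is a sum over the columns of X of the squared norms of Y'x_j. Whether this matches the exact decomposition used in the cited study is an assumption, the function name `rv_contributions` is hypothetical, and the shrinkage step discussed in the abstract is not shown.

```python
import numpy as np

def rv_contributions(X, Y):
    """Split the sample RV coefficient into additive per-column terms for X.

    Returns a vector c with c.sum() equal to RV(X, Y); entry j is the share
    attributable to column j of X (for example, one genetic variant).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross = Xc.T @ Yc                       # p x q cross-product matrix
    per_column = np.sum(cross ** 2, axis=1)  # ||Yc' x_j||^2 for each column j
    den = np.sqrt(np.sum((Xc.T @ Xc) ** 2) * np.sum((Yc.T @ Yc) ** 2))
    return per_column / den
```

Plotting the returned vector against variant index gives a simple bar-style contribution plot; because every term is non-negative, noise columns contribute small positive amounts, which is where shrinkage would come in.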
Subjects
Alzheimer Disease/diagnostic imaging, Alzheimer Disease/genetics, Biomarkers/metabolism, Brain/diagnostic imaging, Neuroimaging, Apolipoproteins E/genetics, Computer Simulation, Genotype, Humans, Phenotype, Single Nucleotide Polymorphism/genetics
ABSTRACT
BACKGROUND: Red sufu is a type of sufu produced by solid-state fermentation of soybean curd and coloration with red mold rice. The purposes of this study were: (i) to characterize commercial red sufu samples using quantitative descriptive analysis (QDA) and flash profile (FP) with ten trained and ten untrained panelists, respectively; (ii) to compare panel performance, descriptive abilities and sensory maps between the two methodologies; and (iii) to compare the efficiency of QDA and FP using red sufu as the matrix. Multivariate analysis techniques were used to explore the data. RESULTS: Results from generalized Procrustes analysis (GPA) showed that panel performance by QDA was more repeatable and reached higher homogeneity than that by FP. Although the confidence ellipses showed that the 12 red sufus were better discriminated by QDA, the RV coefficient between the two-dimensional configurations (F1 and F2) of the two methodologies was high (RV = 0.852), indicating that the two methods are similar and closely related. Overall, QDA provided more accurate and detailed information, while FP provided a similar sensory map on product location and descriptive results. CONCLUSION: The FP technique appeared to be an efficient alternative approach to quickly evaluate sensory properties, including appearance, flavor, aroma and textural properties of an array of red sufu products. © 2018 Society of Chemical Industry.
Subjects
Fermented Foods/analysis, Soy Foods/analysis, Adult, Female, Fermented Foods/economics, Humans, Male, Soy Foods/economics, Taste, Young Adult
ABSTRACT
Zhan et al. () presented a kernel RV coefficient (KRV) test to evaluate the overall association between host gene expression and microbiome composition, and showed its competitive performance compared to existing methods. In this article, we clarify the close relation of KRV to the existing generalized RV (GRV) coefficient, and show that KRV and GRV have very similar performance. Although the KRV test can control the type I error rate well at the 1% and 5% levels, we show that it can substantially underestimate p-values at small significance levels, leading to significantly inflated type I errors. As a partial remedy, we propose an alternative p-value calculation, which is efficient and more accurate than the KRV p-value at small significance levels. We recommend that small KRV test p-values always be accompanied and verified by the permutation p-value in practice. In addition, we analytically show that KRV can be written as a form of correlation coefficient, which can dramatically expedite its computation and make permutation p-value calculation more efficient.
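The correlation form and the permutation check mentioned in this abstract can be sketched as follows. This is an illustration under assumptions, not the authors' derivation: it uses the standard definition KRV = tr(K̃L̃) / sqrt(tr(K̃²) tr(L̃²)) with double-centered symmetric kernel matrices, and notes that because the entries of a double-centered matrix sum to zero, this ratio equals the Pearson correlation of the vectorized centered kernels. All function names are hypothetical.

```python
import numpy as np

def center_kernel(K):
    """Double-center a symmetric kernel matrix: H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def krv(K, L):
    """KRV statistic from its trace definition."""
    Kc, Lc = center_kernel(K), center_kernel(L)
    return np.trace(Kc @ Lc) / np.sqrt(np.trace(Kc @ Kc) * np.trace(Lc @ Lc))

def krv_as_correlation(K, L):
    """Same statistic computed as a Pearson correlation of vectorized kernels."""
    Kc, Lc = center_kernel(K), center_kernel(L)
    return np.corrcoef(Kc.ravel(), Lc.ravel())[0, 1]

def krv_permutation_pvalue(K, L, n_perm=999, seed=0):
    """Permutation p-value: jointly permute rows and columns of one kernel."""
    rng = np.random.default_rng(seed)
    Kc, Lc = center_kernel(K), center_kernel(L)
    # The normalizer is invariant to permutation, so it is computed once.
    norm = np.sqrt(np.sum(Kc ** 2) * np.sum(Lc ** 2))
    observed = np.sum(Kc * Lc) / norm
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(K.shape[0])
        stat = np.sum(Kc[np.ix_(perm, perm)] * Lc) / norm
        exceed += int(stat >= observed)
    return (exceed + 1) / (n_perm + 1)
```

Using elementwise products instead of matrix products inside the loop is what makes each permutation cost O(n²) rather than O(n³). The smallest attainable p-value is 1 / (n_perm + 1), so verifying very small p-values needs a correspondingly large number of permutations.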
Subjects
Microbiota/genetics, Statistical Models, Ecology, Research Design, Spatial Analysis
ABSTRACT
To fully understand the role of the microbiome in human health and diseases, researchers are increasingly interested in assessing the relationship between microbiome composition and host genomic data. The dimensionality of the data as well as complex relationships between microbiota and host genomics pose considerable challenges for analysis. In this article, we apply a kernel RV coefficient (KRV) test to evaluate the overall association between host gene expression and microbiome composition. The KRV statistic can capture nonlinear correlations and complex relationships among the individual data types and between gene expression and microbiome composition through measuring general dependency. Testing proceeds via a similar route as existing tests of the generalized RV coefficients and allows for rapid p-value calculation. Strategies to allow adjustment for confounding effects, which is crucial for avoiding misleading results, and to alleviate the problem of selecting the most favorable kernel are considered. Simulation studies show that KRV is useful in testing statistical independence with finite samples provided the kernels are appropriately chosen, and can powerfully identify existing associations between microbiome composition and host genomic data while protecting type I error. We apply the KRV to a microbiome study examining the relationship between host transcriptome and microbiome composition within the context of inflammatory bowel disease and are able to derive new biological insights and provide formal inference on prior qualitative observations.
Subjects
Microbiota/genetics, Statistical Models, Computer Simulation, Host-Pathogen Interactions, Humans, Inflammatory Bowel Diseases/genetics, Spatial Analysis, Transcriptome
ABSTRACT
The existing functional connectivity assessment techniques rely on different mathematical and neuro-physiological models. They may consequently provide different sets of spatial connectivity maps and associated temporal responses within their significant spatiotemporal sets of components. Note that the word component is used to generically refer to spatio-temporal pairs of maps and associated time courses. Such differences may confound the application of functional connectivity measurements in neuroscientific and clinical applications. Using several performance metrics, we evaluated six fMRI resting-state connectivity measurement techniques including three fully exploratory techniques: 1) Melodic-Independent Component Analysis (ICA), 2) agnostic Canonical Variates Analysis (aCVA), and 3) generalized Canonical Correlation Analysis (gCCA); and three seed-based techniques: 1) seed gCCA (sgCCA) and 2, 3) seed Partial Least Squares (sPLS) with a posterior cingulate seed and two different time-series normalizations. We separately assessed the temporal and spatial domains for: 1) technique stability as a function of sample size using RV coefficients, and 2) subspace component similarity between pairs of techniques using CCA. Overall gCCA was the only technique that displayed high temporal and spatial stabilities, together with high spatial and temporal subspace similarities with multiple other techniques. ICA, aCVA and sgCCA tended to be the most stable spatially and produced similar spatial subspaces. All techniques produced relatively unstable and dissimilar temporal subspaces, except sPLS that produced relatively high temporal and lower spatial subspace stabilities, but with unique power-spectral Hurst coefficients ≪ 1. Our results indicate that spatial maps from resting state data sets are much less dependent on the analysis technique used than are the associated time series.
Such temporal variability is coupled with individual spatial component maps, which may be quite dissimilar across techniques even with similar spatial subspaces. Therefore, we suggest that consensus estimation approaches, i.e. a 2nd-level gCCA, would have great utility to produce and aid interpretation of stable results from BOLD fMRI resting state data analysis.
Subjects
Algorithms, Brain Mapping/methods, Computer-Assisted Image Processing/methods, Magnetic Resonance Imaging/methods, Adult, Brain/physiology, Female, Humans, Male, Neural Pathways/physiology, Young Adult
ABSTRACT
Recently developed hierarchical community models (HCMs) accounting for incomplete sampling are promising approaches to understand community organization. However, pros and cons of incorporating incomplete sampling in the analysis and related design issues remain unknown. In this study, we compared HCM and canonical redundancy analysis (RDA) carried out with 10 different dissimilarity coefficients to evaluate how each approach restores true community abundance data sampled with imperfect detection. We conducted simulation experiments with varying numbers of sampling sites, visits, mean detectability and mean abundance. Performance of HCM was measured by estimates of "expected" (mean) abundance ($\hat{\lambda}_{ij}$) and realized abundance ($\hat{N}_{ij}$: direct estimate of site- and species-specific abundance). We also compared HCM and different types of RDA (normal, partial, and weighted), all performed with the same ten different dissimilarity coefficients, with unequal number of visits to sampling sites. In addition, we applied the models to a virtual survey carried out on the Barro Colorado Island tree plot data for which we know true community abundance. Simulation experiments showed that $\hat{N}_{ij}$ yielded by HCM best restored the underlying abundance of constituent species among 12 abundance estimates by HCM and RDA, regardless of whether the sampling was equal or unequal. Mean abundance predominantly affected the performance of HCM and RDA, while $\hat{\lambda}_{ij}$ yielded by HCM had comparable performance to percentage difference and Gower dissimilarity coefficients of RDA. Relative performance of RDA types depended on the combination of dissimilarity coefficients and the distribution of sampling effort.
Best performance of $\hat{N}_{ij}$, followed by $\hat{\lambda}_{ij}$, percentage difference and Gower dissimilarity, was also observed for the analysis of tree plot data, and graphical plots (triplots) based on $\hat{\lambda}_{ij}$ rather than $\hat{N}_{ij}$ clearly separated the effects of two environmental covariates on the abundance of constituent species. Under our conditions of model evaluation and the method, we concluded that, in terms of assessing the environmental dependence of abundance, HCMs and RDA can have comparable performance if we can choose appropriate dissimilarity coefficients for RDA. However, since HCMs provide straightforward biological interpretations of parameter estimates and flexibility of the analysis, HCMs would be useful in many situations as well as conventional canonical ordinations.
Subjects
Biological Models, Colorado
ABSTRACT
Simple correlation coefficients between two variables have been generalized to measure association between two matrices in many ways. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and kernel-based coefficients are being used by different research communities. Scientists use these coefficients to test whether two random vectors are linked. Once testing has established that such an association exists, the next step, often ignored, is to explore and uncover the association's underlying patterns. This article provides a survey of various measures of dependence between random vectors and tests of independence and emphasizes the connections and differences between the various approaches. After providing definitions of the coefficients and associated tests, we present the recent improvements that enhance their statistical properties and ease of interpretation. We summarize multi-table approaches and provide scenarios where the indices can provide useful summaries of heterogeneous multi-block data. We illustrate these different strategies on several examples of real data and suggest directions for future research.
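As a companion to the coefficients named in this survey, the sample distance correlation can be sketched in a few lines. This is the standard biased (V-statistic) estimator built from double-centered Euclidean distance matrices; it is offered as an illustration only, the function names are hypothetical, and the survey may discuss bias-corrected variants that are not shown here.

```python
import numpy as np

def _centered_distances(X):
    """Pairwise Euclidean distance matrix, double-centered by rows and columns."""
    X = np.asarray(X, dtype=float)
    X = X.reshape(X.shape[0], -1)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt(np.sum(diff ** 2, axis=-1))
    return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()

def distance_correlation(X, Y):
    """Biased sample distance correlation between two blocks sharing rows."""
    A = _centered_distances(X)
    B = _centered_distances(Y)
    dcov2 = max(np.mean(A * B), 0.0)   # guard against tiny negative round-off
    dvar2 = np.sqrt(np.mean(A * A) * np.mean(B * B))
    if dvar2 == 0.0:
        return 0.0                     # a constant block carries no dependence
    return float(np.sqrt(dcov2 / dvar2))
```

The structure mirrors the RV coefficient: both are normalized inner products of centered n-by-n matrices, with distance matrices here in place of cross-product matrices, which is the kind of connection between approaches the survey emphasizes.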