Results 1 - 7 of 7
1.
Nat Methods ; 21(5): 835-845, 2024 May.
Article in English | MEDLINE | ID: mdl-38374265

ABSTRACT

Modern multiomic technologies can generate deep multiscale profiles. However, differences in data modalities, multicollinearity of the data, and large numbers of irrelevant features make analyses and integration of high-dimensional omic datasets challenging. Here we present Significant Latent Factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors and the corresponding inference, and has rigorous false discovery rate control. Using SLIDE on single-cell and spatial omic datasets, we uncovered significant interacting latent factors underlying a range of molecular, cellular and organismal phenotypes. SLIDE outperforms, or performs at least as well as, a wide range of state-of-the-art approaches, including other latent factor approaches. More importantly, it provides biological inference beyond prediction that other methods do not afford. Thus, SLIDE is a versatile engine for biological discovery from modern multiomic datasets.


Subject(s)
Machine Learning, Humans, Computational Biology/methods, Animals, Single-Cell Analysis/methods, Algorithms
2.
Ann Stat ; 48(1): 111-137, 2020 Feb.
Article in English | MEDLINE | ID: mdl-35847529

ABSTRACT

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, …, X_p) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
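
The notion of similarity in the abstract above (two variables are alike when they have similar associations with all other variables) can be illustrated with a small sketch. This is not the authors' COD or PECOK algorithm: the function names, the greedy grouping rule, and the threshold are all assumptions made for illustration, and the covariance matrix is a toy example.

```python
def association_dissimilarity(S, a, b):
    """Largest gap between the covariances of variables a and b with any third variable.

    S is a covariance matrix given as a list of lists. A value near zero
    means a and b relate to every other variable in nearly the same way.
    """
    others = [c for c in range(len(S)) if c != a and c != b]
    return max((abs(S[a][c] - S[b][c]) for c in others), default=0.0)


def greedy_clusters(S, threshold):
    """Hypothetical grouping rule: join the first cluster whose first member is
    within `threshold` of the variable, otherwise start a new cluster."""
    clusters = []
    for a in range(len(S)):
        for cluster in clusters:
            if association_dissimilarity(S, a, cluster[0]) <= threshold:
                cluster.append(a)
                break
        else:
            clusters.append([a])
    return clusters


# Toy block-structured covariance: variables {0, 1} and {2, 3} form two groups.
TOY_S = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

if __name__ == "__main__":
    print(greedy_clusters(TOY_S, threshold=0.3))
```

In practice the matrix would be estimated from data and the threshold would have to account for sampling noise, which the abstract addresses through its minimax separation thresholds.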

3.
BMC Genomics ; 16: 263, 2015 Apr 03.
Article in English | MEDLINE | ID: mdl-25887568

ABSTRACT

BACKGROUND: With the explosion of genomic data over the last decade, there has been a tremendous amount of effort to understand the molecular basis of cancer using informatics approaches. However, this has proven to be extremely difficult primarily because of the varied etiology and vast genetic heterogeneity of different cancers and even within the same cancer. One particularly challenging problem is to predict prognostic outcome of the disease for different patients. RESULTS: Here, we present ENCAPP, an elastic-net-based approach that combines the reference human protein interactome network with gene expression data to accurately predict prognosis for different human cancers. Our method identifies functional modules that are differentially expressed between patients with good and bad prognosis and uses these to fit a regression model that can be used to predict prognosis for breast, colon, rectal, and ovarian cancers. Using this model, ENCAPP can also identify prognostic biomarkers with a high degree of confidence, which can be used to generate downstream mechanistic and therapeutic insights. CONCLUSION: ENCAPP is a robust method that can accurately predict prognostic outcome and identify biomarkers for different human cancers.


Subject(s)
Tumor Biomarkers/metabolism, Neoplasms/diagnosis, Neoplasms/metabolism, Software, Computational Biology, Gene Expression, Humans, Neoplasms/genetics, Prognosis, Protein Interaction Maps
4.
Patterns (N Y) ; 3(5): 100473, 2022 May 13.
Article in English | MEDLINE | ID: mdl-35607614

ABSTRACT

High-dimensional cellular and molecular profiling of biological samples highlights the need for analytical approaches that can integrate multi-omic datasets to generate prioritized causal inferences. Current methods are limited by high dimensionality of the combined datasets, the differences in their data distributions, and their integration to infer causal relationships. Here, we present Essential Regression (ER), a novel latent-factor-regression-based interpretable machine-learning approach that addresses these problems by identifying latent factors and their likely cause-effect relationships with system-wide outcomes/properties of interest. ER can integrate many multi-omic datasets without structural or distributional assumptions regarding the data. It outperforms a range of state-of-the-art methods in terms of prediction. ER can be coupled with probabilistic graphical modeling, thereby strengthening the causal inferences. The utility of ER is demonstrated using multi-omic system immunology datasets to generate and validate novel cellular and molecular inferences in a wide range of contexts including immunosenescence and immune dysregulation.

5.
Neuroimage ; 55(4): 1519-27, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21167288

ABSTRACT

The goals of this paper are to review the most popular methods of predictor selection in regression models, to explain why some fail when the number P of explanatory variables exceeds the number N of participants, and to discuss alternative statistical methods that can be employed in this case. We focus on penalized least squares methods in regression models, and discuss in detail two such methods that are well established in the statistical literature, the LASSO and Elastic Net. We introduce bootstrap enhancements of these methods, the BE-LASSO and BE-Enet, that allow the user to attach a measure of uncertainty to each variable selected. Our work is motivated by a multimodal neuroimaging dataset that consists of morphometric measures (volumes at several anatomical regions of interest), white matter integrity measures from diffusion weighted data (fractional anisotropy, mean diffusivity, axial diffusivity and radial diffusivity) and clinical and demographic variables (age, education, alcohol and drug history). In this dataset, the number P of explanatory variables exceeds the number N of participants. We use the BE-LASSO and BE-Enet to provide the first statistical analysis that allows the assessment of neurocognitive performance from high-dimensional neuroimaging and clinical predictors, including their interactions. The major novelty of this analysis is that biomarker selection and dimension reduction are accomplished with a view towards obtaining good predictions for the outcome of interest (i.e., the neurocognitive indices), unlike principal component analysis, which is performed only on the predictor space, independently of the outcome of interest.


Subject(s)
Brain/pathology, Cognition Disorders/etiology, Cognition Disorders/pathology, Diffusion Magnetic Resonance Imaging/methods, HIV Infections/complications, HIV Infections/pathology, Computer-Assisted Image Interpretation/methods, Adult, Aged, Algorithms, Female, Humans, Image Enhancement/methods, Least-Squares Analysis, Male, Middle Aged, Regression Analysis, Reproducibility of Results, Sensitivity and Specificity
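
The idea in entry 5 of attaching a measure of uncertainty to each selected variable can be sketched as follows. The abstract does not spell out the BE-LASSO procedure, so this is only one plausible reading: it refits a LASSO on bootstrap resamples and reports how often each predictor receives a nonzero coefficient. The coordinate-descent solver, the function names, the penalty value, and the simulated data are all assumptions, not the authors' implementation.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100, tol=1e-6):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    X is a list of n rows with p entries each; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # skip an all-zero column
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def bootstrap_selection_frequency(X, y, lam, n_boot=20, seed=0):
    """Fraction of bootstrap resamples in which each predictor is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


def make_toy_data(n=20, p=30, seed=1):
    """Simulated P > N data: only the first two predictors drive the outcome."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 3.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    freq = bootstrap_selection_frequency(X, y, lam=0.5)
    print([round(f, 2) for f in freq])
```

A predictor selected in nearly every resample is a more credible candidate than one that appears only occasionally. The same loop would apply to an Elastic Net fit by swapping the solver.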
6.
iScience ; 14: 125-135, 2019 Apr 26.
Article in English | MEDLINE | ID: mdl-30954780

ABSTRACT

LOVE, a robust, scalable latent model-based clustering method for biological discovery, can be used across a range of datasets to generate both overlapping and non-overlapping clusters. In our formulation, a cluster comprises variables associated with the same latent factor and is determined from an allocation matrix that indexes our latent model. We prove that the allocation matrix and corresponding clusters are uniquely defined. We apply LOVE to biological datasets (gene expression, serological responses measured from HIV controllers and chronic progressors, vaccine-induced humoral immune responses) resulting in meaningful biological output. For all three datasets, the clusters generated by LOVE remain stable across tuning parameters. Finally, we compared LOVE's performance to that of 13 state-of-the-art methods using previously established benchmarks and found that LOVE outperformed these methods across datasets. Our results demonstrate that LOVE can be broadly used across large-scale biological datasets to generate accurate and meaningful overlapping and non-overlapping clusters.

7.
J Am Stat Assoc ; 111(514): 834-845, 2016.
Article in English | MEDLINE | ID: mdl-28042189

ABSTRACT

We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator which tapers the sample covariance matrix by a Toeplitz, sparsely-banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estimator enjoys theoretical optimality properties not attained by previous banding or tapered estimators. In particular, our convex banding estimator is minimax rate adaptive in Frobenius and operator norms, up to log factors, over commonly-studied classes of covariance matrices, and over more general classes. Furthermore, it correctly recovers the bandwidth when the true covariance is exactly banded. Our convex formulation admits a simple and efficient algorithm. Empirical studies demonstrate its practical effectiveness and illustrate that our exactly-banded estimator works well even when the true covariance matrix is only close to a banded matrix, confirming our theoretical results. Our method compares favorably with all existing methods, in terms of accuracy and speed. We illustrate the practical merits of the convex banding estimator by showing that it can be used to improve the performance of discriminant analysis for classifying sound recordings.
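
The tapering operation mentioned in the abstract above, in which the sample covariance matrix is multiplied elementwise by a Toeplitz, banded matrix, can be sketched as follows. The data-adaptive weights that the convex banding estimator obtains from its optimization problem are not reproduced here. A fixed linear taper with a user-chosen bandwidth stands in for them, and all function names are assumptions made for illustration.

```python
def sample_covariance(rows):
    """Unbiased sample covariance of n observations (rows) of p ordered variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def toeplitz_taper(p, bandwidth):
    """Toeplitz weight matrix: entry (i, j) depends only on |i - j|.

    Assumed stand-in weights: a linear decay that is exactly zero
    once |i - j| exceeds the bandwidth.
    """
    return [
        [max(0.0, 1.0 - abs(i - j) / (bandwidth + 1)) for j in range(p)]
        for i in range(p)
    ]


def taper_covariance(S, W):
    """Elementwise (Hadamard) product of a covariance matrix and a taper."""
    p = len(S)
    return [[S[i][j] * W[i][j] for j in range(p)] for i in range(p)]


if __name__ == "__main__":
    data = [
        [1.0, 2.0, 0.5, 4.0],
        [2.0, 1.5, 1.0, 3.0],
        [3.0, 3.5, 2.5, 1.0],
        [4.0, 3.0, 2.0, 2.0],
        [5.0, 5.5, 4.0, 0.5],
    ]
    S = sample_covariance(data)
    T = taper_covariance(S, toeplitz_taper(p=4, bandwidth=1))
    for row in T:
        print([round(v, 3) for v in row])
```

Because the weights depend only on the distance between variable indices, the approach relies on the known ordering of the variables that the abstract assumes.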
