Results 1 - 20 of 22
1.
Bioinformatics ; 37(20): 3667-3669, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33904580

ABSTRACT

SUMMARY: Biclustering is a generalization of clustering used to identify simultaneous grouping patterns in observations (rows) and features (columns) of a data matrix. Recently, the biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well. To address this problem and make convex biclustering a practical tool for analyzing larger data, we propose a fast convex biclustering implementation called COBRAC that reduces computing time by iteratively compressing the problem size along the solution path. We apply COBRAC to several gene expression datasets to demonstrate its effectiveness and efficiency. In addition to the standalone version of COBRAC, we also developed a web server for online calculation and visualization, with downloadable interactive results. AVAILABILITY AND IMPLEMENTATION: The source code and test data are available at https://github.com/haidyi/cvxbiclustr or https://zenodo.org/record/4620218. The web server is available at https://cvxbiclustr.ericchi.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
Comput Sci Eng ; 23(6): 42-51, 2021.
Article in English | MEDLINE | ID: mdl-35784398

ABSTRACT

Modern technologies produce a deluge of complicated data. In neuroscience, for example, minimally invasive experimental methods can take recordings of large populations of neurons at high resolution under a multitude of conditions. Such data arrays possess non-trivial interdependencies along each of their axes. Insights into these data arrays may lay the foundations of advanced treatments for nervous system disorders. The potential impacts of such data, however, will not be fully realized unless the techniques for analyzing them keep pace. Specifically, there is an urgent, growing need for methods for estimating the low-dimensional structure and geometry in big and noisy data arrays. This article reviews a framework for identifying complicated underlying patterns in such data and also recounts the key role that the Department of Energy Computational Sciences Graduate Fellowship played in setting the stage for this work to be done by the author.

3.
IEEE Signal Process Mag ; 37(6): 160-173, 2020 Nov.
Article in English | MEDLINE | ID: mdl-33473243

ABSTRACT

Graph signal processing (GSP) is an important methodology for studying data residing on irregular structures. As acquired data is increasingly taking the form of multi-way tensors, new signal processing tools are needed to maximally utilize the multi-way structure within the data. In this paper, we review modern signal processing frameworks generalizing GSP to multi-way data, starting from graph signals coupled to familiar regular axes such as time in sensor networks, and then extending to general graphs across all tensor modes. This widely applicable paradigm motivates reformulating and improving upon classical problems and approaches to creatively address the challenges in tensor-based data. We synthesize common themes arising from current efforts to combine GSP with tensor analysis and highlight future directions in extending GSP to the multi-way paradigm.
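The core GSP construct the review generalizes, the graph Fourier transform, is easy to sketch: a vertex signal is expanded in the eigenbasis of the graph Laplacian, with eigenvalues playing the role of frequencies. A minimal NumPy sketch of that baseline (the function name and the dense-matrix representation are illustrative choices, not from the paper):

```python
import numpy as np

def graph_fourier(signal, A):
    """Graph Fourier transform: project a vertex signal onto the
    eigenbasis of the combinatorial Laplacian L = D - A. Eigenvalues
    act as graph 'frequencies' (ascending order from eigh)."""
    L = np.diag(A.sum(axis=1)) - A
    evals, evecs = np.linalg.eigh(L)
    return evals, evecs.T @ signal  # spectral coefficients
```

For a constant signal on a connected graph, all energy lands on the zero-frequency (constant) eigenvector, mirroring the DC component of the classical Fourier transform.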

4.
Biometrics ; 73(1): 10-19, 2017 Mar.
Article in English | MEDLINE | ID: mdl-27163413

ABSTRACT

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees, features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stable and reproducible identification of biclusters, on simulated and real microarray data.
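The convex formulation combines a squared-error data fit with weighted fusion penalties that shrink rows and columns of the estimate toward each other. A minimal sketch of that objective (not the COBRA solver itself; the pairwise weight dictionaries are an assumed toy encoding of the paper's sparse kernel weights):

```python
import numpy as np

def biclustering_objective(U, X, gamma, row_w, col_w):
    """Convex biclustering objective: squared-error fit plus fusion
    penalties pulling pairs of rows and pairs of columns of U together.
    row_w / col_w map index pairs to nonnegative weights."""
    fit = 0.5 * np.sum((U - X) ** 2)
    rows = sum(w * np.linalg.norm(U[i] - U[j]) for (i, j), w in row_w.items())
    cols = sum(w * np.linalg.norm(U[:, k] - U[:, l]) for (k, l), w in col_w.items())
    return fit + gamma * (rows + cols)
```

Sweeping gamma from 0 upward traces the solution path: at gamma = 0 the minimizer is the data matrix itself, and for large gamma rows and columns fuse into biclusters.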


Subjects
Cluster Analysis , Data Interpretation, Statistical , Gene Regulatory Networks , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis
5.
Genome Res ; 23(3): 509-18, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23233546

ABSTRACT

Most current genotype imputation methods are model-based and computationally intensive, taking days to impute one chromosome pair on 1000 people. We describe an efficient genotype imputation method based on matrix completion. Our matrix completion method is implemented in MATLAB and tested on real data from HapMap 3, simulated pedigree data, and simulated low-coverage sequencing data derived from the 1000 Genomes Project. Compared with leading imputation programs, the matrix completion algorithm embodied in our program MENDEL-IMPUTE achieves comparable imputation accuracy while reducing run times significantly. Implementation in a lower-level language such as Fortran or C is apt to further improve computational efficiency.
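As a rough illustration of imputation by matrix completion, the following sketch alternates a truncated-SVD refit with restoring the observed entries. It is in the spirit of MENDEL-IMPUTE's matrix-completion approach, but it is a generic stand-in, not the paper's algorithm; the rank parameter is an assumed toy choice:

```python
import numpy as np

def svd_impute(X, mask, rank=1, n_iter=200):
    """Generic matrix-completion sketch: repeatedly fit a low-rank
    approximation and reset the observed entries (mask is True where
    observed). Missing entries converge to the low-rank fit."""
    Z = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Z = np.where(mask, X, Z_hat)  # keep observed entries fixed
    return Z
```

On a genuinely low-rank matrix with a few missing entries, the iteration recovers the held-out values; real genotype matrices are only approximately low-rank, which is why windowed fits and model selection matter in practice.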


Subjects
Artificial Intelligence , Genotype , Models, Genetic , Software , Algorithms , Computer Simulation , Genome, Human , HapMap Project , Humans , Microarray Analysis , Polymorphism, Single Nucleotide
6.
PLoS Comput Biol ; 11(5): e1004228, 2015 May.
Article in English | MEDLINE | ID: mdl-25965340

ABSTRACT

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER, which incorporates the algorithm, is implemented on ATI and NVIDIA graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.


Subjects
Cluster Analysis , Computational Biology/methods , Pattern Recognition, Automated/methods , Algorithms , Databases, Genetic , Gene Expression Profiling/methods , Humans , Software
7.
Comput Stat Data Anal ; 80: 117-128, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25143662

ABSTRACT

Estimation of a covariance matrix or its inverse plays a central role in many statistical methods. For these methods to work reliably, estimated matrices must not only be invertible but also well-conditioned. The current paper introduces a novel prior to ensure a well-conditioned maximum a posteriori (MAP) covariance estimate. The prior shrinks the sample covariance estimator towards a stable target and leads to a MAP estimator that is consistent and asymptotically efficient. Thus, the MAP estimator gracefully transitions towards the sample covariance matrix as the number of samples grows relative to the number of covariates. The utility of the MAP estimator is demonstrated in two standard applications - discriminant analysis and EM clustering - in this sampling regime.
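A linear-shrinkage stand-in conveys the idea: pull the sample covariance toward a scaled-identity target with a weight that fades as the sample size grows, so the estimator transitions toward the sample covariance in large samples. This is a generic sketch, not the paper's MAP estimator; the parameter alpha standing in for the prior's strength is an assumption:

```python
import numpy as np

def shrunk_covariance(X, alpha=0.2):
    """Shrink the sample covariance toward a scaled-identity target.
    The shrinkage weight alpha / (alpha + n) decays with sample size n,
    mimicking how a MAP estimate defers to the data as n grows."""
    n = len(X)
    S = np.cov(X, rowvar=False, bias=True)
    target = np.trace(S) / S.shape[0] * np.eye(S.shape[0])
    w = alpha / (alpha + n)
    return (1 - w) * S + w * target
```

Even when the sample covariance is singular (e.g., perfectly collinear columns), the shrunk estimate is positive definite and hence invertible, which is exactly the well-conditioning property discriminant analysis and EM clustering rely on.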

8.
Int Stat Rev ; 82(1): 46-70, 2014 Apr 01.
Article in English | MEDLINE | ID: mdl-25242858

ABSTRACT

Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include nondifferentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience.
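One concrete instance of the mix-and-match principles such a survey stresses is proximal gradient descent for the lasso: a gradient step on the smooth loss followed by soft-thresholding of the nondifferentiable penalty. A minimal sketch (standard ISTA, chosen here as an illustration rather than taken from the survey):

```python
import numpy as np

def ista_lasso(A, b, lam, n_iter=500):
    """ISTA for min 0.5*||Ax - b||^2 + lam*||x||_1: gradient step on the
    smooth part, then the proximal (soft-threshold) step handles the
    nondifferentiable lasso penalty."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz constant
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)
        z = x - step * g
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x
```

With an orthogonal design the fixed point is the soft-thresholded least-squares solution, which makes the algorithm easy to check by hand.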

9.
Math Program ; 146: 409-436, 2014 Aug 01.
Article in English | MEDLINE | ID: mdl-25392563

ABSTRACT

The problem of minimizing a continuously differentiable convex function over an intersection of closed convex sets is ubiquitous in applied mathematics. It is particularly interesting when it is easy to project onto each separate set, but nontrivial to project onto their intersection. Algorithms based on Newton's method such as the interior point method are viable for small to medium-scale problems. However, modern applications in statistics, engineering, and machine learning are posing problems with potentially tens of thousands of parameters or more. We revisit this convex programming problem and propose an algorithm that scales well with dimensionality. Our proposal is an instance of a sequential unconstrained minimization technique and revolves around three ideas: the majorization-minimization principle, the classical penalty method for constrained optimization, and quasi-Newton acceleration of fixed-point algorithms. The performance of our distance majorization algorithms is illustrated in several applications.
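The three ideas can be combined in a few lines for the special case of projecting a point onto an intersection of balls, where each individual projection is trivial. This is a simplified sketch of the sequential unconstrained strategy, not the paper's full algorithm; the ball representation and the geometric penalty schedule are assumptions:

```python
import numpy as np

def project_ball(x, center, radius):
    # Euclidean projection onto a ball: an "easy" individual projection
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def penalized_projection(b, balls, rho0=1.0, growth=1.5, n_iter=200):
    """min 0.5*||x - b||^2 s.t. x in the intersection of balls:
    majorize each squared set-distance by ||x - P_i(x_k)||^2 and let the
    penalty rho grow. The majorizer's minimizer is a closed-form average."""
    x = b.copy()
    rho, m = rho0, len(balls)
    for _ in range(n_iter):
        projs = [project_ball(x, c, r) for c, r in balls]
        x = (b + rho * sum(projs)) / (1.0 + rho * m)
        rho *= growth
    return x
```

Because each update only needs the individual projections, the cost per iteration scales with the number of sets rather than with a Newton-style factorization, which is the point of the approach for high-dimensional problems.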

10.
Am Math Mon ; 121(2): 95-108, 2014 Feb.
Article in English | MEDLINE | ID: mdl-25242816

ABSTRACT

In a recent issue of this journal, Mordukhovich, Nam, and Salinas pose and solve an interesting non-differentiable generalization of the Heron problem in the framework of modern convex analysis. In the generalized Heron problem, one is given k + 1 closed convex sets in ℝ^d equipped with its Euclidean norm and asked to find the point in the last set such that the sum of the distances to the first k sets is minimal. In later work, the authors generalize the Heron problem even further, relax its convexity assumptions, study its theoretical properties, and pursue subgradient algorithms for solving the convex case. Here, we revisit the original problem solely from the numerical perspective. By exploiting the majorization-minimization (MM) principle of computational statistics and rudimentary techniques from differential calculus, we are able to construct a very fast algorithm for solving the Euclidean version of the generalized Heron problem.
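For the unconstrained case, with finite point sets standing in for the closed convex sets so that projection is a nearest-point lookup, the MM idea reduces to a Weiszfeld-style weighted average of the current projections. A sketch under those simplifying assumptions (not the paper's exact algorithm, which also handles a constraint set):

```python
import numpy as np

def weiszfeld_heron(sets, x0, n_iter=200):
    """MM sketch for min_x sum_i dist(x, C_i) with each C_i a finite
    point set. Quadratic majorization of each distance at the current
    projections p_i yields the Weiszfeld-style weighted-average update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        ps = [min(S, key=lambda p: np.linalg.norm(x - p)) for S in sets]
        w = np.array([1.0 / max(np.linalg.norm(x - p), 1e-12) for p in ps])
        x = sum(wi * np.asarray(p, dtype=float) for wi, p in zip(w, ps)) / w.sum()
    return x
```

With three singleton sets at the vertices of an equilateral triangle, the problem is the classical Fermat point, which by symmetry sits at the centroid.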

11.
Technometrics ; 65(1): 117-126, 2023.
Article in English | MEDLINE | ID: mdl-37448596

ABSTRACT

Building on previous research of Chi and Chi (2022), the current paper revisits estimation in robust structured regression under the L2E criterion. We adopt the majorization-minimization (MM) principle to design a new algorithm for updating the vector of regression coefficients. Our sharp majorization achieves faster convergence than the previous alternating proximal gradient descent algorithm (Chi and Chi, 2022). In addition, we reparameterize the model by substituting precision for scale and estimate precision via a modified Newton's method. This simplifies and accelerates overall estimation. We also introduce distance-to-set penalties to enable constrained estimation under nonconvex constraint sets. This tactic also improves performance in coefficient estimation and structure recovery. Finally, we demonstrate the merits of our improved tactics through a rich set of simulation examples and a real data application.

12.
J Comput Graph Stat ; 32(3): 938-949, 2023.
Article in English | MEDLINE | ID: mdl-37822489

ABSTRACT

Proximal Markov chain Monte Carlo is a construct at the intersection of Bayesian computation and convex optimization that has helped popularize the use of nondifferentiable priors in Bayesian statistics. Existing formulations of proximal MCMC, however, require hyperparameters and regularization parameters to be prespecified. In this work, we extend the paradigm of proximal MCMC by introducing a new class of nondifferentiable priors called epigraph priors. As a proof of concept, we place trend filtering, originally a nonparametric regression problem, in a parametric setting to provide a posterior median fit along with credible intervals as measures of uncertainty. The key idea is to replace the nonsmooth term in the posterior density with its Moreau-Yosida envelope, which enables the application of the gradient-based MCMC sampler Hamiltonian Monte Carlo. The proposed method identifies the appropriate amount of smoothing in a data-driven way, thereby automating regularization parameter selection. Compared with conventional proximal MCMC methods, our method is largely tuning-free, achieving simultaneous calibration of the mean, scale and regularization parameters in a fully Bayesian framework.
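The Moreau-Yosida envelope at the heart of such methods is easy to illustrate in one dimension: the envelope of the absolute value is the Huber function, and its gradient is available through the proximal operator, which is what makes gradient-based samplers applicable. A minimal sketch (the absolute value is chosen purely for illustration):

```python
import numpy as np

def prox_abs(x, lam):
    # proximal operator of |.|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    # Moreau-Yosida envelope of |.|: a smooth (Huber) surrogate
    p = prox_abs(x, lam)
    return np.abs(p) + (x - p) ** 2 / (2 * lam)

def moreau_grad_abs(x, lam):
    # envelope gradient: (x - prox(x)) / lam, defined everywhere
    return (x - prox_abs(x, lam)) / lam
```

Near zero the envelope is quadratic (gradient x/lam), and far from zero it matches |x| up to a constant with gradient ±1, so the surrogate is differentiable everywhere while preserving the penalty's behavior in the tails.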

13.
Technometrics ; 65(4): 537-552, 2023.
Article in English | MEDLINE | ID: mdl-38213317

ABSTRACT

The growing prevalence of tensor data, or multiway arrays, in science and engineering applications motivates the need for tensor decompositions that are robust against outliers. In this paper, we present a robust Tucker decomposition estimator based on the L2 criterion, called the Tucker-L2E. Our numerical experiments demonstrate that Tucker-L2E has empirically stronger recovery performance in more challenging high-rank scenarios compared with existing alternatives. The appropriate Tucker-rank can be selected in a data-driven manner with cross-validation or hold-out validation. The practical effectiveness of Tucker-L2E is validated on real data applications in fMRI tensor denoising, PARAFAC analysis of fluorescence data, and feature extraction for classification of corrupted images.

14.
Stat Anal Data Min ; 15(3): 303-313, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35756358

ABSTRACT

Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm. Nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method to construct row and column affinities even when data are missing by building on a co-clustering technique. This method takes advantage of solving the optimization problem for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates. It exploits the coupled similarity structure among both the rows and columns of a data matrix. We show these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.
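For a fully observed matrix, the Gaussian kernel affinities that the paper generalizes can be computed in a few lines. A minimal sketch of that fully observed baseline (the bandwidth parameter phi is an assumed toy choice; column affinities follow by applying the same function to the transpose):

```python
import numpy as np

def gaussian_affinities(X, phi=1.0):
    """Gaussian kernel weights between rows of a fully observed matrix:
    W[i, j] = exp(-phi * ||x_i - x_j||^2), with zero self-affinity."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-phi * d2)
    np.fill_diagonal(W, 0.0)
    return W
```

The resulting weight matrix is symmetric with entries in [0, 1], decaying with squared Euclidean distance; missing entries break the distance computation, which is the gap the paper's co-clustering construction fills.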

15.
J Comput Graph Stat ; 31(4): 1051-1062, 2022.
Article in English | MEDLINE | ID: mdl-36721836

ABSTRACT

We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods with the L2 criterion. In addition to introducing an algorithm for performing L2E regression, our framework enables robust regression with the L2 criterion for additional structural constraints, works without requiring complex tuning procedures on the precision parameter, can be used to identify heterogeneous subpopulations, and can incorporate readily available non-robust structured regression solvers. We provide convergence guarantees for the framework and demonstrate its flexibility with some examples. Supplementary materials for this article are available online.

16.
J Comput Graph Stat ; 30(1): 115-124, 2021.
Article in English | MEDLINE | ID: mdl-34025100

ABSTRACT

Joint models are popular for analyzing data with multivariate responses. We propose a sparse multivariate single index model, where responses and predictors are linked by unspecified smooth functions and multiple matrix-level penalties are employed to select predictors and induce low-rank structures across responses. An alternating direction method of multipliers (ADMM)-based algorithm is proposed for model estimation. We demonstrate the effectiveness of the proposed model in simulation studies and an application to a genetic association study.

17.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 4432-4435, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34892203

ABSTRACT

Coronary bifurcation lesions are a leading cause of coronary artery disease (CAD). Despite their prevalence, coronary bifurcation lesions remain difficult to treat due to our incomplete understanding of how various features of lesion anatomy synergistically disrupt normal hemodynamic flow. In this work, we employ an interpretable machine learning algorithm, the Classification and Regression Tree (CART), to model the impact of these geometric features on local hemodynamic quantities. We generate a synthetic arterial database via computational fluid dynamics simulations and apply the CART approach to predict the time-averaged wall shear stress (TAWSS) at two different locations within the cardiac vasculature. Our experimental results show that CART can estimate a simple, interpretable, yet accurately predictive nonlinear model of TAWSS as a function of such features. Clinical relevance: The fitted tree models have the potential to refine predictions of disturbed hemodynamic flow based on an individual's cardiac and lesion anatomy, and consequently make progress towards personalized treatment planning for CAD patients.
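CART itself is easy to illustrate in miniature: a single-split regression "stump" chooses the threshold that minimizes squared error, the same criterion a full tree applies recursively. A toy sketch (one feature, depth one; not the paper's fitted models):

```python
import numpy as np

def fit_stump(x, y):
    """Minimal CART-style regression stump: scan candidate thresholds on
    one feature and keep the split with the smallest total squared error.
    Returns (threshold, left prediction, right prediction)."""
    best = (np.inf, None, y.mean(), y.mean())
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)
```

A full CART regressor grows such splits recursively over many geometric features, which is what yields the interpretable piecewise-constant model of TAWSS described in the abstract.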


Subjects
Coronary Artery Disease , Hemodynamics , Heart , Humans , Machine Learning , Stress, Mechanical
18.
Sci Rep ; 11(1): 8145, 2021 Apr 14.
Article in English | MEDLINE | ID: mdl-33854076

ABSTRACT

Conventional invasive diagnostic imaging techniques do not adequately resolve complex Type B and C coronary lesions, which present unique challenges, require personalized treatment and result in worsened patient outcomes. These lesions are often excluded from large-scale non-invasive clinical trials, and no validated approach exists to characterize hemodynamic quantities and guide percutaneous intervention for such lesions. This work identifies key biomarkers that differentiate complex Type B and C lesions from simple Type A lesions by introducing and validating a coronary angiography-based computational fluid dynamics (CFD-CA) framework for intracoronary assessment of complex lesions at ultrahigh resolution. Among the 14 patients selected for this study, 7 patients with Type B and C lesions formed the complex lesion group, including ostial, bifurcation and serial lesions, as well as lesions where flow was supplied by a collateral bed. The simple lesion group included 7 patients with lesions that were discrete, [Formula: see text] long and readily accessible. Intracoronary assessment was performed using the CFD-CA framework and validated by comparison with clinically measured pressure-based indices, such as FFR. Local pressure, endothelial shear stress (ESS) and velocity profiles were derived for all patients. We validate the accuracy of our CFD-CA framework and report excellent agreement with invasive measurements ([Formula: see text]). The ultrahigh resolution achieved by the model enables physiological assessment of complex lesions and quantification of hemodynamic metrics in all vessels down to 1 mm in diameter. Importantly, we demonstrate that, in contrast to traditional pressure-based metrics, there is a significant difference in intracoronary hemodynamic forces, such as ESS, in complex lesions compared with simple lesions at both resting and hyperemic physiological states [n = 14, [Formula: see text]]. Higher ESS was observed in the complex lesion group ([Formula: see text] Pa) than in the simple lesion group ([Formula: see text] Pa). Complex coronary lesions have higher ESS than simple lesions; such differential hemodynamic evaluation can provide much-needed insight into the increase in adverse outcomes for these patients and has incremental prognostic value over traditional pressure-based indices, such as FFR.


Subjects
Coronary Angiography/methods , Coronary Disease/diagnostic imaging , Radiographic Image Interpretation, Computer-Assisted/methods , Computer Simulation , Coronary Disease/classification , Differential Diagnosis , Hemodynamics , Humans , Shear Strength
19.
Stat ; 8(1), 2020.
Article in English | MEDLINE | ID: mdl-32655193

ABSTRACT

Canonical correlation analysis (CCA) is a multivariate analysis technique for estimating a linear relationship between two sets of measurements. Modern acquisition technologies, for example, those arising in neuroimaging and remote sensing, produce data in the form of multidimensional arrays or tensors. Classic CCA is not appropriate for dealing with tensor data due to the multidimensional structure and ultrahigh dimensionality of such modern data. In this paper, we present tensor CCA (TCCA) to discover relationships between two tensors while simultaneously preserving the multidimensional structure of the tensors and utilizing substantially fewer parameters. Furthermore, we show how to employ a parsimonious covariance structure to gain additional stability and efficiency. We delineate population and sample problems for each model and propose efficient estimation algorithms with global convergence guarantees. We also describe a probabilistic model for TCCA that enables the generation of synthetic data with desired canonical variates and correlations. Simulation studies illustrate the performance of our methods.
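Classic vector CCA, the baseline that TCCA generalizes, can be sketched via whitening and an SVD: after whitening each block, the singular values of the cross-covariance are the canonical correlations. A minimal sketch (assumes nonsingular covariance blocks; not the paper's tensor algorithm):

```python
import numpy as np

def cca(X, Y):
    """Classical CCA: whiten each block with a Cholesky factor, then the
    singular values of the whitened cross-covariance are the canonical
    correlations (returned in descending order)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Sxx = Xc.T @ Xc / (len(X) - 1)
    Syy = Yc.T @ Yc / (len(Y) - 1)
    Sxy = Xc.T @ Yc / (len(X) - 1)
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    return np.linalg.svd(Wx @ Sxy @ Wy.T, compute_uv=False)
```

When the two measurement sets are identical, every canonical correlation is 1, a handy sanity check; the tensor version replaces the flat covariance blocks with structured (e.g., Kronecker-separable) counterparts to tame the parameter count.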

20.
Article in English | MEDLINE | ID: mdl-33312074

ABSTRACT

Cluster analysis is a fundamental tool for pattern discovery in complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees, and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.
