Results 1 - 20 of 70
1.
Neural Netw ; 170: 578-595, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38052152

ABSTRACT

Principal Component Analysis (PCA) and its nonlinear extension Kernel PCA (KPCA) are widely used across science and industry for data analysis and dimensionality reduction. Modern deep learning tools have achieved great empirical success, but a framework for deep principal component analysis is still lacking. Here we develop a deep kernel PCA methodology (DKPCA) to extract multiple levels of the most informative components of the data. Our scheme can effectively identify new hierarchical variables, called deep principal components, capturing the main characteristics of high-dimensional data through a simple and interpretable numerical optimization. We couple the principal components of multiple KPCA levels, theoretically showing that DKPCA creates both forward and backward dependency across levels, which has not been explored in kernel methods and yet is crucial for extracting more informative features. Experimental evaluations on multiple data types show that DKPCA finds more efficient and disentangled representations with higher explained variance in fewer principal components, compared to shallow KPCA. We demonstrate that our method allows for effective hierarchical data exploration, with the ability to separate the key generative factors of the input data both for large datasets and when few training samples are available. Overall, DKPCA can facilitate the extraction of useful patterns from high-dimensional data by learning more informative features organized in different levels, offering diverse views of the variation factors in the data, while maintaining a simple mathematical formulation.
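For reference, the snippet below is a minimal shallow kernel PCA baseline of the kind DKPCA is compared against, using scikit-learn; it is not the DKPCA method itself, and the kernel choice and parameters are illustrative.

```python
# Minimal shallow KPCA baseline (the method DKPCA is compared against),
# not the DKPCA algorithm itself; kernel and parameters are illustrative.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # toy high-dimensional data

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)                    # shallow (single-level) principal components
print(Z.shape)                               # (200, 3)
```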


Subject(s)
Algorithms , Principal Component Analysis
2.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10044-10054, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37028385

ABSTRACT

Asymmetric kernels naturally arise in real life, e.g., for conditional probabilities and directed graphs. However, most existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses asymmetric kernel-based learning in the framework of the least squares support vector machine, yielding AsK-LS, the first classification method that can use asymmetric kernels directly. We show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Moreover, the computational cost of AsK-LS is no higher than that of dealing with symmetric kernels. Experimental results on various tasks, including Corel, PASCAL VOC, Satellite, directed graphs, and the UCI database, all show that when asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than existing kernel methods that rely on symmetrization to accommodate asymmetric kernels.
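For context, the sketch below solves the standard symmetric-kernel LS-SVM classification dual system that AsK-LS builds on; the actual AsK-LS formulation for asymmetric kernels differs and is not reproduced here.

```python
# Standard (symmetric-kernel) LS-SVM classifier in the dual, shown only as the
# starting point AsK-LS extends; this is NOT the asymmetric AsK-LS solver itself.
import numpy as np

def lssvm_fit(K, y, gamma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * K        # label-weighted kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                       # bias b, dual variables alpha

def lssvm_predict(K_test_train, alpha, b, y_train):
    return np.sign(K_test_train @ (alpha * y_train) + b)
```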


Subject(s)
Algorithms , Learning , Least-Squares Analysis
3.
Neural Comput ; 34(10): 2009-2036, 2022 Sep 12.
Article in English | MEDLINE | ID: mdl-36027763

ABSTRACT

Disentanglement is a useful property in representation learning, which increases the interpretability of generative models such as variational autoencoders (VAEs), generative adversarial networks, and their many variants. Typically in such models, an increase in disentanglement performance is traded off against generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variation. The proposed objective is the sum of an autoencoder error term and a principal component analysis reconstruction error in the feature space. This has an interpretation as a restricted kernel machine with the eigenvector matrix valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use Cayley ADAM, a stochastic optimization method on the Stiefel manifold, together with the Adam optimizer. Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
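The constraint mentioned here lives on the Stiefel manifold (matrices with orthonormal columns). The sketch below only illustrates that constraint via a standard SVD-based projection; it is not the Cayley ADAM optimizer used in the paper.

```python
# Illustration of the Stiefel-manifold constraint (orthonormal columns) via an
# SVD-based projection; this is not the Cayley ADAM optimizer from the paper.
import numpy as np

def project_to_stiefel(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt                                # nearest matrix with orthonormal columns

W = np.random.default_rng(0).normal(size=(8, 3))
Q = project_to_stiefel(W)
print(np.allclose(Q.T @ Q, np.eye(3)))           # True
```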

4.
Article in English | MEDLINE | ID: mdl-35802546

ABSTRACT

Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy, as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels, including corrupted ones, and hence hardly generalize to clean datasets. In this article, we focus on the problem of learning with noisy labels and introduce a compression inductive bias into network architectures to alleviate this overfitting problem. More precisely, we revisit a classical regularization technique, Dropout, and its variant Nested Dropout. Dropout can serve as a compression constraint through its feature-dropping mechanism, while Nested Dropout further learns feature representations ordered by importance. Moreover, the trained models with compression regularization are further combined with co-teaching for a performance boost. Theoretically, we conduct a bias-variance decomposition of the objective function under compression regularization, analyzing both the single-model and co-teaching settings. This decomposition provides three insights: 1) it shows that overfitting is indeed an issue in learning with noisy labels; 2) through an information bottleneck formulation, it explains why the proposed feature compression helps combat label noise; and 3) it explains the performance boost brought by incorporating compression regularization into co-teaching. Experiments show that our simple approach achieves comparable or even better performance than state-of-the-art methods on benchmarks with real-world label noise, including Clothing1M and ANIMAL-10N. Our implementation is available at https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/.
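The sketch below illustrates the nested-dropout mechanism described above: a cutoff index is sampled per example and all units after it are dropped, so earlier units are pushed to carry the most important information. The cutoff distribution and parameter names are illustrative assumptions, not the article's exact choices.

```python
# Illustrative nested dropout: sample a per-example cutoff and zero all units
# after it, inducing an importance ordering on the features. The geometric
# cutoff distribution and parameter names are illustrative choices.
import numpy as np

def nested_dropout(h, rng, p=0.1):
    """h: (batch, units) activations; keep units [0, k), zero the rest."""
    batch, units = h.shape
    k = np.minimum(rng.geometric(p, size=batch), units)   # per-example cutoff
    mask = (np.arange(units)[None, :] < k[:, None]).astype(h.dtype)
    return h * mask

h = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
print(nested_dropout(h, np.random.default_rng(1)))
```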

5.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 7128-7148, 2022 Oct.
Article in English | MEDLINE | ID: mdl-34310285

ABSTRACT

The class of random features is one of the most popular techniques for speeding up kernel methods in large-scale problems. Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of work on random features has grown rapidly, so a comprehensive overview of this topic explaining the connections among various algorithms and theoretical results is desirable. In this survey, we systematically review the work on random features from the past ten years. First, the motivations, characteristics and contributions of representative random-features-based algorithms are summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data. Second, we review theoretical results that center around the following key question: how many random features are needed to ensure high approximation quality, or no loss in the empirical/expected risk of the learned estimator. Third, we provide a comprehensive evaluation of popular random-features-based algorithms on several large-scale benchmark datasets and discuss their approximation quality and prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized deep neural networks (DNNs), including the use of high-dimensional random features in the analysis of DNNs as well as the gaps between current theoretical and empirical results. This survey may serve as a gentle introduction to the topic and as a users' guide for practitioners interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope that this survey will facilitate discussion on the open problems in this area and, more importantly, shed light on future research directions. Due to the page limit, we suggest that readers refer to the full version of this survey: https://arxiv.org/abs/2004.11154.
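The canonical example covered by this line of work is the random Fourier feature map for the Gaussian kernel; a minimal sketch follows, with the number of features and bandwidth as illustrative parameters.

```python
# Minimal random Fourier features for approximating a Gaussian (RBF) kernel,
# the canonical random-features example; parameter names are illustrative.
import numpy as np

def rff_features(X, D=500, gamma=1.0, seed=0):
    """Map X (n, d) to D random features so that z(x) @ z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(100, 5))
Z = rff_features(X)
K_approx = Z @ Z.T    # approximates the exact RBF kernel matrix
```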

6.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 7975-7988, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34648434

ABSTRACT

In this paper, we develop a quadrature framework for large-scale kernel machines via a numerical integration representation. Considering that the integration domain and measure of typical kernels, e.g., Gaussian kernels and arc-cosine kernels, are fully symmetric, we leverage a numerical integration technique, deterministic fully symmetric interpolatory rules, to efficiently compute quadrature nodes and associated weights for kernel approximation. Thanks to the fully symmetric property, the applied interpolatory rules are able to reduce the number of needed nodes while retaining high approximation accuracy. Further, we randomize the above deterministic rules by classical Monte-Carlo sampling and control variates techniques, with two merits: 1) the proposed stochastic rules make the dimension of the feature mapping flexibly varying, such that we can control the discrepancy between the original and approximate kernels by tuning the dimension; 2) our stochastic rules have the desirable statistical properties of unbiasedness and variance reduction. In addition, we elucidate the relationship between our deterministic/stochastic interpolatory rules and current typical quadrature-based rules for kernel approximation, thereby unifying these methods under our framework. Experimental results on several benchmark datasets show that our methods compare favorably with other representative kernel approximation methods.

7.
IEEE Trans Neural Netw Learn Syst ; 33(11): 6373-6387, 2022 11.
Article in English | MEDLINE | ID: mdl-34048348

ABSTRACT

The adaptive hinging hyperplane (AHH) model is a popular piecewise-linear representation with a generalized tree structure and has been successfully applied in dynamic system identification. In this article, we construct the deep AHH (DAHH) model to extend and generalize the networking of the AHH model for high-dimensional problems. The network structure of DAHH is determined through forward growth, in which an activity ratio is introduced to select effective neurons and no connecting weights are involved between the layers. All neurons in the DAHH network can then be flexibly connected to the output in a skip-layer format, and only the corresponding weights are the parameters to optimize. With such a network framework, the backpropagation algorithm can be implemented in DAHH to efficiently tackle large-scale problems, and the vanishing gradient problem is not encountered in the training of DAHH. In fact, the optimization problem of DAHH remains convex with a convex loss in the output layer, which brings natural advantages in optimization. Unlike existing neural networks, DAHH is easier to interpret: neurons are connected sparsely and analysis of variance (ANOVA) decomposition can be applied, helping to reveal the interactions between variables. A theoretical analysis of the universal approximation ability and explicit domain partitions is also provided. Numerical experiments verify the effectiveness of the proposed DAHH.


Subject(s)
Algorithms , Neural Networks, Computer , Neurons/physiology , Brain
8.
Neural Netw ; 142: 661-679, 2021 Oct.
Article in English | MEDLINE | ID: mdl-34399376

ABSTRACT

We introduce Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled data representations. We propose augmenting the original deep restricted kernel machine formulation for kernel PCA by orthogonality constraints on the latent variables to promote disentanglement and to make it possible to carry out optimization without first defining a stabilized objective. After discussing a number of algorithms for end-to-end training, we quantitatively evaluate the proposed method's effectiveness in disentangled feature learning. We demonstrate on four benchmark datasets that this approach performs similarly overall to β-VAE on several disentanglement metrics when few training points are available while being less sensitive to randomness and hyperparameter selection than β-VAE. We also present a deterministic initialization of Constr-DRKM's training algorithm that significantly improves the reproducibility of the results. Finally, we empirically evaluate and discuss the role of the number of layers in the proposed methodology, examining the influence of each principal component in every layer and showing that components in lower layers act as local feature detectors capturing the broad trends of the data distribution, while components in deeper layers use the representation learned by previous layers and more accurately reproduce higher-level features.


Subject(s)
Algorithms , Unsupervised Machine Learning , Reproducibility of Results
9.
Neural Netw ; 135: 177-191, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33395588

ABSTRACT

This paper introduces a novel framework for generative models based on Restricted Kernel Machines (RKMs) with joint multi-view generation and uncorrelated feature learning, called Gen-RKM. To enable joint multi-view generation, this mechanism uses a shared representation of data from various views. Furthermore, the model has a primal and dual formulation to incorporate both kernel-based and (deep convolutional) neural network based models within the same setting. When using neural networks as explicit feature-maps, a novel training procedure is proposed, which jointly learns the features and shared subspace representation. The latent variables are given by the eigen-decomposition of the kernel matrix, where the mutual orthogonality of eigenvectors represents the learned uncorrelated features. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of generated samples on various standard datasets.


Subject(s)
Deep Learning , Neural Networks, Computer , Pattern Recognition, Automated/methods , Photic Stimulation/methods , Humans
10.
Neural Netw ; 125: 1-9, 2020 May.
Article in English | MEDLINE | ID: mdl-32062409

ABSTRACT

Long Short-Term Memory (LSTM) has shown strong performance on many real-world applications due to its ability to capture long-term dependencies. In this paper, we use LSTM to obtain a data-driven forecasting model for a weather forecasting application. Moreover, we propose Transductive LSTM (T-LSTM), which exploits local information in time-series prediction. In transductive learning, samples in the vicinity of the test point are given a higher impact on fitting the model. In this study, a quadratic cost function is considered for the regression problem. The objective function is localized by using a weighted quadratic cost function in which samples in the neighborhood of the test point receive larger weights. We investigate two weighting schemes based on the cosine similarity between the training samples and the test point. In order to assess the performance of the proposed method in different weather conditions, the experiments are conducted on two different time periods of a year. The results show that T-LSTM achieves better performance on the prediction task.
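A minimal sketch of the weighting idea described above follows: cosine similarity to the test point defines per-sample weights in a weighted quadratic cost. The paper's exact weighting schemes may differ; names and the clipping choice here are illustrative assumptions.

```python
# Sketch of transductive weighting: training samples closer (by cosine similarity)
# to the test point get larger weights in a weighted quadratic loss.
# The exact scheme and names are illustrative, not the paper's.
import numpy as np

def cosine_weights(X_train, x_test, eps=1e-12):
    num = X_train @ x_test
    den = np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_test) + eps
    sim = num / den                      # cosine similarity in [-1, 1]
    w = np.clip(sim, 0.0, None)          # one simple choice: keep the non-negative part
    return w / (w.sum() + eps)

def weighted_mse(y_true, y_pred, w):
    return np.sum(w * (y_true - y_pred) ** 2)
```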


Subject(s)
Neural Networks, Computer , Weather
11.
IEEE Trans Neural Netw Learn Syst ; 31(8): 2965-2979, 2020 Aug.
Article in English | MEDLINE | ID: mdl-31514157

ABSTRACT

Random Fourier features (RFFs) have been successfully employed for kernel approximation in large-scale settings. The rationale behind RFFs relies on Bochner's theorem, but its conditions are too strict and exclude many widely used kernels, e.g., dot-product kernels (which violate the shift-invariance condition) and indefinite kernels (which violate the positive definite (PD) condition). In this article, we present a unified RFF framework for indefinite kernel approximation in reproducing kernel Krein spaces (RKKSs). Our model is also suited to approximating a dot-product kernel on the unit sphere, as it can be transformed into a shift-invariant but indefinite kernel. By the Kolmogorov decomposition scheme, an indefinite kernel in an RKKS can be decomposed into the difference of two unknown PD kernels. The spectral distribution of each underlying PD kernel can be formulated as a nonparametric Bayesian Gaussian mixture model. Based on this, we propose a double-infinite Gaussian mixture model in RFF by placing a Dirichlet process prior. It takes full advantage of the flexibility in the number of components and is capable of approximating indefinite kernels on a wide scale. For model inference, we develop a non-conjugate variational algorithm with a sub-sampling scheme for posterior inference. It allows for the non-conjugate case in our model and is quite efficient due to the sub-sampling strategy. Experimental results on several large classification data sets demonstrate the effectiveness of our nonparametric Bayesian model for indefinite kernel approximation when compared to other representative random-feature-based methods.
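The decomposition of an indefinite kernel into the difference of two PD parts can be illustrated numerically on a finite kernel matrix, as in the sketch below; this only illustrates the decomposition itself, not the paper's Dirichlet-process inference.

```python
# Sketch of splitting an indefinite (symmetric) kernel matrix into the difference
# of two PSD parts via its eigendecomposition; a numerical illustration only.
import numpy as np

def psd_split(K):
    vals, vecs = np.linalg.eigh((K + K.T) / 2)   # symmetrize for numerical safety
    K_plus = (vecs * np.clip(vals, 0, None)) @ vecs.T
    K_minus = (vecs * np.clip(-vals, 0, None)) @ vecs.T
    return K_plus, K_minus                        # K ≈ K_plus - K_minus

K = np.array([[1.0, 2.0], [2.0, 1.0]])            # indefinite toy matrix
Kp, Km = psd_split(K)
print(np.allclose(K, Kp - Km))                    # True
```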

12.
PLoS One ; 14(6): e0217967, 2019.
Article in English | MEDLINE | ID: mdl-31173619

ABSTRACT

Kernel regression models have been used as non-parametric methods for fitting experimental data. However, due to their non-parametric nature, they belong to the so-called "black box" models, meaning that the relation between the input variables and the output, depending on the kernel selection, is unknown. In this paper we propose a new methodology to retrieve the relation between each input regressor variable and the output in a least squares support vector machine (LS-SVM) regression model. The method is based on oblique subspace projectors (ObSP), which make it possible to decouple the influence of input regressors on the output by including the undesired variables in the null space of the projection matrix. Such functional relations are represented by the nonlinear transformation of the input regressors, and their subspaces are estimated using appropriate kernel evaluations. We exploit the properties of ObSP in order to decompose the output of the obtained regression model as a sum of the partial nonlinear contributions and interaction effects of the input variables; we call this methodology Nonlinear ObSP (NObSP). We compare the performance of the proposed algorithm with the component selection and smoothing operator (COSSO) for smoothing spline ANOVA models. As benchmarks, we use two toy examples and a real-life regression problem based on the concrete strength dataset from the UCI machine learning repository. We show that NObSP outperforms COSSO, producing stable estimates of the functional relations between the input regressors and the output without the use of prior knowledge. This methodology can be used to understand the functional relations between the inputs and the output in a regression model, recovering the physical interpretation of the regression model.
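The linear-algebra building block here is the oblique projector onto one subspace along another. A small numerical sketch follows for the linear, finite-dimensional case; it is an illustration of the projector, not the kernelized NObSP model.

```python
# Numeric sketch of an oblique projector onto range(A) along range(B),
# the building block behind ObSP; illustration only.
import numpy as np

def oblique_projector(A, B):
    Pb_perp = np.eye(B.shape[0]) - B @ np.linalg.pinv(B)   # project out range(B)
    return A @ np.linalg.pinv(Pb_perp @ A) @ Pb_perp

rng = np.random.default_rng(0)
A, B = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
P = oblique_projector(A, B)
print(np.allclose(P @ A, A), np.allclose(P @ B, 0, atol=1e-10))  # True True
```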


Subject(s)
Least-Squares Analysis , Machine Learning , Models, Statistical , Support Vector Machine , Algorithms , Artificial Intelligence
13.
IEEE Trans Neural Netw Learn Syst ; 30(3): 765-776, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30047906

ABSTRACT

In kernel methods, kernels are often required to be positive definite, which restricts the use of many indefinite kernels. To accommodate these non-positive-definite kernels, in this paper we build an indefinite kernel learning framework for kernel logistic regression (KLR). The proposed indefinite KLR (IKLR) model is analyzed in the reproducing kernel Krein spaces and is nonconvex. Using the positive decomposition of a non-positive-definite kernel, the derived IKLR model can be decomposed into the difference of two convex functions. Accordingly, a concave-convex procedure (CCCP) is introduced to solve the nonconvex optimization problem. Since the CCCP has to solve a subproblem in each iteration, we propose a concave-inexact-convex procedure (CCICP) algorithm with an inexact solving scheme to accelerate the solving process. In addition, we propose a stochastic variant of CCICP to efficiently obtain a proximal solution, which achieves a similar purpose to the inexact solving scheme in CCICP. Convergence analyses of the above two variants of CCCP are conducted. As a result, our method works effectively not only in a deterministic setting but also in a stochastic setting. Experimental results on several benchmarks suggest that the proposed IKLR model performs favorably against the standard (positive definite) KLR and other competitive indefinite-learning-based algorithms.

14.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4621-4632, 2018 10.
Article in English | MEDLINE | ID: mdl-29990204

ABSTRACT

In pattern classification, polynomial classifiers are well-studied methods as they are capable of generating complex decision surfaces. Unfortunately, the use of multivariate polynomials is limited to kernels as in support-vector machines, because polynomials quickly become impractical for high-dimensional problems. In this paper, we effectively overcome the curse of dimensionality by employing the tensor train (TT) format to represent a polynomial classifier. Based on the structure of TTs, two learning algorithms are proposed, which involve solving different optimization problems of low computational complexity. Furthermore, we show how both regularization to prevent overfitting and parallelization, which enables the use of large training sets, are incorporated into these methods. The efficiency and efficacy of our tensor-based polynomial classifier are then demonstrated on the two popular data sets U.S. Postal Service and Modified NIST.

15.
IEEE Trans Neural Netw Learn Syst ; 29(5): 2025-2030, 2018 05.
Article in English | MEDLINE | ID: mdl-28362619

ABSTRACT

This brief proposes a truncated ℓ1 distance (TL1) kernel, which results in a classifier that is nonlinear in the global region but linear in each subregion. With this kernel, the subregion structure can be trained using all the training data and local linear classifiers can be established simultaneously. The TL1 kernel adapts well to nonlinearity and is suitable for problems that require different nonlinearities in different areas. Though the TL1 kernel is not positive semidefinite, some classical kernel learning methods are still applicable, which means that the TL1 kernel can be used directly in standard toolboxes by replacing the kernel evaluation. In numerical experiments, the TL1 kernel with a pre-given parameter achieves similar or better performance than the radial basis function kernel with the parameter tuned by cross-validation, suggesting that the TL1 kernel is a promising nonlinear kernel for classification tasks.
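A sketch of a truncated ℓ1-distance kernel of this general form, plugged into a standard SVM toolbox by replacing the kernel evaluation as the abstract suggests, is given below; the exact kernel definition and the value of the parameter rho are illustrative assumptions, not necessarily the paper's.

```python
# Sketch of a truncated l1-distance kernel of the form max(rho - ||x - y||_1, 0),
# used as a custom kernel in a standard SVM toolbox; rho is an illustrative choice.
import numpy as np
from sklearn.svm import SVC

def tl1_kernel(X, Y, rho=2.8):
    d = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)   # pairwise l1 distances
    return np.maximum(rho - d, 0.0)

X = np.random.default_rng(0).normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 0).astype(int)
clf = SVC(kernel=lambda A, B: tl1_kernel(A, B, rho=2.8))     # replace only the kernel evaluation
clf.fit(X, y)
```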

16.
IEEE Trans Neural Netw Learn Syst ; 29(7): 3199-3213, 2018 07.
Article in English | MEDLINE | ID: mdl-28783648

ABSTRACT

Domain adaptation learning is one of the fundamental research topics in pattern recognition and machine learning. This paper introduces a regularized semipaired kernel canonical correlation analysis formulation for learning a latent space for the domain adaptation problem. The optimization problem is formulated in the primal-dual least squares support vector machine setting, where side information can be readily incorporated through regularization terms. The proposed model learns a joint representation of the data set across different domains by solving a generalized eigenvalue problem or a linear system of equations in the dual. The approach is naturally equipped with an out-of-sample extension property, which plays an important role in model selection. Furthermore, the Nyström approximation technique is used to keep the computation feasible despite the large size of the matrices involved in the eigendecomposition. The learned latent space of the source domain is fed to a multiclass semisupervised kernel spectral clustering model that can learn from both labeled and unlabeled data points of the source domain in order to classify the data instances of the target domain. Experimental results illustrate the effectiveness of the proposed approaches on synthetic and real-life data sets.
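For reference, the Nyström approximation mentioned above can be sketched in a few lines: the full kernel matrix is approximated from a small set of landmark points. The landmark count and kernel choice here are illustrative.

```python
# Minimal Nyström sketch: approximate an n x n kernel matrix from m << n
# landmark columns; illustrative only, not the paper's full pipeline.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
m = 50
idx = rng.choice(len(X), size=m, replace=False)
C = rbf_kernel(X, X[idx])                 # n x m kernel block
W = rbf_kernel(X[idx], X[idx])            # m x m landmark kernel
K_approx = C @ np.linalg.pinv(W) @ C.T    # Nyström approximation of the full Gram matrix
```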

17.
Entropy (Basel) ; 20(3)2018 Mar 06.
Article in English | MEDLINE | ID: mdl-33265262

ABSTRACT

This paper studies matrix completion problems in which the entries are contaminated by non-Gaussian noise or outliers. The proposed approach employs a nonconvex loss function induced by the maximum correntropy criterion. With this loss function, we develop both a rank-constrained and a nuclear-norm-regularized model, which are resistant to non-Gaussian noise and outliers. However, the non-convexity also leads to certain difficulties. To tackle this problem, we use simple iterative soft- and hard-thresholding strategies. We show that, when extended to general affine rank minimization problems and under proper conditions, certain recoverability results can be obtained for the proposed algorithms. Numerical experiments indicate the improved performance of our proposed approach.
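The robustness comes from the correntropy-induced loss, which can be illustrated with a small sketch: residuals are passed through a Gaussian function, so large (outlier) errors saturate instead of growing quadratically. The kernel width sigma below is an illustrative parameter.

```python
# Sketch of a correntropy-induced (Welsch-type) loss, illustrating its robustness:
# large residuals saturate instead of growing quadratically. Sigma is illustrative.
import numpy as np

def correntropy_loss(residual, sigma=1.0):
    return 1.0 - np.exp(-residual ** 2 / (2.0 * sigma ** 2))

e = np.linspace(-5, 5, 11)
print(np.round(correntropy_loss(e), 3))   # bounded, unlike the squared loss
```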

18.
Entropy (Basel) ; 20(4)2018 Apr 10.
Article in English | MEDLINE | ID: mdl-33265355

ABSTRACT

Entropy measures have long been used to quantify the information content of a dynamical system. One well-known methodology is sample entropy, a model-free approach that can be deployed to measure the information transfer in time series. Sample entropy is based on conditional entropy, where a major concern is the number of past delays in the conditional term. In this study, we deploy a lag-specific conditional entropy to identify the informative past values. Moreover, considering the seasonal structure of the data, we propose a clustering-based sample entropy to exploit temporal information. Clustering-based sample entropy follows the sample entropy definition while taking into account the clustering information of the training data and the membership of the test point to the clusters. We utilize the proposed method for transductive feature selection in black-box weather forecasting and conduct experiments on minimum and maximum temperature prediction in Brussels for 1-6 days ahead. The results reveal that considering the local structure of the data can improve feature selection performance. In addition, despite the large reduction in the number of features, the performance is competitive with using all features.

19.
IEEE Trans Neural Netw Learn Syst ; 29(8): 3863-3869, 2018 08.
Article in English | MEDLINE | ID: mdl-28650828

ABSTRACT

In this brief, kernel principal component analysis (KPCA) is reinterpreted as the solution to a convex optimization problem. In fact, there is one constrained convex problem for each principal component, and the constraints guarantee that the principal component is indeed a solution and not a mere saddle point. Although these insights do not imply any algorithmic improvement, they can be used to further understand the method, formulate possible extensions, and address them properly. As an example, a new convex optimization problem for semisupervised classification is proposed, which seems particularly well suited when the number of known labels is small. Our formulation resembles a least squares support vector machine problem with a regularization parameter multiplied by a negative sign, combined with a variational principle for KPCA. Our primal optimization principle for semisupervised learning is solved in terms of the Lagrange multipliers. Numerical experiments on several classification tasks illustrate the performance of the proposed model in problems with only a few labeled data points.

20.
Neural Comput ; 29(8): 2123-2163, 2017 08.
Article in English | MEDLINE | ID: mdl-28562217

ABSTRACT

The aim of this letter is to propose a theory of deep restricted kernel machines, offering new foundations for deep learning with kernel machines. From the viewpoint of deep learning, it is partially related to restricted Boltzmann machines, which are characterized by visible and hidden units in a bipartite graph without hidden-to-hidden connections, and to deep learning extensions such as deep belief networks and deep Boltzmann machines. From the viewpoint of kernel machines, it includes least squares support vector machines for classification and regression, kernel principal component analysis (PCA), matrix singular value decomposition, and Parzen-type models. A key element is to first characterize these kernel machines in terms of so-called conjugate feature duality, yielding a representation with visible and hidden units. It is shown how this is related to the energy form in restricted Boltzmann machines, with continuous variables in a nonprobabilistic setting. In this new framework of so-called restricted kernel machine (RKM) representations, the dual variables correspond to hidden features. Deep RKMs are obtained by coupling RKMs. The method is illustrated for a deep RKM consisting of three levels, with a least squares support vector machine regression level and two kernel PCA levels. In its primal form, deep feedforward neural networks can also be trained within this framework.
