ABSTRACT
Catastrophic forgetting (CF) happens whenever a neural network overwrites past knowledge while being trained on new tasks. Common techniques to handle CF include regularization of the weights (using, e.g., their importance on past tasks), and rehearsal strategies, where the network is constantly re-trained on past data. Generative models have also been applied for the latter, in order to have endless sources of data. In this paper, we propose a novel method that combines the strengths of regularization and generative-based rehearsal approaches. Our generative model consists of a normalizing flow (NF), a probabilistic and invertible neural network, trained on the internal embeddings of the network. By keeping a single NF throughout the training process, we show that our memory overhead remains constant. In addition, exploiting the invertibility of the NF, we propose a simple approach to regularize the network's embeddings with respect to past tasks. We show that our method performs favorably with respect to state-of-the-art approaches in the literature, with bounded computational power and memory overheads.
Subjects
Learning; Neural Networks, Computer; Machine Learning
ABSTRACT
In this article, we investigate the degree of explainability of graph neural networks (GNNs). The existing explainers work by finding global/local subgraphs to explain a prediction, but they are applied after a GNN has already been trained. Here, we propose a meta-explainer for improving the level of explainability of a GNN directly at training time, by steering the optimization procedure toward minima that allow post hoc explainers to achieve better results, without sacrificing the overall accuracy of the GNN. Our framework (called MATE, MetA-Train to Explain) jointly trains a model to solve the original task, e.g., node classification, and to provide easily processable outputs for downstream algorithms that explain the model's decisions in a human-friendly way. In particular, we meta-train the model's parameters to quickly minimize the error of an instance-level GNNExplainer trained on-the-fly on randomly sampled nodes. The final internal representation relies on a set of features that can be "better" understood by an explanation algorithm, e.g., another instance of GNNExplainer. Our model-agnostic approach can improve the explanations produced for different GNN architectures and use any instance-based explainer to drive this process. Experiments on synthetic and real-world datasets for node and graph classification show that we can produce models that are consistently easier to explain by different algorithms. Furthermore, this increase in explainability comes at no cost to the accuracy of the model.
ABSTRACT
In this paper, we propose a novel ensembling technique for deep neural networks, which is able to drastically reduce the required memory compared to alternative approaches. In particular, we propose to extract multiple sub-networks from a single, untrained neural network by solving an end-to-end optimization task combining differentiable scaling over the original architecture, with multiple regularization terms favouring the diversity of the ensemble. Since our proposal aims to detect and extract sub-structures, we call it Structured Ensemble. On a large experimental evaluation, we show that our method can achieve higher or comparable accuracy to competing methods while requiring significantly less storage. In addition, we evaluate our ensembles in terms of predictive calibration and uncertainty, showing they compare favourably with the state-of-the-art. Finally, we draw a link with the continual learning literature, and we propose a modification of our framework to handle continuous streams of tasks with a sub-linear memory cost. We compare with a number of alternative strategies to mitigate catastrophic forgetting, highlighting advantages in terms of average accuracy and memory.
Subjects
Learning; Neural Networks, Computer; Uncertainty
ABSTRACT
Variational autoencoders are deep generative models that have recently received a great deal of attention due to their ability to model the latent distribution of any kind of input such as images and audio signals, among others. A novel variational autoencoder in the quaternion domain H, namely the QVAE, has been recently proposed, leveraging the augmented second-order statistics of H-proper signals. In this paper, we analyze the QVAE from an information-theoretic perspective, studying the ability of the H-proper model to approximate improper distributions as well as the built-in H-proper ones, and the loss of entropy due to the improperness of the input signal. We conduct experiments on a substantial set of quaternion signals, for each of which the QVAE shows the ability to model the input distribution, while learning the improperness and increasing the entropy of the latent space. The proposed analysis shows that proper QVAEs can be employed with a good approximation even when the quaternion input data are improper.
ABSTRACT
In this paper, we propose a new approach to train a deep neural network with multiple intermediate auxiliary classifiers branching from it. These multi-exit models can be used to reduce the inference time by performing an early exit on the intermediate branches, if the confidence of the prediction is higher than a threshold. They rely on the assumption that not all the samples require the same amount of processing to yield a good prediction. Specifically, we propose a way to jointly train all the branches of a multi-exit model without hyper-parameters, by weighting the predictions from each branch with a trained confidence score. Each confidence score is an approximation of the real one produced by the branch, and it is calculated and regularized while training the rest of the model. We evaluate our proposal on a set of image classification benchmarks, using different neural models and early-exit stopping criteria.
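The early-exit rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the confidence of a branch is assumed here to be its maximum softmax probability, whereas the paper trains a dedicated confidence score.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def early_exit_predict(branch_logits, threshold=0.9):
    """Return (predicted_class, exit_index) for a single sample.

    branch_logits: list of 1-D logit vectors, ordered from the earliest
    auxiliary classifier to the final one (hypothetical input layout).
    Assumption: confidence = max softmax probability of the branch.
    """
    if len(branch_logits) == 0:
        raise ValueError("at least one branch is required")
    for i, logits in enumerate(branch_logits):
        probs = softmax(np.asarray(logits, dtype=float))
        # Stop at the first branch whose confidence exceeds the threshold.
        if probs.max() > threshold:
            return int(probs.argmax()), i
    # No branch was confident enough: use the last branch's prediction.
    return int(probs.argmax()), len(branch_logits) - 1
```

A lower threshold makes more samples leave at early branches, trading accuracy for inference time; a threshold close to 1 sends almost every sample to the final classifier.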
ABSTRACT
Graph convolutional networks (GCNs) are a family of neural network models that perform inference on graph data by interleaving vertexwise operations and message-passing exchanges across nodes. Concerning the latter, two key questions arise: 1) how to design a differentiable exchange protocol (e.g., a one-hop Laplacian smoothing in the original GCN) and 2) how to characterize the tradeoff in complexity with respect to the local updates. In this brief, we show that the state-of-the-art results can be achieved by adapting the number of communication steps independently at every node. In particular, we endow each node with a halting unit (inspired by Graves' adaptive computation time [1]) that after every exchange decides whether to continue communicating or not. We show that the proposed adaptive propagation GCN (AP-GCN) achieves superior or similar results to the best proposed models so far on a number of benchmarks while requiring a small overhead in terms of additional parameters. We also investigate a regularization term to enforce an explicit tradeoff between communication and accuracy. The code for the AP-GCN experiments is released as an open-source library.
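The idea of letting each node decide how many communication steps to take can be sketched as follows. This is a simplified illustration under stated assumptions, not the AP-GCN implementation: the function name, the sigmoid halting unit with parameters `q` and `b`, the budget `eps`, and the choice to freeze a node's embedding at the step where it halts (instead of a weighted average of intermediate states) are all assumptions made for brevity.

```python
import numpy as np

def adaptive_propagation(A_hat, Z, q, b=0.0, max_steps=10, eps=0.05):
    """Propagate node embeddings with a per-node number of steps.

    A_hat: (n, n) normalized adjacency matrix (assumed precomputed).
    Z:     (n, d) node embeddings produced by the local (vertexwise) update.
    q, b:  parameters of a hypothetical sigmoid halting unit.
    Returns the final embeddings and the number of steps taken per node.
    """
    n = Z.shape[0]
    H = Z.astype(float).copy()
    cum_halt = np.zeros(n)            # accumulated halting probability
    steps = np.zeros(n, dtype=int)    # communication steps used by each node
    active = np.ones(n, dtype=bool)   # nodes that are still communicating
    for _ in range(max_steps):
        if not active.any():
            break
        propagated = A_hat @ H
        # Halted nodes keep their frozen embedding but remain visible
        # to their neighbours through H.
        H[active] = propagated[active]
        steps[active] += 1
        halt_prob = 1.0 / (1.0 + np.exp(-(H @ q + b)))
        cum_halt[active] += halt_prob[active]
        # A node stops once its accumulated halting probability
        # reaches the budget 1 - eps.
        active &= cum_halt < 1.0 - eps
    return H, steps
```

In this sketch, a penalty on `steps` added to the training loss would play the role of the regularization term that trades communication against accuracy.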
ABSTRACT
Missing data imputation (MDI) is the task of replacing missing values in a dataset with alternative, predicted ones. Because of the widespread presence of missing data, it is a fundamental problem in many scientific disciplines. Popular methods for MDI use global statistics computed from the entire dataset (e.g., the feature-wise medians), or build predictive models operating independently on every instance. In this paper we propose a more general framework for MDI, leveraging recent work in the field of graph neural networks (GNNs). We formulate the MDI task in terms of a graph denoising autoencoder, where each edge of the graph encodes the similarity between two patterns. A GNN encoder learns to build intermediate representations for each example by interleaving classical projection layers and locally combining information between neighbors, while another decoding GNN learns to reconstruct the full imputed dataset from this intermediate embedding. In order to speed up training and improve the performance, we use a combination of multiple losses, including an adversarial loss implemented with the Wasserstein metric and a gradient penalty. We also explore a few extensions to the basic architecture involving the use of residual connections between layers, and of global statistics computed from the dataset to improve the accuracy. In a large experimental evaluation with varying levels of artificial noise, we show that our method is on par with or better than several alternative imputation methods. On three datasets with pre-existing missing values, we show that our method is robust to the choice of a downstream classifier, obtaining similar or slightly higher results compared to other choices.
Subjects
Neural Networks, Computer; Data Accuracy
ABSTRACT
Neural networks are generally built by interleaving (adaptable) linear layers with (fixed) nonlinear activation functions. To increase their flexibility, several authors have proposed methods for adapting the activation functions themselves, endowing them with varying degrees of flexibility. None of these approaches, however, has gained wide acceptance in practice, and research in this topic remains open. In this paper, we introduce a novel family of flexible activation functions that are based on an inexpensive kernel expansion at every neuron. Leveraging several properties of kernel-based models, we propose multiple variations for designing and initializing these kernel activation functions (KAFs), including a multidimensional scheme that nonlinearly combines information from different paths in the network. The resulting KAFs can approximate any mapping defined over a subset of the real line, either convex or non-convex. Furthermore, they are smooth over their entire domain, linear in their parameters, and they can be regularized using any known scheme, including the use of ℓ1 penalties to enforce sparseness. To the best of our knowledge, no other known model satisfies all these properties simultaneously. In addition, we provide an overview of alternative techniques for adapting the activation functions, which is currently lacking in the literature. A large set of experiments validates our proposal.
Subjects
Databases, Factual; Neural Networks, Computer; Neurons; Algorithms; Databases, Factual/statistics & numerical data; Neurons/physiology
ABSTRACT
Random vector functional-link (RVFL) networks are randomized multilayer perceptrons with a single hidden layer and a linear output layer, which can be trained by solving a linear modeling problem. In particular, they are generally trained using a closed-form solution of the (regularized) least-squares approach. This paper introduces several alternative strategies for performing full Bayesian inference (BI) of RVFL networks. Distinct from standard or classical approaches, our proposed Bayesian training algorithms allow us to derive an entire probability distribution over the optimal output weights of the network, instead of a single pointwise estimate according to some given criterion (e.g., least-squares). This provides several known advantages, including the possibility of introducing additional prior knowledge in the training process, the availability of an uncertainty measure during the test phase, and the capability of automatically inferring hyper-parameters from the given data. In this paper, two BI algorithms for regression are first proposed that, under some practical assumptions, can be implemented by a simple iterative process with closed-form computations. Simulation results show that one of the proposed algorithms, Bayesian RVFL, is able to outperform standard training algorithms for RVFL networks with a proper regularization factor selected carefully via a line search procedure. A general strategy based on variational inference is also presented, with an application to data modeling problems with noisy outputs or outliers. As we discuss in this paper, using recent advances in automatic differentiation, this strategy can be applied to a wide range of additional situations in an immediate fashion.
ABSTRACT
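For reference, the standard (non-Bayesian) training that this abstract takes as its baseline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the `tanh` nonlinearity, the uniform initialization range and the default hyper-parameters are all hypothetical, and the direct input-to-output links used in some RVFL variants are omitted.

```python
import numpy as np

def rvfl_fit(X, y, n_hidden=50, reg=1e-3, seed=0):
    """Fit the output weights of a randomized single-hidden-layer network.

    The hidden weights are drawn at random and kept fixed; only the linear
    output layer is trained, via regularized least squares in closed form.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # fixed random weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # fixed random biases
    H = np.tanh(X @ W + b)                                   # hidden expansion
    # Ridge solution: beta = (H^T H + reg * I)^{-1} H^T y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    """Apply the fixed random expansion followed by the trained linear layer."""
    return np.tanh(X @ W + b) @ beta
```

The result is a single point estimate `beta`, and `reg` has to be chosen externally (for example by a line search); the Bayesian treatment described above instead yields a distribution over the output weights and can infer such hyper-parameters from the data.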
Distributed learning refers to the problem of inferring a function when the training data are distributed among different nodes. While significant work has been done in the contexts of supervised and unsupervised learning, the intermediate case of semi-supervised learning in the distributed setting has received less attention. In this paper, we propose an algorithm for this class of problems, by extending the framework of manifold regularization. The main component of the proposed algorithm consists of a fully distributed computation of the adjacency matrix of the training patterns. To this end, we propose a novel algorithm for low-rank distributed matrix completion, based on the framework of diffusion adaptation. Overall, the distributed semi-supervised algorithm is efficient and scalable, and it can preserve privacy by the inclusion of flexible privacy-preserving mechanisms for similarity computation. The experimental results and comparison on a wide range of standard semi-supervised benchmarks validate our proposal.
ABSTRACT
The semi-supervised support vector machine (S³VM) is a well-known algorithm for performing semi-supervised inference under the large margin principle. In this paper, we are interested in the problem of training an S³VM when the labeled and unlabeled samples are distributed over a network of interconnected agents. In particular, the aim is to design a distributed training protocol over networks, where communication is restricted only to neighboring agents and no coordinating authority is present. Using a standard relaxation of the original S³VM, we formulate the training problem as the distributed minimization of a non-convex social cost function. To find a (stationary) solution in a distributed manner, we employ two different strategies: (i) a distributed gradient descent algorithm; (ii) a recently developed framework for In-Network Nonconvex Optimization (NEXT), which is based on successive convexifications of the original problem, interleaved with state diffusion steps. Our experimental results show that the proposed distributed algorithms have comparable performance with respect to a centralized implementation, while highlighting the pros and cons of the proposed solutions. To date, this is the first work that paves the way toward the broad field of distributed semi-supervised learning over networks.
Subjects
Supervised Machine Learning; Support Vector Machine; Algorithms
ABSTRACT
We approach the problem of forecasting the load of incoming calls in a cell of a mobile network using Echo State Networks. With respect to previous approaches to the problem, we consider the inclusion of additional telephone records regarding the activity registered in the cell as exogenous variables, and we investigate their usefulness in the forecasting task. Additionally, we analyze different methodologies for training the readout of the network, including two novel variants, namely ν-SVR and an elastic net penalty. Finally, we employ a genetic algorithm both for tuning the parameters of the system and for selecting the optimal subset of the most informative additional time series to be used as external inputs in the forecasting problem. We compare the performance with standard prediction models and we evaluate the results according to the specific properties of the considered time series.
Subjects
Cell Phone/statistics & numerical data; Neural Networks, Computer; Algorithms; Computer Communication Networks; Forecasting; Machine Learning; Models, Theoretical
ABSTRACT
The functional link adaptive filter (FLAF) represents an effective solution for online nonlinear modeling problems. In this paper, we take into account a FLAF-based architecture, which separates the adaptation of linear and nonlinear elements, and we focus on the nonlinear branch to improve the modeling performance. In particular, we propose a new model that involves an adaptive combination of filters downstream of the nonlinear expansion. Such combination leads to a cooperative behavior of the whole architecture, thus yielding a performance improvement, particularly in the presence of strong nonlinearities. An advanced architecture is also proposed involving the adaptive combination of multiple filters on the nonlinear branch. The proposed models are assessed in different nonlinear modeling problems, in which their effectiveness and capabilities are shown.
Subjects
Neural Networks, Computer; Nonlinear Dynamics; Algorithms; Humans; Models, Theoretical
ABSTRACT
The extreme learning machine (ELM) was recently proposed as a unifying framework for different families of learning algorithms. The classical ELM model consists of a linear combination of a fixed number of nonlinear expansions of the input vector. Learning in ELM is hence equivalent to finding the optimal weights that minimize the error on a dataset. The update works in batch mode, either with explicit feature mappings or with implicit mappings defined by kernels. Although an online version has been proposed for the former, no work has been done up to this point for the latter, and whether an efficient learning algorithm for online kernel-based ELM exists remains an open problem. By explicating some connections between nonlinear adaptive filtering and ELM theory, in this brief, we present an algorithm for this task. In particular, we propose a straightforward extension of the well-known kernel recursive least-squares, belonging to the kernel adaptive filtering (KAF) family, to the ELM framework. We call the resulting algorithm the kernel online sequential ELM (KOS-ELM). Moreover, we consider two different criteria used in the KAF field to obtain sparse filters and extend them to our context. We show that KOS-ELM, with their integration, can result in a highly efficient algorithm, both in terms of obtained generalization error and training time. Empirical evaluations demonstrate interesting results on some benchmarking datasets.
ABSTRACT
This paper introduces an Independent Component Analysis (ICA) approach to the separation of nonlinear mixtures in the complex domain. Source separation is performed by a complex INFOMAX approach. The neural network which realizes the separation employs the so-called "Mirror Model" and is based on adaptive activation functions, whose shape is properly modified during learning. Nonlinear functions involved in the processing of complex signals are realized by pairs of spline neurons called "splitting functions", working on the real and imaginary parts of the signal, respectively. A theoretical proof of existence and uniqueness of the solution under proper assumptions is also provided. In particular, a simple adaptation algorithm is derived and some experimental results that demonstrate the effectiveness of the proposed solution are shown.
Subjects
Neural Networks, Computer; Nonlinear Dynamics; Signal Processing, Computer-Assisted; Algorithms
ABSTRACT
This paper presents a new general neural structure based on a nonlinear flexible multivariate function that can be viewed in the framework of the generalised regularisation networks theory. The proposed architecture is based on a multi-dimensional adaptive cubic spline basis activation function that collects information from the previous network layer in aggregate form. In other words, each activation function represents a spline function of a subset of the previous layer's outputs, so the number of network connections (structural complexity) can be very low with respect to the problem complexity. A specific learning algorithm, based on the adaptation of local parameters of the activation function, is derived. This improves the network's generalisation capabilities and speeds up the convergence of the learning process. Finally, some experimental results demonstrating the effectiveness of the proposed architecture are presented.