ABSTRACT
Semi-supervised deep clustering methods attract much attention due to their excellent performance on the end-to-end clustering task. However, satisfactory clustering results are hard to obtain because the many overlapping samples in industrial text datasets strongly and incorrectly influence the learning process. Existing methods incorporate prior knowledge in the form of pairwise constraints or class labels, but they largely ignore the correlation between these two kinds of supervision and suffer from either weakly supervised constraints or incorrect strongly supervised label guidance. To tackle these problems, we propose a semi-supervised method based on pairwise constraints and subset allocation (PCSA-DEC). We redefine the similarity-based constraint loss by forcing the similarity of samples in the same class to be much higher than that of other samples, and we design a novel subset allocation loss to precisely learn the strongly supervised information contained in labels that is consistent with the unlabeled data. Experimental results on two industrial text datasets show that our method yields an 8.2%-8.7% improvement in accuracy and a 13.4%-19.8% improvement in normalized mutual information over the state-of-the-art method.
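The idea of a similarity-based constraint loss can be illustrated with a minimal sketch. This is not the PCSA-DEC loss itself, whose exact form the abstract does not give. It assumes that each sample has a soft cluster-assignment vector, that the similarity of two samples is the dot product of those vectors, and that constrained pairs are scored with a cross-entropy. The function name `pairwise_constraint_loss` and its parameters are hypothetical.

```python
import math


def dot(p, q):
    """Dot product of two equally sized probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def pairwise_constraint_loss(assignments, must_link, cannot_link, eps=1e-12):
    """Average cross-entropy over constrained pairs (illustrative assumption).

    assignments: list of soft cluster-assignment vectors, each summing to 1.
    must_link / cannot_link: lists of (i, j) index pairs.
    The similarity s_ij = <p_i, p_j> is pushed toward 1 for must-link pairs
    and toward 0 for cannot-link pairs.
    """
    total, count = 0.0, 0
    for i, j in must_link:
        s = dot(assignments[i], assignments[j])
        total += -math.log(max(s, eps))        # penalize low similarity
        count += 1
    for i, j in cannot_link:
        s = dot(assignments[i], assignments[j])
        total += -math.log(max(1.0 - s, eps))  # penalize high similarity
        count += 1
    return total / count if count else 0.0
```

With assignments that already agree with the constraints the loss is small, and it grows when a must-link pair is split across clusters or a cannot-link pair shares one.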
Subjects
Information Management; Learning; Cluster Analysis
ABSTRACT
In many real applications of semi-supervised learning, the guidance provided by a human oracle might be "noisy" or inaccurate. Human annotators are often imperfect: they can make subjective decisions, they might have only partial knowledge of the task at hand, or they may simply complete a labeling task incorrectly due to the burden of annotation. Similarly, in semi-supervised community finding in complex networks, information encoded as pairwise constraints may be unreliable or conflicting due to the human element in the annotation process. This study addresses the challenge of handling noisy pairwise constraints in overlapping semi-supervised community detection by framing the task as an outlier detection problem. We propose a general architecture that includes a process to "clean" or filter noisy constraints. Furthermore, we introduce multiple designs for the cleaning process that use different types of outlier detection models, including autoencoders. A comprehensive evaluation is conducted for each proposed methodology, demonstrating the potential of the proposed architecture for reducing the impact of noisy supervision in overlapping community detection.
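Framing constraint cleaning as outlier detection can be sketched as follows. This does not reproduce the study's models (which include autoencoders). As a simple stand-in, it assumes that each must-link pair is scored by the Jaccard overlap of the two nodes' closed neighborhoods and that pairs with an anomalously low z-score are discarded. The function name `filter_must_links`, the scoring rule, and the default threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_must_links(adjacency, must_link, threshold=1.5):
    """Keep must-link pairs whose neighborhood overlap is not an outlier.

    adjacency: dict mapping each node to the set of its neighbors.
    must_link: list of (u, v) pairs supplied by a possibly noisy annotator.
    A pair is dropped when its score lies more than `threshold` standard
    deviations below the mean score of all pairs.
    """
    if not must_link:
        return []
    scores = [
        jaccard(adjacency[u] | {u}, adjacency[v] | {v}) for u, v in must_link
    ]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return list(must_link)  # all pairs look alike; nothing to flag
    return [
        pair
        for pair, s in zip(must_link, scores)
        if (s - mu) / sigma > -threshold
    ]
```

A learned detector would replace the hand-written score, but the surrounding flow stays the same: score every constraint, flag the outliers, and pass only the remaining constraints to the community detection step.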
ABSTRACT
In this paper, we introduce a neural network framework for semi-supervised clustering with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose semi-supervised clustering into two simpler classification tasks: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. Moreover, being classification-based, our method solves only well-defined classification problems rather than less well-specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method.
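The two-stage decomposition can be sketched under strong simplifying assumptions. Here the first stage is any callable that decides whether two points are must-link (a stand-in for the Siamese networks), and the second stage is replaced by connected components over the predicted must-link pairs rather than the supervised neural clustering method the abstract describes. The names `two_stage_cluster` and `predict_must_link` are hypothetical.

```python
from itertools import combinations


def two_stage_cluster(points, predict_must_link):
    """Label all pairs, then group points linked by must-link predictions.

    points: list of samples.
    predict_must_link: callable(a, b) -> bool, a stand-in for the trained
        pairwise classifier of stage one.
    Returns one integer cluster label per point, numbered by first appearance.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Stage 1: label every unordered pair as must-link or cannot-link.
    for i, j in combinations(range(n), 2):
        if predict_must_link(points[i], points[j]):
            parent[find(i)] = find(j)

    # Stage 2 (simplified): each connected component becomes a cluster.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Connected components treat every predicted must-link as certain, so one wrong prediction can merge two clusters; a learned second stage, as in the abstract, can weigh conflicting pairwise labels instead.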
Subjects
Neural Networks, Computer; Supervised Machine Learning; Cluster Analysis; Databases, Factual/trends; Supervised Machine Learning/trends
ABSTRACT
Although the support vector machine (SVM) has become a powerful tool for pattern classification and regression, a major disadvantage is that it does not fully exploit the underlying correlation between pairs of data points. Inspired by the modified pairwise constraints trick, in this paper we propose a novel classifier, termed support vector machine with hypergraph-based pairwise constraints, to improve the performance of the classical SVM by introducing a new regularization term with hypergraph-based pairwise constraints (HPC). The new classifier is expected not only to learn the structural information of each point itself, but also to acquire prior distribution knowledge about each constrained pair by combining the discrimination metric with hypergraph learning. The three major contributions of this paper can be summarized as follows: (1) acquiring the high-order relationships between different samples by hypergraph learning; (2) presenting a more reasonable discriminative regularization term by combining the discrimination metric with hypergraph learning; (3) improving the performance of the existing SVM classifier by introducing the HPC regularization term. Comprehensive experimental results on twenty-five datasets demonstrate the validity and advantage of our approach.
ABSTRACT
Concept factorization (CF) is a variant of non-negative matrix factorization (NMF). In CF, each concept is represented by a linear combination of data points, and each data point is represented by a linear combination of concepts. More specifically, each concept is represented by more than one data point with different weights, and each data point carries weights, called memberships, that represent its degree of belonging to each concept. However, CF is an unsupervised method that makes no use of prior information about the data. In this paper, we propose a novel semi-supervised concept factorization method, called Pairwise Constrained Concept Factorization (PCCF), which incorporates pairwise constraints into the CF framework. We expect that data points with pairwise must-link constraints should have the same class label as much as possible, while data points with pairwise cannot-link constraints should have different class labels as much as possible. Owing to the incorporation of the pairwise constraints, the learning quality of CF is significantly enhanced. Experimental results show the effectiveness of the proposed method in comparison to state-of-the-art algorithms on several real-world applications.
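For reference, the two linear combinations described above correspond to the standard unsupervised CF objective, written here in commonly used notation that the abstract itself does not introduce: X is the data matrix with one data point per column, W holds the weights that build concepts from data points, and V holds the memberships of data points in concepts. The pairwise-constraint term that PCCF adds is not specified in the abstract and is therefore not shown.

```latex
% Concepts are combinations of data points:  r_c = \sum_j w_{jc} \, x_j
% Data points are combinations of concepts:  x_i \approx \sum_c v_{ic} \, r_c
\min_{W \ge 0,\; V \ge 0} \; \bigl\lVert X - X W V^{\top} \bigr\rVert_F^{2},
\qquad X \in \mathbb{R}^{d \times n},\quad W, V \in \mathbb{R}^{n \times k}
```

Under this reading, the must-link and cannot-link expectations stated in the abstract act on the rows of V, since row i of V contains the memberships of data point i.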