ABSTRACT
Due to the difficulty of collecting labeled images for hundreds of thousands of visual categories, zero-shot learning, in which unseen categories have no labeled images during the training stage, has attracted increasing attention. Many previous studies focused on transferring knowledge from seen to unseen categories by projecting all category labels into a semantic space. However, label embeddings cannot adequately express the semantics of categories. Furthermore, the common semantics of seen and unseen instances cannot be captured accurately, because the distributions of these instances may be quite different. To address these issues, we propose a novel deep semisupervised method that jointly considers the heterogeneity gap between different modalities and the correlation among unimodal instances. The method replaces the original labels with the corresponding textual descriptions to better capture category semantics, and it overcomes the distribution difference by minimizing the maximum mean discrepancy between seen and unseen instance distributions. Extensive experimental results on two benchmark data sets, CUB200-Birds and Oxford Flowers-102, indicate that our method achieves significant improvements over previous methods.
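The core statistic in the abstract above, the maximum mean discrepancy, can be sketched in a few lines of numpy. This is a generic biased RBF-kernel estimator, not the authors' implementation, and the bandwidth `gamma` is an illustrative choice:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy (MMD^2)
    between samples X and Y under an RBF kernel k(a,b) = exp(-gamma*||a-b||^2)."""
    def kernel(A, B):
        # Pairwise squared distances via ||a-b||^2 = a.a + b.b - 2 a.b
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()
```

Two samples drawn from the same distribution give a value near zero, while a shifted sample gives a clearly larger value, which is why minimizing this quantity pulls the seen and unseen instance distributions together.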
ABSTRACT
Robust principal component analysis (PCA) is one of the most important dimension-reduction techniques for handling high-dimensional data with outliers. However, most existing robust PCA methods presuppose that the mean of the data is zero and incorrectly use the average of the data as the optimal mean of robust PCA. In fact, this assumption holds only for the squared [Formula: see text]-norm-based traditional PCA. In this letter, we equivalently reformulate the objective of conventional PCA and learn the optimal projection directions by maximizing the sum of the projected differences between each pair of instances under the [Formula: see text]-norm. The proposed method is robust to outliers and invariant to rotation. More importantly, the reformulated objective not only automatically avoids the calculation of the optimal mean, making the assumption of centered data unnecessary, but also theoretically connects to the minimization of reconstruction error. To solve the proposed nonsmooth problem, we exploit an efficient optimization algorithm that softens the contributions from outliers by reweighting each data point iteratively. We theoretically analyze the convergence and computational complexity of the proposed algorithm. Extensive experimental results on several benchmark data sets illustrate the effectiveness and superiority of the proposed method.
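The reformulated objective can be illustrated with a small numpy sketch. This is illustrative only (the abstract's formula placeholders are unrecoverable, so an l1-type exponent on the projected differences is assumed); note how the pairwise differences make data centering unnecessary:

```python
import numpy as np

def projected_diff_l1(X, w):
    """Sum of absolute projected differences over all instance pairs,
    sum_{i<j} |w^T (x_i - x_j)|.  No centering of X is required:
    the pairwise differences cancel any constant shift of the data."""
    w = w / np.linalg.norm(w)        # unit projection direction
    p = X @ w                        # project every instance onto w
    diffs = p[:, None] - p[None, :]  # all pairwise projected differences
    return np.abs(np.triu(diffs, k=1)).sum()
```

Because shifting every instance by the same vector leaves the objective unchanged, no "optimal mean" ever needs to be estimated, which is the point the abstract makes.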
ABSTRACT
Few-shot learning (FSL) poses a significant challenge in classifying unseen classes with limited samples, primarily stemming from the scarcity of data. Although numerous generative approaches have been investigated for FSL, their generation process often results in entangled outputs, exacerbating the distribution shift inherent in FSL and considerably hampering the overall quality of the generated samples. To address this concern, we present a pioneering framework called DisGenIB, which leverages an Information Bottleneck (IB) approach for Disentangled Generation. Our framework simultaneously ensures both discrimination and diversity in the generated samples. Specifically, we introduce a groundbreaking information-theoretic objective that unifies disentangled representation learning and sample generation within a novel framework. In contrast to previous IB-based methods that struggle to leverage priors, our proposed DisGenIB effectively incorporates priors as invariant domain knowledge of sub-features, thereby enhancing disentanglement. This enables us to exploit priors to their full potential and facilitates the overall disentanglement process. Moreover, we establish the theoretical foundation that reveals certain prior generative and disentanglement methods as special instances of our DisGenIB, underscoring the versatility of the proposed framework. To solidify our claims, we conduct comprehensive experiments on demanding FSL benchmarks, affirming the remarkable efficacy and superiority of DisGenIB. Furthermore, the validity of our theoretical analyses is substantiated by the experimental results. Our code is available at https://github.com/eric-hang/DisGenIB.
ABSTRACT
Weakly supervised person search involves training a model with only bounding box annotations, without human-annotated identities. Clustering algorithms are commonly used to assign pseudo-labels to facilitate this task. However, inaccurate pseudo-labels and imbalanced identity distributions can result in severe label and sample noise. In this work, we propose a novel Collaborative Contrastive Refining (CCR) weakly supervised framework for person search that jointly refines pseudo-labels and the sample-learning process with different contrastive strategies. Specifically, we adopt a hybrid contrastive strategy that leverages both visual and context clues to refine pseudo-labels, and we leverage a sample-mining and noise-contrastive strategy to reduce the negative impact of imbalanced distributions by distinguishing positive samples from noise samples. Our method brings two main advantages: 1) it facilitates better clustering results for refining pseudo-labels by exploring the hybrid similarity; 2) it better distinguishes query samples from noise samples when refining the sample-learning process. Extensive experiments demonstrate the superiority of our approach over state-of-the-art weakly supervised methods by a large margin (more than 3% mAP on CUHK-SYSU). Moreover, by leveraging more diverse unlabeled data, our method achieves comparable or even better performance than state-of-the-art supervised methods.
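The abstract does not spell out CCR's losses. As a hedged illustration, noise-contrastive components in this line of work typically build on an InfoNCE-style objective of the following generic form; the names `bank`, `positives_idx`, and the temperature value are hypothetical:

```python
import numpy as np

def info_nce(query, positives_idx, bank, temperature=0.07):
    """Generic InfoNCE-style contrastive loss: pull an L2-normalized query
    embedding toward its positive entries in a memory bank of L2-normalized
    features and push it away from all other (negative/noise) entries."""
    logits = bank @ query / temperature                 # cosine similarities / T
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax over bank
    return -log_probs[positives_idx].mean()             # NLL of the positives
```

Declaring a noisy entry a negative (or dropping it from `positives_idx`) is one way such a loss can be made to discount samples that clustering assigned to the wrong identity.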
ABSTRACT
The rich content in various real-world networks, such as social networks, biological networks, and communication networks, provides unprecedented opportunities for unsupervised machine learning on graphs. This paper investigates the fundamental problem of preserving and extracting abundant information from graph-structured data into an embedding space without external supervision. To this end, we generalize conventional mutual information computation from vector space to the graph domain and present a novel concept, Graphical Mutual Information (GMI), to measure the correlation between the input graph and the hidden representation. Beyond standard GMI, which considers graph structures from a local perspective, our further proposed GMI++ additionally captures global topological properties by analyzing the co-occurrence relationships of nodes. GMI and its extension exhibit several benefits. First, they are invariant to isomorphic transformations of the input graph, a constraint that is unavoidable in many existing methods; second, they can be efficiently estimated and maximized by current mutual information estimation methods; finally, our theoretical analysis confirms their correctness and rationality. With the aid of GMI, we develop an unsupervised embedding model and adapt it to the specific anomaly detection task. Extensive experiments indicate that our GMI methods achieve promising performance in various downstream tasks, such as node classification, link prediction, and anomaly detection.
ABSTRACT
Temporal activity detection, which aims to simultaneously recognize and localize activities in long untrimmed videos, is an integral part of video analysis and surveillance. Currently, the most effective temporal activity detection methods are based on deep learning, and they typically perform very well when large-scale annotated videos are available for training. However, these methods are limited in real applications due to the unavailability of videos for certain activity classes and the time-consuming data annotation. To address this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), in which activities that have never been seen in training still need to be detected. We design an end-to-end deep transferable network, TN-ZSTAD, as the architecture for this solution. On the one hand, this network utilizes an activity graph transformer to predict a set of activity instances that appear in the video, rather than producing many activity proposals in advance. On the other hand, it captures the common semantics of seen and unseen activities from their corresponding label embeddings and is optimized with an innovative loss function that jointly considers the classification property on seen activities and the transfer property on unseen activities. Experiments on the THUMOS'14, Charades, and ActivityNet datasets show promising performance in terms of detecting unseen activities.
ABSTRACT
Zero-shot object detection (ZSD), the task of extending conventional detection models to detect objects from unseen categories, has emerged as a new challenge in computer vision. Most existing approaches to ZSD follow a strict mapping-transfer strategy that learns a mapping function from visual to semantic space over seen categories and then directly generalizes the learned mapping function to unseen object detection. However, the ZSD task remains challenging, since those works fail to consider two key factors that hamper ZSD performance: (a) the domain shift between seen and unseen classes leads to poor transferability of the model; (b) the original visual feature space is suboptimal for ZSD because it lacks discriminative information. To alleviate these issues, we develop a novel Semantics-Guided Contrastive Network for ZSD (ContrastZSD), a detection framework that first brings the contrastive learning paradigm into the realm of ZSD. The pairwise contrastive tasks take advantage of class labels and semantic relations as additional supervision signals. Under the guidance of this explicit semantic supervision, the model can learn more about unseen categories and avoid over-fitting to the seen concepts.
ABSTRACT
Most existing object detection models are restricted to detecting objects from previously seen categories, an approach that tends to become infeasible for rare or novel concepts. Accordingly, in this paper, we explore object detection in the context of zero-shot learning, i.e., Zero-Shot Object Detection (ZSD), to concurrently recognize and localize objects from novel concepts. Existing ZSD algorithms are typically based on a simple mapping-transfer strategy that is susceptible to the domain shift problem. To resolve this problem, we propose a novel Semantics-Preserving Graph Propagation model for ZSD based on Graph Convolutional Networks (GCN). More specifically, we employ a graph construction module to flexibly build category graphs by incorporating diverse correlations between category nodes; this is followed by two semantics preserving modules that enhance both category and region representations through a multi-step graph propagation process. Compared to existing mapping-transfer based methods, both the semantic description and semantic structural knowledge exhibited in prior category graphs can be effectively leveraged to boost the generalization capability of the learned projection function via knowledge transfer, thereby providing a solution to the domain shift problem. Experiments on existing seen/unseen splits of three popular object detection datasets demonstrate that the proposed approach performs favorably against state-of-the-art ZSD methods.
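The multi-step graph propagation described above can be sketched as repeated applications of a standard GCN layer; this is a generic sketch of the propagation rule, not the paper's exact architecture:

```python
import numpy as np

def gcn_propagate(A, H, W):
    """One step of GCN-style propagation, H' = ReLU(D^-1/2 (A+I) D^-1/2 H W):
    each category/region node aggregates features from its graph neighbors
    (and itself, via the added self-loops) before a linear transform."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Stacking this operation for several steps lets semantic information flow between related category nodes, which is how structural knowledge in a prior category graph can reach the learned representations.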
ABSTRACT
Cloud-based media streaming is a promising paradigm for multimedia applications. It is attractive to media streaming service providers, who wish to deploy their media server clusters in a media cloud at reduced cost. Since real-time live streaming is both a bandwidth-intensive and quality-sensitive application, how to optimize the internal bandwidth utilization of a data center network (DCN) while guaranteeing the external bandwidth of the real-time live streaming application is a key issue in deploying a virtual machine (VM)-hosted media server cluster in a media cloud. Therefore, in this study, we propose an external-bandwidth-guaranteed media server cluster deployment scheme for a media cloud. The approach simultaneously considers the external bandwidth requirement of a tree-based media server cluster for live streaming and the intra-bandwidth consumption of the DCN. The proposed scheme models the optimization problem as a new terminal-Steiner-tree-like problem and provides an approximation algorithm for placing the media servers. Our evaluation results show that the proposed scheme guarantees the external bandwidth requirement of a real-time live streaming application while greatly reducing the intra-bandwidth utilization of a media cloud under different DCN structures.
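The abstract's approximation algorithm is not specified beyond its terminal-Steiner-tree-like formulation. Under that assumption, the classic MST-based Steiner heuristic below (metric closure over the terminals, then a minimum spanning tree) sketches the general idea; it is a textbook 2-approximation, not the paper's algorithm:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src in a weighted graph (adjacency dict)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def steiner_mst_cost(adj, terminals):
    """MST-based Steiner-tree heuristic: build the metric closure over the
    terminals, then take its minimum spanning tree (Prim).  Returns the MST
    cost; expanding each MST edge back into its shortest path in the original
    graph yields the actual approximate Steiner tree."""
    closure = {t: dijkstra(adj, t) for t in terminals}
    in_tree, cost = {terminals[0]}, 0.0
    while len(in_tree) < len(terminals):
        w, t = min((closure[u][v], v) for u in in_tree
                   for v in terminals if v not in in_tree)
        in_tree.add(t)
        cost += w
    return cost
```

On a star graph with three terminal leaves and unit edges, the heuristic returns cost 4 while the optimal Steiner tree (routing through the hub) costs 3, which illustrates the approximation gap such placement schemes accept in exchange for tractability.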
Subjects
Information Storage and Retrieval; Cloud Computing; Communications Media; Video Recording
ABSTRACT
Semisupervised learning aims to leverage both labeled and unlabeled data to improve performance, and most semisupervised methods are graph-based. However, graph-based semisupervised methods do not scale to large data sets, since the computational cost of constructing the graph Laplacian matrix is prohibitive. On the other hand, the substantial amount of unlabeled data in the training stage of semisupervised learning can introduce large uncertainties and potential threats. It is therefore crucial to enhance the robustness of semisupervised classification. In this paper, a novel large-scale robust semisupervised learning method is proposed in the framework of the capped l2,p-norm. This strategy is superior not only in computational cost, because it makes the graph Laplacian matrix unnecessary, but also in robustness to outliers, since the capped l2,p-norm is used for loss measurement. An efficient optimization algorithm is exploited to solve the nonconvex and nonsmooth problem. The complexity of the proposed algorithm is analyzed and discussed in detail theoretically. Finally, extensive experiments are conducted on six benchmark data sets to demonstrate the effectiveness and superiority of the proposed method.
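The robustness mechanism in the abstract above can be sketched directly: a capped l2,p loss clips each instance's contribution at a threshold, so an arbitrarily large residual (an outlier) adds at most a constant. A minimal numpy sketch, with `p` and the cap `eps` as illustrative parameters:

```python
import numpy as np

def capped_l2p_loss(residuals, p=1.0, eps=1.0):
    """Capped l2,p loss: sum_i min(||r_i||_2^p, eps), where each row of
    `residuals` is one instance's residual vector.  Rows whose residual norm
    exceeds the cap eps contribute the constant eps, so outliers cannot
    dominate the objective."""
    norms = np.linalg.norm(residuals, axis=1)
    return np.minimum(norms**p, eps).sum()
```

Compare this with an uncapped l2,p loss, where a single corrupted instance with a huge residual would swamp the contributions of all clean instances.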
ABSTRACT
Spectral clustering plays a significant role in applications that rely on multi-view data, owing to its well-defined mathematical framework and excellent performance on arbitrarily shaped clusters. Unfortunately, directly optimizing the spectral clustering objective is NP-hard because of the discrete constraints on the clustering labels. Hence, conventional approaches resort to a relax-and-discretize strategy to approximate the original solution. However, nothing in this strategy prevents information loss between the two stages, and this uncertainty is aggravated when heterogeneous feature fusion has to be included in multi-view spectral clustering. In this paper, we avoid the NP-hard optimization problem and develop a general framework for multi-view discrete graph clustering that directly learns a consensus partition across multiple views, instead of using the relax-and-discretize strategy. An effective reweighting optimization algorithm is exploited to solve the proposed challenging problem. Further, we provide a theoretical analysis of the model's convergence properties and the computational complexity of the proposed algorithm. Extensive experiments on several benchmark datasets verify the effectiveness and superiority of the proposed algorithm on clustering and image segmentation tasks.
ABSTRACT
Feature selection is one of the most important dimension-reduction techniques owing to its efficiency and interpretability. Since practical large-scale data are usually collected without labels, and labeling them is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multiconnected-component (multicluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and selection matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternating optimization algorithm to solve the proposed challenging problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive experiments on clustering tasks are conducted over several benchmark data sets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.
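The rank constraint mentioned above relies on a classical spectral fact: the multiplicity of the zero eigenvalue of the graph Laplacian equals the number of connected components, so constraining rank(L) = n - c forces the learned graph into exactly c clusters. A small numpy sketch for dense affinity matrices:

```python
import numpy as np

def n_connected_components(W, tol=1e-8):
    """Count connected components of the graph with symmetric affinity
    matrix W via the multiplicity of the zero eigenvalue of the Laplacian
    L = D - W (eigenvalues below `tol` are treated as zero)."""
    L = np.diag(W.sum(1)) - W
    eigvals = np.linalg.eigvalsh(L)
    return int((eigvals < tol).sum())
```

For example, a block-diagonal affinity matrix with two disjoint blocks yields exactly two zero eigenvalues, i.e., two clusters.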
ABSTRACT
Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus, semisupervised feature selection has gained increasing attention for its efficiency and comprehensibility. Most previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter-tuning problem underlying the construction of the graph, the affinity measurement in the original feature space usually suffers from the curse of dimensionality. Additionally, the predetermined graph is separated from the procedure of feature selection, which might lead to downgraded performance for video semantic recognition. In this paper, we exploit a novel semisupervised feature selection method from a new perspective. The primary assumption underlying our model is that instances with similar labels should have a larger probability of being neighbors. Instead of using a predetermined similarity graph, we incorporate the exploration of the local structure into the procedure of joint feature selection, so as to learn the optimal graph simultaneously. Moreover, an adaptive loss function is exploited to measure the label fitness, which significantly enhances the model's robustness to videos with either a small or a substantial loss. We propose an efficient alternating optimization algorithm to solve the proposed challenging problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive experimental results on benchmark datasets illustrate the effectiveness and superiority of the proposed approach on video semantic recognition tasks.
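The abstract does not give the adaptive loss in closed form. One robust loss with the described behavior, quadratic (l2-like) for small residuals and linear (l1-like) for large ones, is sketched below as a hypothetical stand-in; the crossover parameter `sigma` is an assumption:

```python
import numpy as np

def adaptive_loss(residual_norms, sigma=1.0):
    """Adaptive robust loss of the form (1 + sigma) * x^2 / (x + sigma) for
    residual norm x >= 0: approximately quadratic for x << sigma and
    approximately linear for x >> sigma, so neither tiny nor huge residuals
    dominate the label-fitness term."""
    x = np.asarray(residual_norms, dtype=float)
    return (1.0 + sigma) * x**2 / (x + sigma)
```

For large residuals the loss grows only linearly (about 2x when sigma = 1) rather than quadratically, which is what tempers the influence of badly fit videos.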